Secret IBM script could have prevented 11-hour US tax day outage
Two chances missed to swerve mainframe drive array bug
The April 2018 US tax day outage was due to a faulty IBM disk array and could have been avoided twice – first with a more up-to-date microcode bundle, and second with a secret IBM script.
Online tax filing was held up for 11 hours on the last filing day of the 2018 tax year, April 17, and the IRS had to extend the filing period by another day.
The tax filing system is mainframe-based and uses several high-availability disk arrays, with Unisys as the primary contractor, and IBM a secondary one, under an Enterprise Storage Services (ESS) contract.
According to a US government report out this month, one of these arrays suffered a deadlock condition after a "warm start" (aka warm boot) due to a cache overflow. It alerted the IRS admin staff at 0224 Eastern Standard Time (EST) and sent a call-home alert message to IBM at 0257 EST on April 17.
Amazingly, it was classed as a Severity Level 3 alert, which calls for a response only by the end of the next business day.
More IRS systems were affected by 0330 EST, and a growing tidal wave of affected systems hit the IRS – with 59 systems screwed by 0745 EST, and a "major outage" declared by 0945 EST. A remediating script was developed by 1340 EST, limited tax return filing was started at 1500 and full filing resumed at 1700.
The root-cause firmware bug was actually discovered by IBM nine months earlier, in June 2017, and a microcode fix, Microcode Bundle 18.104.22.168, was released to the public on 7 November 2017.
Why didn't the IRS patch?
No one comes out of this report looking good.
IRS techies at its Information Technology Organization hold meetings every month with primary contractor Unisys and IBM to discuss current microcode bundles for the IRS mainframes. But, according to the report, Unisys recommended that 22.214.171.124 not be applied during the 2018 tax year filing period because it had not been tested enough.
Not without reason, Unisys apparently had an "informal" policy that required a bundle to have had "450 machine weeks* in a production environment" prior to installation on IRS equipment.
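As a rough illustration of how such a threshold works, here is a minimal Python sketch that applies the footnoted definition (weeks in production multiplied by the number of machines running the code). The function names and the sample figures are hypothetical and do not come from the report; only the 450 figure is quoted there.

```python
# Hypothetical helper names; only the 450 threshold is taken from the report.
REQUIRED_MACHINE_WEEKS = 450


def machine_weeks(weeks_in_production: int, machines: int) -> int:
    """Weeks the code has been running multiplied by the number of boxes it is installed on."""
    return weeks_in_production * machines


def meets_policy(weeks_in_production: int, machines: int,
                 required: int = REQUIRED_MACHINE_WEEKS) -> bool:
    """True if the bundle has accumulated at least the required machine weeks."""
    return machine_weeks(weeks_in_production, machines) >= required


# Illustrative numbers only: a bundle that has run for 10 weeks on 30 machines.
print(machine_weeks(10, 30))  # 300
print(meets_policy(10, 30))   # False
print(meets_policy(10, 45))   # True
```

On these made-up numbers, a bundle can be months old and still fall short if few customers have installed it, which is one way a fix released in November could still be judged too fresh to apply.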
The IRS agreed to use an earlier bundle, 126.96.36.199, which was considered "more stable".
However, one month after that meeting, in January 2018, and four months before the IRS outage, another IBM customer experienced the same bug. IBM developed and deployed a preventative script which fixed it. But Big Blue told neither the IRS nor Unisys about this.
Single point of failure
The report also touched on a few other points which make some IRS and contractor IT decisions seem inadequate. Firstly, the IRS tax filing system, classed as a Tier 1 storage environment, had no automatic failover or built-in redundancy and was a single point of failure. This is now being fixed.
Secondly, the contractor (Unisys) failed to meet several service level objectives (SLOs) on the outage day.
The report recommended that the IRS formalise the monthly microcode bundle meetings (there were no minutes or documentation of the decisions made at the November meeting), seek damages from Enterprise Storage Services contractor Unisys and make tweaks to its contract.
All in all, the tax day outage was a sorry tale of human error, inadequate procedures and being bitten on the ass by a system’s single point of failure. ®
* Weeks the code had been running multiplied by the number of boxes it is installed on.