Secret IBM script could have prevented 11-hour US tax day outage

The April 2018 US tax day outage was due to a faulty IBM disk array and could have been avoided with a secret IBM script.


Online tax filing was held up for 11 hours on the last filing day of the 2018 tax year, and the IRS had to extend the filing period by another day.

The tax filing system is mainframe-based and uses several high-availability disk arrays, with Unisys and IBM as primary and secondary contractors respectively, under the IRS Enterprise Storage Services (ESS) 10-year agreement, signed in 2012.

According to a US government report, one of these arrays suffered a deadlock condition after a warmstart due to a cache overflow. It alerted the IRS admin staff at 2.24am and sent a call-home alert message to IBM at 2.57am.

Amazingly, it was classed as a Severity Level 3 alert, with a response due by the end of the next business day.

More IRS systems were affected by 3.30am and the problem kept spreading. By 7.45am 79 systems were screwed, and a major outage was declared at 9.45am. A remediating script was developed by 1.40pm, limited tax return filing started at 3pm and full filing resumed at 5pm.

The root cause firmware bug was discovered by IBM nine months earlier, in June 2017, with a microcode fix made available in November 2017. But Unisys recommended that the fix should not be applied during the 2018 tax year filing period because it had not been tested enough. The IRS agreed.

What neither knew was that another IBM customer had experienced the same bug four months before the IRS outage, in January 2018, and IBM had developed and deployed a preventative script which fixed it. But Big Blue told neither the IRS nor Unisys about this.

The report calls into question some of the decisions made by the IRS and its contractors. First, the IRS tax filing system, classed as a Tier 1 storage environment, has no automatic failover or built-in redundancy and is currently a single point of failure. This is now being fixed.

Secondly, Unisys failed to meet several service level objectives (SLOs) on the outage day:

[Image: Unisys SLO performance on the outage day]

The IRS has received damages over this. All in all, the tax day outage was a sorry tale of human error, inadequate procedures and being bitten on the ass by a system’s single point of failure. B&F