What the ATO’s HPE Failures Teach Us About Disaster Recovery

What Happened at the ATO

The Australian Taxation Office (ATO) experienced a multi-day outage caused by the failure of new HPE 3PAR storage units.

Hewlett Packard Enterprise (HPE) 3PAR systems use solid-state flash storage, together with virtualisation and cloud resources, to promise faster processing speeds. In a fast data-processing environment like the ATO, solid-state drives (SSDs) tend to be more beneficial than hard disk drives (HDDs) because they transfer files quickly and sustain higher bandwidth. SSDs are also known to better withstand physical damage, shock and extreme temperatures.

In November 2015, the ATO migrated core infrastructure management systems, transferring their data storage capability to a new HPE 3PAR SAN (storage area network) in Sydney. A SAN is a specialised, high-speed network, often used to improve application availability, performance, and data storage and security. SANs typically play an important role in an organisation’s business continuity.

Unfortunately, in December 2016, the ATO’s disaster recovery plan failed. One petabyte of data was lost because the automatic failover to the second SAN did not come online. (For perspective: 1 petabyte is 1,000 terabytes, or one million gigabytes; counted in binary units, it is 1,024 tebibytes.) On top of that, corrupted storage blocks on the main SAN had been faithfully copied to the second SAN.
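The size comparison above depends on whether decimal (SI) or binary prefixes are meant, which is why figures such as "1,024 terabytes" and "one million gigabytes" do not quite line up. A minimal sketch of the arithmetic, with variable names chosen here purely for illustration:

```python
# Decimal (SI) prefixes: each step is a factor of 1,000.
PETABYTE = 10**15            # bytes in 1 PB
tb_per_pb = PETABYTE // 10**12
gb_per_pb = PETABYTE // 10**9

# Binary (IEC) prefixes: each step is a factor of 1,024.
PEBIBYTE = 2**50             # bytes in 1 PiB
tib_per_pib = PEBIBYTE // 2**40
gib_per_pib = PEBIBYTE // 2**30

print(tb_per_pb, gb_per_pb)      # 1000 1000000
print(tib_per_pib, gib_per_pib)  # 1024 1048576
```

Either way, the order of magnitude is the same: roughly a million gigabytes of data.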

Luckily, the ATO had another backup source, so the data loss was not total or permanent. However, a second, five-day outage occurred in February 2017 as a result of the ATO’s efforts to stabilise the faulty SAN. Online tax services also had to be taken down for several weekends during April and May while the old SAN was swapped out for new equipment.

How Did the ATO’s System Failures Occur?

At first, the original failure seemed to be due to either defective firmware or human error. Later speculation pointed more firmly to human error, with HPE believing that cabling was damaged when the arrays were moved. The official report confirms that the fibre cables used were not “optimally fitted”, that the drives used in the SAN had software bugs that made data inaccessible, and that monitoring features were not turned on.

Most concerning is that analysis of the SAN log data preceding the December 2016 incident had already indicated potential issues. From May 2016 onward, at least 77 events relating to failed components were observed, and at least 159 alerts were recorded in SAN device monitoring and management logs. The ATO had no direct access to the SANs, and had been kept somewhat in the dark over how serious the SAN configuration issues were.
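Alerts sitting in a log only help if someone, or something, is counting them. The sketch below shows one simple way to tally log entries by severity and flag when a threshold is crossed. The log format (severity as the first word of each line), the sample lines, and the threshold are all invented for illustration; they are not taken from the ATO’s or HPE’s systems.

```python
from collections import Counter

def count_by_severity(log_lines):
    """Tally lines by their first word, assumed here to be the severity."""
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=1)
        if parts:  # skip blank lines
            counts[parts[0].upper()] += 1
    return counts

def needs_escalation(counts, threshold, severities=("ALERT", "FAILED")):
    """Return True once the watched severities reach the threshold."""
    return sum(counts[s] for s in severities) >= threshold

# Hypothetical sample lines, made up for this example.
sample = [
    "ALERT disk cage 3 degraded",
    "INFO scheduled scrub finished",
    "FAILED drive 12 removed from service",
    "ALERT fibre link errors rising",
]

counts = count_by_severity(sample)
print(counts["ALERT"], counts["FAILED"])      # 2 1
print(needs_escalation(counts, threshold=3))  # True
```

A real deployment would rely on the storage vendor’s own monitoring and notification features; the point is only that accumulated warnings should trigger a human review long before they reach the dozens.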

The Aftermath of the ATO’s Disasters

These long outages were costly, especially for accountants. While HPE compensated the ATO with new systems, accountants who bill by the hour and were unable to work for days have received no compensation.

The impact of the outages varied significantly, depending on how much compliance work each practice performed. Accountants have not been able to make claims over the system outages, and many have been left frustrated by the ATO’s digital disruption.

A concern echoed across the accounting community is the steady decline of the ATO’s systems. The ATO may have been operating out of naivete, but its users were feeling the systems’ weaknesses. The biggest complaint has been that, month after month, the ATO’s systems have become more unreliable, causing unavoidable inefficiencies.

Chris Jordan, Commissioner of Taxation, has stated the ATO’s “business continuity mechanisms, communication and engagement…into the future need to be more inclusive of our partners, such as software providers, tax professionals, and the superannuation industry.” In the official report, he also apologised for the inconvenience and expressed his confidence in Tax Time 2017.

Since February, the ATO has taken steps toward improving its internal IT systems, rebooting its mainframe and fixing the way it runs its applications. It has continued to experience intermittent outages, even as recently as last month. These outages are apparently unrelated to hardware issues or the storage area network, but they are still frustrating for businesses and taxpayers.

The Real Cost of Downtime

We often think about the cost of downtime in dollars – how many thousands it costs your business per minute. However, downtime often costs something much more important: trust and reputation.

IT faults have undermined the ATO’s reputation and left tax professionals facing repeated, intermittent disruptions. These technical problems result in lost productivity, leaving some staff completely unable to help clients for a full day. The cost of downtime therefore extends past the ATO’s reputation to the reputations of accountants who are unable to service their small business clients.
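The dollar side of that cost is simple arithmetic, and it can be worth running the numbers for your own business. A minimal sketch follows; the function names and every figure in it are placeholders for illustration, not data from the ATO outages.

```python
def outage_cost(minutes_down, cost_per_minute):
    """Direct cost of an outage at a flat per-minute rate."""
    return minutes_down * cost_per_minute

def lost_billable_value(staff, hours_lost_each, hourly_rate):
    """Billable revenue lost when staff cannot work during an outage."""
    return staff * hours_lost_each * hourly_rate

# Placeholder figures: a two-hour outage at $1,000 per minute, and
# five staff each losing an eight-hour day billed at $150 per hour.
print(outage_cost(minutes_down=120, cost_per_minute=1000))                # 120000
print(lost_billable_value(staff=5, hours_lost_each=8, hourly_rate=150))   # 6000
```

Neither figure captures the harder-to-measure loss of trust, which is why the dollar amount should be read as a floor rather than a total.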

Depending on your business type, your downtime can also end up meaning costly downtime for your clients. And a poor reputation can be just as difficult, if not more so, to recover from than lost data or a systems outage.

If your business depends on other businesses to perform your services, you can and should still create a continuity plan to increase redundancy and minimise potential downtime wherever possible. Conversely, if your business is one that other businesses depend on, a disaster recovery and data backup plan is critical to limit or eliminate disruption for both you and your customers.
