The Outage At Delta Airlines Is A Lesson In How NOT To Manage Critical Systems


Earlier this month, Delta Air Lines suffered an outage that left passengers grounded worldwide. After blundering through its own systems for a bit, Delta finally stepped forward and announced that the outage was caused by a power failure in a Verizon data center. Yes, you read that right – losing power in a single data center brought down an entire airline.

Baffling, isn’t it?

“Monday morning a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power,” explained Delta COO Gil West, in an interview with The Week. “When this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.”

This incident embodies the domino effect in the worst possible way. A malfunctioning piece of equipment caused a power surge. That power surge tripped a transformer, which just so happened to be supplying power to Delta’s command center in Atlanta.

And although Delta evidently had backup systems in place and had planned for failover, for one reason or another it didn’t happen. When the Atlanta facility was knocked offline, the airline’s passenger information system, instead of switching over to a backup server, simply…died.
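For readers who want to see what "switching over to a backup server" means in practice, here is a minimal sketch of the idea in Python. It is not how Delta's systems work; the server names and the `is_healthy` check are hypothetical stand-ins, and a real failover setup involves far more (data replication, DNS or load balancer changes, and so on).

```python
from typing import Callable, Optional, Sequence


def pick_active(servers: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first server that passes its health check, or None."""
    for server in servers:
        try:
            if is_healthy(server):
                return server
        except Exception:
            # A health check that raises is treated the same as a failed one.
            continue
    return None


# Hypothetical scenario: the primary site has lost power.
down = {"atlanta-primary"}
active = pick_active(["atlanta-primary", "backup-site"],
                     lambda name: name not in down)
print(active)
```

The point of a sketch like this is that the fallback path is just code and configuration, and like any code it only works if someone actually runs it before the day it is needed.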

“How could a company as technologically savvy and mature with its business processes as Delta not have a working disaster recovery plan? It’s a fair question, and we are still waiting to learn the details,” writes Seacoast Online’s MJ Shoer. “I can’t for a minute believe Delta does not have a disaster recovery plan to deal with an event like this, but it failed. That begs the question as to when it was last tested and how often this plan is reviewed, revised and retested.”

It should also bring up the question of whether or not your disaster recovery plan is up to date – when’s the last time you tested, revised, or reviewed it?

Don’t get too critical of Delta if you can’t answer that question. You could easily be in their position somewhere down the line. Delta’s large enough to absorb the losses from this downtime, at least to an extent – is your business?

Of course, a bad disaster recovery plan isn’t the only reason this fiasco happened. Delta reportedly outsources virtually all of its IT services, a fact that hurt it immensely in this case – the majority of its IT team may have been offshore, and unable to physically access the Atlanta data center. You can imagine how that might be an issue.

“If the IT support team is thousands of miles away, the process for restarting hundreds — perhaps thousands — of systems can be slow and painful,” writes Robert Cringely of Beta News. “If you lose the data link between your support team and the data center due to that same power outage your support team can do nothing until the data link is fixed.”

Basically, there are two lessons here. First, ensure you have a working, tested, and vetted disaster recovery plan in place. Second, if you’re running your own data center, make sure you have enough staff onsite to deal with any problems that might arise.

