The system outage that left thousands of Delta Air Lines passengers around the world facing flight cancellations and delays on Monday shows how computer-dependent society has become — and airlines have to decide if their backup technologies are good enough to deal with that reality, a Canadian computer networking expert says.
"We've kind of painted ourselves into a corner where we must rely on computer systems," said Srinivasan Keshav, a professor of computer science at the University of Waterloo.
"[But] we have now been able to build systems which are very tolerant of losses, of parts of the system being taken down."
The key, Keshav said, is to adopt the model that technology leaders like Google have — known as "system fault tolerance," which assumes any single component in a computer network can fail at any time, but it doesn't matter because there are multiple backup measures in place at every level of the system.
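The model Keshav describes can be illustrated with a short sketch. The Python below is a simplified illustration only, not how Google or any airline actually builds its systems. The `Server` class, the `ServerDown` exception and the `fault_tolerant_call` function are hypothetical names invented for this example. It shows a request being handed to the next replica whenever one has failed, so a single dead server is never visible to the user.

```python
class ServerDown(Exception):
    """Raised when a (hypothetical) server cannot answer."""


class Server:
    """A stand-in for one machine in a pool of identical replicas."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def handle(self, request):
        if not self.alive:
            raise ServerDown(self.name)
        return f"{self.name} handled {request}"


def fault_tolerant_call(servers, request):
    """Try each replica in turn; fail only if every replica is down."""
    for server in servers:
        try:
            return server.handle(request)
        except ServerDown:
            continue  # a dead server is treated as normal: try the next one
    raise RuntimeError("all replicas are down")


# Two of three replicas have failed, yet the request still succeeds.
pool = [Server("s1", alive=False), Server("s2", alive=False), Server("s3")]
print(fault_tolerant_call(pool, "lookup booking"))  # s3 handled lookup booking
```

The request fails only when every replica is down at once, which is the situation that redundancy at each level is meant to make unlikely.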
"Failures are not exceptions. Failures are kind of normal," Keshav said, noting that companies like Google or Amazon have dozens of servers "dying every day," but with upward of 100,000 servers on hand, the systems don't crash.
Power outage a 'surprising' cause
Delta Air Lines said the cause of Monday's mess was a power outage at its base in Atlanta, Ga., at around 2:30 a.m. ET. In a statement posted online Monday afternoon, the airline said systems were once again "fully operational" and flights had "resumed hours ago but delays and cancellations remain as recovery efforts continue."
The fact that a power outage was to blame is "surprising," Keshav said, because "it's the one thing you wouldn't expect to have happen because that's easy to get right."
Airline data centres usually have two layers of backup — diesel generators and batteries — to protect "critical systems," he added.
"When you look at a complex computer system such as the one that Delta runs, there's many layers of the cake, so to speak. At the bottom is power," Keshav said.
Mark Duell, vice-president of operations for the global aviation tracking website FlightAware, said airlines "go to great lengths" to make sure backup systems, including several power sources, are in place in their data centres.
"Everything from bringing in power from the utility on opposite literal sides of the building, just so [a] single backhoe can't take them both out at the same time; having more generators than they need so that they don't need all the generators to be operable; having... multiple battery backup systems internally to cover everything until the generators come online," Duell said.
"And then down to the point of literally each computer, each server in the data centre is plugged into two different power strips and has two power supplies that are redundant."
Although he doesn't know specifically what happened in Delta's case, Duell said it was likely that the problem extended beyond a basic utility failure, since the batteries and generator backups should have kicked in.
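Duell's reasoning can be made concrete with a small sketch. The Python below is a deliberately simplified snapshot model based on the layers he describes (two utility feeds, battery backup and spare generators). The function `has_power` and its parameters are assumptions made for illustration, and the model ignores timing, such as batteries only bridging the gap until generators start. It is not a description of Delta's actual setup.

```python
def has_power(utility_feeds, batteries_ok, generators_running, generators_needed):
    """Return True if at least one layer can carry the load (simplified snapshot)."""
    if any(utility_feeds):  # at least one utility feed is still live
        return True
    if generators_running >= generators_needed:  # spare generators are tolerated
        return True
    return batteries_ok  # last layer: battery backup


# One utility feed cut (the "backhoe" case): the second feed keeps things running.
print(has_power([False, True], True, 0, 3))    # True
# Both feeds down and one of four generators broken, but only three are needed.
print(has_power([False, False], True, 3, 3))   # True
# Both feeds down, too few generators and batteries exhausted: several failures at once.
print(has_power([False, False], False, 1, 3))  # False
```

In this model the lights go out only in the last case, where every layer has failed at the same time.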
"It was probably more than one failure," he said.
Safety not at risk
Both Duell and Keshav emphasized that the computer system outage would not have posed a risk to passengers in flight.
"The airplane is entirely independent of the ground in terms of continuing to fly," Duell said.
That's because airlines use "decoupling" in computer system design, Keshav said, meaning systems involved in actually operating the aircraft are independent from other systems like reservations or flight schedules.
The reason a system outage like this one has such an impact, Duell said, is that airlines stop and cancel flights for safety reasons when they can't get access to important computerized information such as passenger counts, how much baggage has been checked or fuelling records.
"You run into those sorts of dependencies where they can't move things, but anything already moving is not in any real danger," he said.
'Critically examine' infrastructure
Delta isn't the only airline to have experienced a recent system failure.
Last month, Southwest Airlines cancelled more than 2,000 flights over several days after an outage that it blamed on a faulty network router.
United Airlines has suffered a series of delays since its merger with Continental, as the two airlines' technological systems clashed.
"It's something that happens from time to time," Duell said. "There's no particular airline that is immune to these [problems], and from what we've seen, there's none that are particularly prone to these."
Keshav said he doesn't know what measures specific airlines have already taken, but such large-scale failures could be prevented if airlines invest in rigorous systems "that tolerate fault and assume faults are going to happen."
But that would entail expensive and complex engineering, requiring the replacement of legacy systems built years ago, he said.
"Banks, airlines, things like that which have been around for a while... need to at some point critically examine their infrastructure."