'I crashed AOL for 19 hours and messed up global email for a week'
The great AOL outage of 1996, explained by a chap who made it happen
Who, me? Welcome again to Who, me? In case you've missed previous editions of the column, it's a confessional in which readers share their stories of having broken stuff. Badly.
This week, a fellow reader named "Bert" asked you to "Cast your mind back to the mid-90s when America Online (AOL) was the biggest online service and the biggest Internet Service Provider. We had a whole five million users!"
Bert was charged with keeping AOL email alive. Which was pretty challenging: on his first day on the job in 1995, he stood up the company’s fourth inbound mail gateway server.
A little over a year later Bert said AOL had "dozens and dozens of inbound Internet mail servers, and I was pulling my hair out trying to figure out how I could keep them all in operation and advertised as available, without causing problems."
MX records were his main problem, because “we had grown to the point where we had so many MXes that our DNS packets were too big to fit into a 512 byte UDP response packet, and were getting truncated.”
Truncated packets caused all sorts of problems, Bert told us, among them badly-out-of-balance loads.
"A small percentage of sites were unable to get mail to us at all, because their systems would refuse to talk to anyone if the DNS response had been truncated. Although a small percentage, it was still a fairly large number. Something clearly had to be done."
So Bert did it.
"I had the bright idea to list multiple IP addresses for a small number of regular names, so that we could take maximum advantage of domain name compression. Thus, a.mx.aol.com was born. And so was b.mx.aol.com, c.mx.aol.com, and so on. By the time of the disaster, we had listed nine names in the DNS as our MXes, with five IP addresses each. All 45 of our mail servers could be listed in the DNS as MXes for AOL and would still fit into a 512-byte UDP response packet – just barely. Mail was flowing again without DNS truncation, and almost everyone was happy."
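To see why fewer names helped, here is a rough Python sketch of DNS wire-format sizes under name compression. It is an estimate under stated assumptions, not a reconstruction of AOL's actual responses: it counts only the 12-byte header, the question and the MX answer records, ignoring the address and authority records a real response would also carry. The 45 individual host names (mx01.aol.com and so on) are hypothetical stand-ins, since the original names are not given.

```python
# Rough estimate of a DNS response carrying MX records (RFC 1035 layout).
# Assumption: only header + question + MX answer records are counted.

HEADER = 12     # fixed DNS header
RR_FIXED = 10   # TYPE(2) + CLASS(2) + TTL(4) + RDLENGTH(2)
POINTER = 2     # compression pointer to a name already in the packet


def encoded_len(name, seen):
    """Bytes needed for `name`, reusing any suffix already in the packet."""
    labels = name.rstrip(".").split(".")
    size = 0
    for i, label in enumerate(labels):
        suffix = ".".join(labels[i:])
        if suffix in seen:
            return size + POINTER
        seen.add(suffix)
        size += 1 + len(label)   # length byte + label text
    return size + 1              # terminating root label


def mx_response_size(domain, exchanges):
    """Estimated size of an MX response for `domain`."""
    seen = set()
    size = HEADER + encoded_len(domain, seen) + 4   # question: name + QTYPE + QCLASS
    for exchange in exchanges:
        owner = encoded_len(domain, seen)           # pointer back to the question name
        rdata = 2 + encoded_len(exchange, seen)     # preference + exchange name
        size += owner + RR_FIXED + rdata
    return size


nine_names = [f"{c}.mx.aol.com" for c in "abcdefghi"]
many_names = [f"mx{n:02d}.aol.com" for n in range(1, 46)]   # hypothetical names

print(mx_response_size("aol.com", nine_names))
print(mx_response_size("aol.com", many_names))
```

Under these assumptions the nine-name answer comes to 190 bytes, while 45 separately named MX records come to 970 bytes, well past the 512-byte limit before any address records are added.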
Those unhappy with Bert pointed out that he really needed a load balancing switch in front of the MXes. But Bert told us that "at the time the only one that existed had a single 10Base-T interface, and we had already graduated to using FDDI on the back-end of our mail servers because there was so much traffic."
"There was no way we could have crammed all that traffic through a tiny 10Mbps Ethernet port." So he didn’t.
But Bert did arrange for the AOL DNS team to have a backup off-site nameservice provider, "in case there was an extended outage so that people could still look up our information. This was also done by ANS."
The best-laid plans can go off the rails and Bert's did on August 7, 1996.
"Due to some mis-calculations and mis-communications between the AOL Network staff and subsidiary/backbone service provider ANS, we suffered a 19-hour network outage."
The outage hit email so hard because "when you tried to deliver the first mail message in your queue, you would look up our MXes, get back the list of 45 IP addresses, and start trying to make connections – just like normal. You'd try to connect to the first IP address of the first name, and ... wait. You'd wait for two minutes for that connection to timeout, as per the RFC. Then you'd try the next IP address. And wait. For two minutes. Then you'd try the next IP address."
"Do the math[s]," Bert suggested, because "this works out to 90 minutes to try to deliver a single mail message. Meanwhile, that mail process isn't doing anything else."
But people all over the Internet were doing something else: sending more emails. And because those emails couldn’t reach AOL – which, remember, was a huge slice of the internet population in 1996 – email servers “would fire off a queue runner every sixty or thirty minutes (by default), to go flush the queue. That queue runner would suck in the entire queue, and then go try and process each and every message, one by one."
"But if it got stuck for 90 minutes while trying to process a single message to AOL, then another queue runner would get started after thirty or sixty minutes, it would suck in the entire queue, be unable to process the one message you were already working on, but then it would probably also get stuck for 90 minutes while trying to send a second message to AOL."
Before long, servers would be out of RAM, start swapping data to disk, run out of swap space on disk and then – again, remember this was 1996 when servers were feeble – the server would probably reboot, try to fix its filesystem and then start sending mails again. And because the mails couldn’t get in to AOL … you get the picture.
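Bert's arithmetic and the pile-up of queue runners can be sketched in a few lines of Python. This is a toy model, not a description of any particular mail server: the function names are made up for illustration, and it assumes a new runner starts at every interval and that each runner blocks on a fixed number of AOL-bound messages before it finishes.

```python
def worst_case_minutes(names=9, ips_per_name=5, timeout_min=2):
    """Time spent timing out on every advertised address for one message."""
    return names * ips_per_name * timeout_min


def concurrent_runners(t_min, interval_min, stuck_messages, per_message_min=None):
    """Queue runners still blocked at minute `t_min` (toy model).

    Assumption: a runner starts at minute 0 and every `interval_min` after,
    and each one blocks on `stuck_messages` AOL-bound messages in turn.
    """
    if per_message_min is None:
        per_message_min = worst_case_minutes()
    lifetime = stuck_messages * per_message_min
    starts = range(0, t_min + 1, interval_min)
    return sum(1 for start in starts if t_min < start + lifetime)


print(worst_case_minutes())              # 9 names x 5 addresses x 2 minutes
print(concurrent_runners(600, 30, 1))    # one stuck message per runner
print(concurrent_runners(600, 30, 5))    # five stuck messages per runner
```

In this model a 30-minute interval leaves three runners overlapping even if each one is stuck on a single message, and fifteen if each is stuck on five, with every runner holding its own copy of the queue in memory.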
Bert told us that his errors led to hate mail from noted distributors of unsolicited commercial email, who were kind enough to share his phone number with the world and suggest he’d appreciate an angry call. Some of those made it through and Bert told us he thinks he received a death threat or two.
He assumes he got plenty of angry emails too but was blissfully unaware of just how many people hated him – because their emails couldn’t get through!
"I don't remember how many businesses I heard about that failed that day, because they weren't able to get to their e-mail and respond to a critical message that they had been sent," Bert told Who, me? AOL came back online after 19 hours. Bert told us that "it took about a week for mail traffic to ultimately return to normal" as the world waited for "the whole rest of the Internet to catch up on all their email backlogs."
Bert reckons some good came of the incident, because among the responses were development of the postfix mail transfer agents that made the global email systems rather more resilient.
It's hard to top Bert's story for sheer scale, but we're sure other readers will have fine tales of their mistakes to share. If you've such a tale, click here to contact us and you might pop up on a future Monday! ®