Bitbucket's Amazon DDoS - what went wrong
A cautionary cloud tale
After a DDoS attack brought down Bitbucket's web-based code-hosting service for more than 19 hours over the weekend, Jesper Nøhr speculated the attack had exposed a flaw in the sky-high Amazon infrastructure that hosts the site. Nøhr - who runs Bitbucket - has since spoken to an "Amazon executive" about the attack, and according to his account of the conversation, his earlier speculation was right on the money.
Bitbucket runs its entire site on Amazon's Elastic Compute Cloud (EC2), which provides scalable processing resources, and it uses Amazon's Elastic Block Store (EBS) to store its database, log files, user data, and more. EBS provides persistent storage for EC2 server instances. The problem, according to Jesper Nøhr, is that the storage system operates on a network channel that's exposed to the outside internet.
Bitbucket's Amazon setup worked well enough until late last Friday, when Nøhr realized EBS was "virtually unavailable." The outage persisted for more than 16 hours, in part because both Nøhr and Amazon's support reps assumed there was some sort of problem with EBS. According to Nøhr, the first rep he spoke to attributed the slowdown to the fact that EBS is a shared resource used by other bandwidth-hungry Amazon customers. In a statement sent to The Reg, Amazon gives a similar story.
"Over the weekend, one of our customers reported a problem with their Amazon Elastic Block Store (EBS)," the statement reads. "This issue was limited to this customer’s single Amazon EBS volume and other customers were not affected. We did not immediately look beyond the reported problem and spent too much time focusing on what was believed to be an issue with the Amazon EBS volume."
But as it turns out, Bitbucket's Amazonian infrastructure had been DDoSed. "We were attacked. Bigtime. We had a massive flood of UDP [User Datagram Protocol] packets coming in to our IP, basically eating away all bandwidth to the box," Nøhr wrote on his blog. "So, basically a massive-scale DDOS. That’s nice."
Once the cause of the problem was determined - more than 16 hours after the attack started - Amazon blocked the offending traffic, and things were soon back to normal. But Nøhr - and so many other netizens who followed the story - couldn't understand why a DDoS attack tied up what should have been "internal" storage resources.
Nøhr guessed that Bitbucket's storage sits on the same network interface that connects the site to the outside world, and according to Nøhr, this has been confirmed by Amazon. "We were speculating whether all the traffic was on the same interface, and [the Amazon EC2 executive] told us this was true," Nøhr told The Reg.
According to Nøhr, Amazon also told him that the company's Quality of Service (QoS) technology - meant to prioritize the storage traffic - did not work as the company expected. "They said they were supposed to prioritize EBS traffic over other traffic so we wouldn't be bogged down by external traffic," Nøhr says. "But they admitted it wasn't working the way they wanted it to."
Amazon has not responded to a request for comment on this specific issue. But an earlier statement from the company doesn't contradict what Nøhr has said.
"What we ultimately found was not a problem with Amazon EBS, but rather that the customer’s Amazon EC2 instance was receiving a very large amount of network traffic," the statement reads. "This large flood of traffic overwhelmed the networking of the customer’s single Amazon EC2 instance and caused performance to degrade on all I/O operations on the instance. Once we properly diagnosed the problem, we worked with the customer to put measures in place to help mitigate the unwanted traffic they were receiving."
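Why a flood of outside traffic can starve storage I/O on a shared interface is easy to see with some back-of-the-envelope arithmetic. The Python sketch below is a toy model only, not a description of Amazon's network: the function name, the packet rates, and the assumption that an overloaded link without prioritization drops packets in proportion to each class's share of the demand are all invented for illustration.

```python
def storage_throughput(capacity, storage_demand, flood_demand, prioritize_storage):
    """Storage packets delivered per tick on one shared link (toy model).

    All figures are hypothetical packets-per-tick values, not measurements.
    """
    if prioritize_storage:
        # Strict priority: storage traffic is served first,
        # and the flood only gets whatever capacity is left over.
        return min(storage_demand, capacity)

    total = storage_demand + flood_demand
    if total <= capacity:
        # The link is not saturated, so nothing is dropped.
        return storage_demand

    # Assumption: with no prioritization, a saturated link drops packets
    # regardless of class, so each class keeps its proportional share.
    return capacity * storage_demand / total


if __name__ == "__main__":
    link = 1000      # hypothetical link capacity
    storage = 200    # hypothetical storage traffic
    flood = 9800     # hypothetical UDP flood

    print("no flood:            ", storage_throughput(link, storage, 0, False))
    print("flood, no priority:  ", storage_throughput(link, storage, flood, False))
    print("flood, with priority:", storage_throughput(link, storage, flood, True))
```

In this model, storage traffic that fits comfortably on an idle link is squeezed to a small fraction of its demand once the flood arrives, unless it is served ahead of everything else. That is the behaviour Nøhr says Amazon's prioritization was meant to guarantee.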
Like many, Scott Morrison - chief architect and VP of engineering at Layer 7, a company that offers an outside security solution for Amazon's so-called cloud - finds it rather hard to believe that Amazon would put EBS on an outside net connection. "It seems like [EBS] shouldn't be externally accessible," he tells The Reg. "It's bizarre. That's sort of like making NFS mounts accessible outside your firewall - something you would never do."
The other problem with Amazon's setup, according to Jesper Nøhr, is that customers like him have no way of viewing the DDoS traffic hitting their sites - i.e. they have no way of identifying an attack. What's more, he says, Amazon told him that even the "Gold" support reps he initially spoke to didn't have a way of viewing the traffic.
"[Amazon] said that there is a department at Amazon that monitors such traffic, but [Amazon] said the first line of support can't see it," Nøhr says. "In short, you can't really see into the problem, because Amazon's Web Services is kind of a black box."
None too surprisingly, Layer 7's Scott Morrison calls this "a huge problem." Again, Amazon did not respond to a request for comment on this particular issue.
On Friday, Nøhr paid $400 to get access to Gold support, and to Amazon's credit, it has told Nøhr it will refund the money. Though he questions Amazon's setup, he feels that the company ultimately responded quite well to the problem. "Amazon has been very transparent with us and very apologetic. I don't want their name to be dragged through the mud."
Amazon does tell The Reg that such an attack might have been avoided if Bitbucket had been using additional Amazon services, such as the recently announced Elastic Load Balancing and Auto-Scaling. And Nøhr says the company told him much the same.
Nøhr says the company also told him that in the future, it would provide additional information about web traffic to customers and support personnel in an effort to better identify such attacks.
Nonetheless, says Layer 7's Scott Morrison, all this should serve as a cautionary tale for those eyeing the, um, cloud. "This is exactly what people have been warning about in the cloud for a while," he says. "Sure enough, here is the perfect example." ®