Router crash downs CloudFlare services
A lesson in disclosure
On Sunday, US time, prominent Web services outfit CloudFlare pushed an instruction to its routers in response to an attempted DoS attack, and instead took down its own network.
In a rare example of detailed disclosure, the company has posted an explanation of what happened here.
The network collapse occurred, the company explains, after it detected an attempted denial-of-service attack against a customer’s DNS servers using packets between 99,971 and 99,985 bytes long. That’s an oddity, CloudFlare notes, because it is far larger than the Internet’s typical packet length (500–600 bytes, according to the company) and larger than the 4,470-byte maximum packet it allows on its internal network.
So it wrote a JunOS rule (CloudFlare is a Juniper shop) to drop the packets, propagated the rule to its routers – and for reasons unknown, that rule crashed all the routers at which the instruction arrived.
“Flowspec accepted the rule and relayed it to our edge network. What should have happened is that no packet should have matched that rule because no packet was actually that large. What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed,” the blog post notes.
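On Juniper kit, a Flowspec rule of that shape is typically configured as a flow route. The following is a minimal sketch only — the rule name and destination address are hypothetical, and CloudFlare has not published the exact rule it deployed:

```
routing-options {
    flow {
        route drop-oversized {
            match {
                destination 203.0.113.53/32;   /* hypothetical customer DNS address */
                protocol udp;
                packet-length 99971-99985;     /* the suspicious size range */
            }
            then discard;
        }
    }
}
```

Once committed, a flow route like this is advertised over BGP to the routers in the network, which is why a single bad rule could reach CloudFlare’s entire edge at once.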
The crashes happened in such a way, CloudFlare says, that the routers didn’t reboot automatically, which meant that they couldn’t be accessed remotely; and worse, those routers that did wake back up copped the entire traffic load, couldn’t cope, and crashed again.
Accounts covered by SLAs will get credits, the company says, and it is investigating the problem with Juniper. ®
I think someone meant to do that.
Chris, I'll make you a bet: the packets weren't really "between 99,971 and 99,985 bytes long", they just had header fields saying they were. They sort of say as much when they say no packet should have matched the rule because no packets were actually that long. And that range of lengths was picked because the attacker knew a rule blocking them would crash the routers badly.
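For what it's worth, a packet can indeed *claim* a length it doesn't carry. An IPv4 Total Length field tops out at 65,535 bytes, so one way to advertise ~100 KB is an IPv6 jumbogram's Jumbo Payload option (RFC 2675). A sketch — all addresses and field values here are placeholders, and this is just one mechanism, not necessarily what the attacker used:

```python
import struct

# Build an IPv6 packet whose Jumbo Payload option claims a
# 99,971-byte payload, while the wire packet is only 48 bytes.
CLAIMED = 99_971

ipv6_header = struct.pack(
    "!IHBB16s16s",
    6 << 28,        # version 6, zero traffic class / flow label
    0,              # Payload Length = 0 signals "see Jumbo Payload option"
    0,              # Next Header 0 = Hop-by-Hop Options
    64,             # hop limit
    bytes(16),      # source address (placeholder)
    bytes(16),      # destination address (placeholder)
)
hop_by_hop = struct.pack(
    "!BBBBI",
    17,             # Next Header: UDP
    0,              # Hdr Ext Len 0 -> this extension header is 8 bytes
    0xC2,           # Option Type: Jumbo Payload
    4,              # Option Data Len
    CLAIMED,        # the advertised (bogus) payload length
)

packet = ipv6_header + hop_by_hop
print(len(packet))                            # actual bytes on the wire: 48
print(struct.unpack("!I", packet[44:48])[0])  # what the header claims: 99971
```

A filter that matches on the advertised length rather than the received byte count would see exactly the sizes CloudFlare reported, even though no such packet ever existed.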
Re: Bill Re: test?
".......maybe they didn’t have time to go through full testing?......" I've seen similar mistakes, usually they are a combination of management pressure - "fix that NOW" - and over-confidence in one's own ability. Many, many moons ago, there was a rumour of a ping of death for CISCO Catalyst routers (5000 models IIRC) and much argument amongst netties as to whether it would work or not. At company I was working for at the time, our network architect, having the authority to do as he pleased, was firmly in the "it-won't-work" camp and decided to test it against one of our routers, only to find not only did it work but it also propagated through all the same models in the network. Cue embarrassing and company-wide network outage which we definitely did not step up and explain to the customers!
Except this company sells a CDN product that is supposed to relieve stress on servers when they are under DoS and provide (and I quote) "Always Online™" and "Rock solid reliability", so that even if your server goes down, your visitors can still see your content.
So it's a bit embarrassing to skip testing and just roll out (I mean, pushing it to every router before you notice anything is wrong is a bit stupid, no matter what).
And I can attest that at least one site I'm aware of was down for quite a long time, despite using the CloudFlare CDN to keep itself online "no matter what", and was returning all sorts of errors even though the underlying origin servers were up. Next time, their accountants will be telling them to test before they deploy, I think.