Google Research: Three things that MUST BE DONE to save the data center of the future
Think data-center design is tough now? Just you wait
'We suck at microseconds'
The third of the three challenges that Barroso identified as hampering the development of highly responsive, massively scaled data centers is perhaps a bit counterintuitive: microsecond computing.
Addressing his audience of ISSCC chip designers, he said, "You guys here in this room are the gods of nanosecond computing – or maybe picosecond computing." Over the years, a whole raft of techniques has been developed to deal with latencies in the nanosecond range.
But problems remain with latencies that last far longer: those in the microsecond range. The internet and disk drives, for example, induce latencies at the millisecond level, and so far the industry has been able to deal with these by providing context-switching at the five-to-seven microsecond level. No problem, really.
However, Barroso said, "Here we are today, and I would propose that most of the interesting devices that we deal with in our 'landheld' computers are not in the nanosecond level, they're not in the millisecond level – they're in the microsecond level. And we suck at microseconds."
From his point of view, the industry sucks at handling microsecond latencies simply because it hasn't been paying attention to them while it has been focused on latencies at the nanosecond and millisecond levels.
As an example of a microsecond latency in a mega–data center, he pointed to the data center itself. "Think about it. Two machines communicating in one of these large facilities. If the fiber has to go 200 meters or so, you have a microsecond." Add switching to that, and you have a bit more than a microsecond.
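Barroso's 200-meter figure is easy to check on the back of an envelope. The sketch below is ours, not his: it assumes a typical fiber refractive index of about 1.5, which means light travels at roughly two-thirds of its vacuum speed, and the function name `fiber_delay_us` is purely illustrative.

```python
# Back-of-envelope one-way propagation delay over optical fiber.
# Assumption: refractive index of about 1.5, so light in the fiber moves
# at roughly two-thirds of its vacuum speed. Real fiber varies slightly.
SPEED_OF_LIGHT_M_PER_S = 299_792_458
FIBER_REFRACTIVE_INDEX = 1.5  # assumed typical value, not from the talk


def fiber_delay_us(length_m: float) -> float:
    """Return the one-way propagation delay in microseconds."""
    speed_in_fiber = SPEED_OF_LIGHT_M_PER_S / FIBER_REFRACTIVE_INDEX
    return length_m / speed_in_fiber * 1e6


print(f"{fiber_delay_us(200):.2f} us")  # prints "1.00 us"
```

That covers propagation only. Switching, serialization, and anything the software does come on top of it.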
Multiply those microsecond latencies by the enormous number of communications among machines and switches, and you're talking about a large aggregate sum. Add flash storage – not to mention the higher-speed, denser non-volatile memory technologies of the future – and you're going to see even more microsecond latencies that need to be dealt with in the mega–data center.
The hardware-software solution
"Where this breaks down," Barroso said, "is when people today at Google and other companies try to build very efficient, say, messaging systems to deal with microsecond-level latencies in data centers."
The problem today is that when programmers want to call a system one microsecond away and have it respond with data that takes another microsecond to return, they typically reach for a remote procedure call library or a messaging library rather than issue, say, a direct RDMA operation in a distributed system.
When using such a library, he said, "Those two microseconds quickly went to almost a hundred microseconds." Admittedly, some of that problem is that software is often unnecessarily bloated, but Barroso says that the main reason is that "we don't have the underlying mechanisms that make it easy for programmers to deal with microsecond-level latencies."
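To put that in proportion, here is a rough calculation using only the two figures quoted: about two microseconds on the wire and, rounding Barroso's "almost a hundred," 100 microseconds observed. The helper name `software_overhead` is ours and purely illustrative.

```python
# Figures from the talk: about 1 us each way on the wire, and "almost a
# hundred microseconds" once an RPC or messaging library is in the path.
WIRE_ROUND_TRIP_US = 2.0
OBSERVED_ROUND_TRIP_US = 100.0  # rounded up; an assumption for the sketch


def software_overhead(wire_us: float, observed_us: float):
    """Return (overhead in us, slowdown factor, overhead share of total)."""
    overhead_us = observed_us - wire_us
    return overhead_us, observed_us / wire_us, overhead_us / observed_us


overhead_us, slowdown, share = software_overhead(
    WIRE_ROUND_TRIP_US, OBSERVED_ROUND_TRIP_US
)
print(f"{overhead_us:.0f} us overhead, {slowdown:.0f}x slowdown, "
      f"{share:.0%} of the total")
# prints "98 us overhead, 50x slowdown, 98% of the total"
```

On those numbers, nearly all of the round trip is spent somewhere other than the network.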
This is a problem that will have to be dealt with at both the hardware and the software levels, he said – and by the industry treating microsecond-level latencies as a first-order problem when designing systems for the mega–data center.
All three of these challenges are about creating highly scalable data centers that can accomplish the goal of "big data, little time" – but despite the fact that he was speaking at an ISSCC session during which all the other presenters spoke at length about the buzzword du jour, Barroso refused to be drawn into the hype-fest.
"I will not talk about 'big data' per se," he said. "Or to use the Google internal term for it: 'data'." ®