
Microsoft's FDS data-sorter crushes Hadoop

Getting Bing to sit up and bark

Techies at Microsoft Research, the big brain arm of the software goliath, have taken the crown in the sorting benchmark world. The researchers are thinking about how to implement new sorting algorithms in the Bing search engine to give Microsoft a leg up on the MapReduce algorithms that underpin Google's search engine and other big data-munching applications.

In the data world, the benchmarks against which you measure your performance are collectively known as the Sort Benchmarks. The two original benchmarks, proposed in 1994 by techies at Digital Equipment – led by Jim Gray, who administered the Sort Benchmarks until he was lost at sea in 2007 – were MinuteSort and PennySort, and they are still used today. (Gray originally worked on the VAX and AlphaServer lines at DEC, but eventually moved to Microsoft.)

The MinuteSort test counts how many bytes of data you can sort using the benchmark code in 60 seconds, and the PennySort test measures the amount of data you can sort for a penny's worth of system time on a machine or cluster set up to run the test. There used to be other tests, such as the TeraByte Sort, which measured how long it took to sort through 1TB of data, but as servers, storage, and networks have progressed, MinuteSort has become the new TeraByte Sort. Or at least it has now, with the Microsoft team breaking through the terabyte barrier with its Flat Datacenter Storage scheme.
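
As a rough illustration of what MinuteSort actually measures, here is a minimal single-machine sketch in Python. The 100-byte records with 10-byte keys follow the standard Sort Benchmark record format; the in-memory sort and the file handling are simplifications for illustration, nothing like the tuned external sorts that set records.

```python
import time

RECORD_LEN = 100        # Sort Benchmark records are 100 bytes...
KEY_LEN = 10            # ...sorted on a 10-byte key prefix
BUDGET_SECONDS = 60     # MinuteSort: how much can you sort in a minute?

def minute_sort(in_path, out_path):
    """Toy MinuteSort-style run: sort the input records and report how many
    bytes were sorted, provided the whole run fits in the 60-second budget."""
    start = time.time()

    with open(in_path, "rb") as f:
        data = f.read()
    records = [data[i:i + RECORD_LEN] for i in range(0, len(data), RECORD_LEN)]
    records.sort(key=lambda r: r[:KEY_LEN])

    # A real run must also write the sorted output back to stable storage.
    with open(out_path, "wb") as out:
        out.writelines(records)

    elapsed = time.time() - start
    sorted_gb = len(records) * RECORD_LEN / 1e9 if elapsed <= BUDGET_SECONDS else 0
    print(f"sorted {sorted_gb:.3f} GB in {elapsed:.1f}s")
```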

Microsoft has just stomped the living daylights out of a Hadoop cluster that was the previous record-holder on the MinuteSort test, and did so by substantially beefing up the network capacity between server nodes and storage and essentially chucking the whole MapReduce approach to data munching out the window.

With Hadoop and its MapReduce approach, you have data scattered and replicated around the cluster for both availability and algorithmic reasons, and you dispatch the computing to the server nodes where the data you need to process lives – instead of trying to move data from a particular piece of storage to a server. This approach is what allows search engines like those developed by Google, Yahoo!, and Microsoft (which used to be distinct) to mine massive amounts of clickstream data to serve you the most appropriate web pages and advertisements. But Hadoop only scales to about 4,000 nodes, and it is a batch-oriented program, not something that looks and feels like real time.
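
To make the ship-the-compute-to-the-data idea concrete, here is a toy, single-process sketch of the kind of partitioned sort a MapReduce job performs across thousands of nodes: mappers work on the chunks already stored near them and route records to reducers by key range, and each reducer sorts only its own slice. The function names and the range partitioner are illustrative assumptions, not Hadoop's actual API.

```python
from collections import defaultdict

KEY_LEN = 10  # sort on a 10-byte key prefix, as in the Sort Benchmark records

def range_partition(key, num_reducers):
    """Range-partition on the first key byte so reducer i holds smaller keys
    than reducer i+1; the final output is then just the reducer outputs
    concatenated in order."""
    return min(key[0] * num_reducers // 256, num_reducers - 1)

def map_phase(chunks, num_reducers):
    """Each 'mapper' reads the chunk already sitting on its node and routes
    every record to a reducer by key range (the shuffle)."""
    partitions = defaultdict(list)
    for chunk in chunks:                       # one chunk per node, in spirit
        for record in chunk:
            partitions[range_partition(record[:KEY_LEN], num_reducers)].append(record)
    return partitions

def reduce_phase(partitions):
    """Each 'reducer' sorts only the slice of the key space it was assigned."""
    return [sorted(partitions[r]) for r in sorted(partitions)]
```

Run over, say, four lists of random 100-byte records, the reducer outputs concatenated in order form a fully sorted dataset; the point is that each node only ever touches the chunk it already holds plus its own slice of the key space.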

Microsoft is not releasing the details of its Flat Datacenter Storage approach yet; it may never do so because it gives the company a competitive advantage. But the company did provide some clues as to how it was able to beat a Hadoop cluster configured by Yahoo! – the outfit that cloned Google's 2004 MapReduce methodology and file system to create Hadoop and the Hadoop Distributed File System – by nearly a factor of three in terms of performance and, according to a blog post, using one-sixth the number of servers.

The Flat Datacenter Storage effort is headed up by Jeremy Elson, who works in the Distributed Systems and Networking Group at Microsoft Research. The MinuteSort run by Elson's team, on 250 machines configured with 1,033 disk drives, ripped through and sorted 1,401 gigabytes of data in 60 seconds, handily beating a 2009 Yahoo! Hadoop configuration that used 1,406 nodes and 5,624 disks to process 500GB in a minute.
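
The headline ratios fall straight out of those figures; here is a quick back-of-the-envelope check in Python, using only the numbers quoted above:

```python
# Figures quoted above: the FDS run versus Yahoo!'s 2009 Hadoop record
fds_gb, fds_nodes = 1401, 250
hadoop_gb, hadoop_nodes = 500, 1406

print(fds_gb / hadoop_gb)                                 # ~2.8x raw throughput ("nearly a factor of three")
print(fds_nodes / hadoop_nodes)                           # ~0.18, i.e. roughly one-sixth the servers
print((fds_gb / fds_nodes) / (hadoop_gb / hadoop_nodes))  # ~15.8x sorted per server ("a factor of 16")
```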

It is not clear that Yahoo! could have fielded a better result using a more modern version of the Apache Hadoop software and shiny new x86 iron. (By the way, this sort was on the "Daytona" version of the benchmark, which is based on using stock code, not the "Indy" version of the test, which allows for more exotic algorithms. A team at the University of California San Diego won the Indy MinuteSort race last year with a 52-node cluster of HP ProLiant DL360 G6 servers and a Cisco Systems Nexus 5096 switch. This TritonSort machine at UCSD (PDF) was able to sort 1,353GB of data in 60 seconds.)

To get its speed, the Flat Datacenter Storage team grabbed another technology from Microsoft Research, called full bisection bandwidth networks; specifically, each node in the cluster could transmit data at 2GB/sec and receive data at 2GB/sec without interruption. "That’s 20 times as much bandwidth as most computers in data centers have today," Elson explained in the blog, "and harnessing it required novel techniques".
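
A quick sanity check on that "20 times" figure, assuming the commodity data-centre NIC of the day ran at 1Gbit/sec (an assumption on our part, not a number from Elson's post):

```python
fds_node_bw_gbytes = 2.0        # per node, each direction, in GB/sec
typical_nic_gbytes = 1.0 / 8    # a 1Gbit/sec NIC expressed in GB/sec

# 16.0 -- in the same ballpark as the quoted "20 times"
print(fds_node_bw_gbytes / typical_nic_gbytes)
```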

And the Daytona car beat the Indy car this time around.

Microsoft used an unnamed remote file system, linked to that full bisection bandwidth network, to feed data to all of the nodes in the cluster running the MinuteSort test, which is the way such sorting benchmarks were done before the MapReduce method came along. MapReduce is great for certain kinds of data-munching, like when a set of data can fit inside a single server node. But, as Elson points out, what happens when you have two very large data sets that you want to merge and then chew on? How do you do that with MapReduce? The data has to move, and it has to move to somewhere that the systems doing the sort can get at it at very high speed.
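
The two-big-datasets case Elson raises boils down to a streaming merge: neither input fits on one node, so both have to be pulled off remote storage and merged on the fly, which is exactly where fast remote reads pay off. A minimal single-process sketch, assuming the inputs are already-sorted, newline-delimited files (stand-ins for data served from a remote store):

```python
import heapq

def merge_sorted_streams(path_a, path_b, out_path):
    """Stream-merge two sorted inputs without loading either into memory.
    In an FDS-style setup both inputs would be read from remote storage at
    near-local speed; plain local files stand in for that here."""
    with open(path_a) as a, open(path_b) as b, open(out_path, "w") as out:
        out.writelines(heapq.merge(a, b))
```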

The Microsoft Research team that developed the Flat Datacenter Storage algorithm is presenting its results at the 2012 SIGMOD/PODS Conference in Scottsdale, Arizona this week.

The research behind the new sorting method was sponsored in part by Microsoft's Bing team because it can be applied to search engine results as well as to gene sequencing and stitching together aerial photographs. The company is pretty keyed up that it can get a factor of 16 improvement in the efficiency of sorting per server using a remote (and unnamed) file system compared to Hadoop and its HDFS. ®
