Microsoft's FDS data-sorter crushes Hadoop

Getting Bing to sit up and bark

Techies at Microsoft Research, the big brain arm of the software goliath, have taken the crown in the sorting benchmark world. The researchers are thinking about how to implement new sorting algorithms in the Bing search engine to give Microsoft a leg up on the MapReduce algorithms that underpin Google's search engine and other big data-munching applications.

In the data world, the benchmarks against which you measure your performance are collectively known as the Sort Benchmarks. The two original benchmarks, proposed in 1994 by techies at Digital Equipment led by Jim Gray, who administered the Sort Benchmarks until he was lost at sea in 2007, are MinuteSort and PennySort, and they are still used today. (Gray originally worked on the VAX and AlphaServer lines at DEC, but eventually moved to Microsoft.)

The MinuteSort test counts how many bytes of data you can sort using the benchmark code in 60 seconds, and the PennySort test measures the amount of data you can sort for a penny's worth of system time on a machine or cluster set up to run the test. There used to be other tests, such as the TeraByte Sort, which measured the time it took to sort 1TB of data, but as servers, storage, and networks have progressed, the MinuteSort has become the new TeraByte Sort. Or at least now it has, with the Microsoft team breaking through the terabyte barrier with its Flat Datacenter Storage scheme.
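For a feel of what a MinuteSort entry is actually doing, here is a minimal single-machine sketch in Python. It assumes the Sort Benchmark's usual data layout of 100-byte records ordered by a 10-byte leading key; the function names and the simplified scoring rule are hypothetical and are not the official benchmark code.

```python
import os

RECORD_SIZE = 100  # bytes per record (assumed Sort Benchmark layout)
KEY_SIZE = 10      # leading bytes of each record used as the sort key


def split_records(data: bytes) -> list[bytes]:
    """Cut a byte string into fixed-size records."""
    if len(data) % RECORD_SIZE != 0:
        raise ValueError("input length must be a multiple of RECORD_SIZE")
    return [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]


def sort_records(data: bytes) -> bytes:
    """Return the same records, ordered by their 10-byte keys."""
    records = split_records(data)
    records.sort(key=lambda rec: rec[:KEY_SIZE])
    return b"".join(records)


def minutesort_score(bytes_sorted: int, elapsed_seconds: float,
                     budget_seconds: float = 60.0) -> int:
    """Simplified scoring: a run counts only if it fits in the time budget."""
    return bytes_sorted if elapsed_seconds <= budget_seconds else 0
```

A real entry spreads this work across hundreds of machines and disks; the sketch only shows the record format and what "bytes sorted in 60 seconds" is counting.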

Microsoft has just stomped the living daylights out of a Hadoop cluster that was the previous record-holder on the MinuteSort test, and did so by substantially beefing up the network capacity between server nodes and storage and essentially chucking the whole MapReduce approach to data munching out the window.

With Hadoop and its MapReduce approach, you have data scattered and replicated around the cluster for both availability and algorithmic reasons, and you dispatch the computing to the server nodes that hold the data you need to process, instead of trying to move data from a particular piece of storage to a server. This approach is what allows search engines like those developed by Google, Yahoo!, and Microsoft (the last two of which used to be distinct) to mine massive amounts of clickstream data to serve you the most appropriate web pages and advertisements. But Hadoop scales to only about 4,000 nodes and it is a batch-oriented program, not something that looks and feels like real time.
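The "send the compute to the data" idea can be illustrated with a toy, in-memory sketch. This is not Hadoop's API: the node names, the click data, and the helper functions are all made up for illustration, and each "node" is just a Python list.

```python
from collections import Counter

# Hypothetical clickstream partitions, one per server node.
node_partitions = {
    "node-a": ["/home", "/news", "/home"],
    "node-b": ["/news", "/sport"],
    "node-c": ["/home", "/sport", "/sport"],
}


def map_local(clicks: list[str]) -> Counter:
    """Map step: runs where the data lives and emits small partial counts."""
    return Counter(clicks)


def reduce_counts(partials: list[Counter]) -> Counter:
    """Reduce step: only the partial counts travel, not the raw clicks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


partials = [map_local(clicks) for clicks in node_partitions.values()]
totals = reduce_counts(partials)
```

The point of the design is that the bulky raw data never leaves its node; only the compact intermediate results cross the network.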

Microsoft is not releasing the details of its Flat Datacenter Storage approach yet, and it may never do so because the technology gives the company a competitive advantage. But the company did provide some clues to how it was able to beat a Hadoop cluster configured by Yahoo!, which cloned Google's 2004 MapReduce methodology and file system to create Hadoop and the Hadoop Distributed File System. Microsoft beat that cluster by nearly a factor of three in terms of performance and, according to a blog post, did so using one-sixth the number of servers.

The Flat Datacenter Storage effort is headed up by Jeremy Elson, who works in the Distributed Systems and Networking Group at Microsoft Research. The MinuteSort run that Elson's team performed on 250 machines configured with 1,033 disk drives ripped through and sorted 1,401 gigabytes of data in 60 seconds, handily beating a Yahoo! Hadoop configuration from 2009 that had 1,406 nodes and 5,624 disks and could process 500GB in a minute.
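The "nearly a factor of three" and "one-sixth the number of servers" claims can be checked against these figures with a few lines of Python. The numbers come straight from the two runs described above; the variable names are just for illustration.

```python
# Figures quoted for the two MinuteSort runs.
fds = {"gb_sorted": 1401, "nodes": 250, "disks": 1033}
yahoo = {"gb_sorted": 500, "nodes": 1406, "disks": 5624}

# Raw throughput advantage: data sorted in the same 60 seconds.
speedup = fds["gb_sorted"] / yahoo["gb_sorted"]

# How many times more servers the Yahoo! cluster used.
node_ratio = yahoo["nodes"] / fds["nodes"]

# Per-server efficiency: GB sorted per node, FDS relative to Yahoo!.
per_node_gain = (fds["gb_sorted"] / fds["nodes"]) / (
    yahoo["gb_sorted"] / yahoo["nodes"]
)

print(f"speedup: {speedup:.2f}x")                # speedup: 2.80x
print(f"server ratio: {node_ratio:.2f}x")        # server ratio: 5.62x
print(f"per-server gain: {per_node_gain:.2f}x")  # per-server gain: 15.76x
```

So the run sorted about 2.8 times the data on roughly one-sixth (1/5.6) of the machines, which works out to a little under 16 times as much data sorted per server.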

It is not clear whether Yahoo! could have fielded a better result using a more modern version of the Apache Hadoop software and shiny new x86 iron. (By the way, this sort was on the "Daytona" version of the benchmark, which is based on using stock code, not the "Indy" version of the test, which allows for more exotic algorithms. A team at the University of California San Diego won the Indy MinuteSort race last year with a 52-node cluster of HP ProLiant DL360 G6 servers and a Cisco Systems Nexus 5096 switch. This TritonSort machine at UCSD was able to sort 1,353GB of data in 60 seconds.)

To get its speed, the Flat Datacenter Storage team grabbed another technology from Microsoft Research, called full bisection bandwidth networks. Specifically, each node in the cluster could transmit data at 2Gb/sec and receive data at 2Gb/sec without interruption. "That's 20 times as much bandwidth as most computers in data centers have today," Elson explained in the blog, "and harnessing it required novel techniques."

And the Daytona car beat the Indy car this time around: Microsoft's 1,401GB Daytona result tops the 1,353GB that UCSD managed in the Indy category.

Microsoft used an unnamed remote file system that was linked to that full bisection bandwidth network to feed data to all of the nodes in the cluster to run the MinuteSort test, which is the way such sorting benchmarks were done before the MapReduce method came along. MapReduce is great for certain kinds of data-munching, like when a set of data can fit inside a single server node. But, as Elson points out, what happens when you have two very large data sets that you want to merge and then chew on? How do you do that on MapReduce? The data has to move, and it has to move to somewhere that the systems doing the sort can access it at very high speeds.
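Why the data has to move is easy to see in a toy distributed sort. The sketch below uses range partitioning, a generic textbook technique and not Microsoft's undisclosed method: every key is shipped to the node that owns its key range, and each node then sorts what it received. The function name, the integer keys, and the splitter values are all hypothetical.

```python
import bisect


def range_partition_sort(node_data: list[list[int]],
                         splitters: list[int]) -> list[list[int]]:
    """Sort keys held on several nodes so node 0 ends up with the smallest.

    node_data[i] holds the unsorted keys stored on node i.
    splitters is a sorted list of len(node_data) - 1 range boundaries.
    """
    n = len(node_data)
    if len(splitters) != n - 1:
        raise ValueError("need exactly one splitter between adjacent nodes")

    inboxes: list[list[int]] = [[] for _ in range(n)]
    # Shuffle phase: almost every key crosses the network to its range owner.
    for keys in node_data:
        for key in keys:
            dest = bisect.bisect_right(splitters, key)
            inboxes[dest].append(key)

    # Local phase: each node sorts only the range it now owns.
    return [sorted(inbox) for inbox in inboxes]


nodes = [[42, 7, 99], [15, 63, 3], [88, 21, 50]]
result = range_partition_sort(nodes, splitters=[30, 60])
# result == [[3, 7, 15, 21], [42, 50], [63, 88, 99]]
```

In the shuffle phase nearly all of the data changes machines, which is why per-node network bandwidth, rather than local disk or CPU, tends to set the pace of a cluster-wide sort.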

The Microsoft Research team that developed the Flat Datacenter Storage algorithm is presenting its results at the 2012 SIGMOD/PODS Conference in Scottsdale, Arizona this week.

The research behind the new sorting method was sponsored in part by Microsoft's Bing team because it can be applied to search engine results as well as to gene sequencing and stitching together aerial photographs. The company is pretty keyed up that it can get a factor of 16 improvement in the efficiency of sorting per server using a remote (and unnamed) file system compared to Hadoop and its HDFS. ®
