Microsoft's FDS data-sorter crushes Hadoop

Getting Bing to sit up and bark

Techies at Microsoft Research, the big brain arm of the software goliath, have taken the crown in the sorting benchmark world. The researchers are thinking about how to implement new sorting algorithms in the Bing search engine to give Microsoft a leg up on the MapReduce algorithms that underpin Google's search engine and other big data-munching applications.

In the data world, the benchmarks against which you measure your performance are collectively known as the Sort Benchmarks. The two original benchmarks, proposed in 1994 by techies at Digital Equipment – led by Jim Gray, who administered the Sort Benchmarks until he was lost at sea in 2007 – were MinuteSort and PennySort, and they are still used today. (Gray originally worked on the VAX and AlphaServer lines at DEC, but eventually moved to Microsoft.)

The MinuteSort test counts how many bytes of data you can sort using the benchmark code in 60 seconds, and the PennySort test measures the amount of data you can sort for a penny's worth of system time on a machine or cluster set up to run the test. There used to be other tests, such as the TeraByte Sort, which measured the time it took to sort through 1TB of data, but as servers, storage, and networks have progressed, the MinuteSort has become the new TeraByte Sort. Or at least it has now, with the Microsoft team breaking through the terabyte barrier with its Flat Datacenter Storage scheme.
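The MinuteSort metric is simple enough to sketch in miniature. Here is a toy single-machine illustration (not the official Sort Benchmark harness, and the record count is just a made-up number for the demo) that sorts fixed-size records and reports a bytes-per-minute rate:

```python
import os
import time

# Toy single-machine illustration of the MinuteSort idea: sort a pile of
# fixed-size records and report bytes sorted per minute. The real
# benchmark uses 100-byte records with 10-byte keys; this sketch just
# borrows that record layout.
RECORD_SIZE = 100
KEY_SIZE = 10
NUM_RECORDS = 200_000  # arbitrary toy size, not a benchmark parameter

records = [os.urandom(RECORD_SIZE) for _ in range(NUM_RECORDS)]

start = time.perf_counter()
records.sort(key=lambda r: r[:KEY_SIZE])  # sort on the 10-byte key prefix
elapsed = time.perf_counter() - start

bytes_sorted = RECORD_SIZE * NUM_RECORDS
print(f"sorted {bytes_sorted} bytes in {elapsed:.3f}s")
print(f"MinuteSort-style rate: {bytes_sorted * 60 / elapsed / 1e9:.2f} GB/min")
```

A laptop running this will do a few GB/min in memory; the benchmark's hard part is doing it across hundreds of nodes and disks at once, which is exactly where the network and storage layout come in.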

Microsoft has just stomped the living daylights out of a Hadoop cluster that was the previous record-holder on the MinuteSort test, and did so by substantially beefing up the network capacity between server nodes and storage and essentially chucking the whole MapReduce approach to data munching out the window.

With Hadoop and its MapReduce approach, you have data scattered and replicated around the cluster for both availability and algorithmic reasons, and you dispatch the computing to the server nodes holding the data you need to process – instead of trying to move data from a particular piece of storage to a server. This approach is what allows search engines like those developed by Google, Yahoo!, and Microsoft (which used to be distinct) to mine massive amounts of clickstream data to serve you the most appropriate web pages and advertisements. But Hadoop scales to only about 4,000 nodes, and it is a batch-oriented program, not something that looks and feels like real time.
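The map/shuffle/reduce pattern Hadoop implements can be sketched as a single-process toy (the function names here are mine, not Hadoop's API; in the real thing, the map tasks run on whichever nodes already hold the data blocks):

```python
from collections import defaultdict

# Minimal in-process sketch of the MapReduce pattern: map emits
# key/value pairs, a shuffle groups them by key, and reduce folds
# each group down to a result.

def map_phase(lines):
    # The classic word-count map: one (word, 1) pair per occurrence.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle stage would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Fold each key's values; for word count, that is just a sum.
    return {key: sum(values) for key, values in groups.items()}

data = ["the quick fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(data)))
print(counts)  # {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The pattern works beautifully when each map task's slice of data sits on the node running it; the trouble starts, as the FDS team argues below, when the data has to move.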

Microsoft is not releasing the details of its Flat Datacenter Storage approach yet, and it may never do so, because the approach gives the company a competitive advantage. But the company did provide some clues as to how it was able to beat a Hadoop cluster configured by Yahoo! – which cloned Google's 2004 MapReduce methodology and file system to create Hadoop and the Hadoop Distributed File System – by nearly a factor of three in performance and, according to a blog post, using one-sixth the number of servers.

The Flat Datacenter Storage effort is headed up by Jeremy Elson, who works in the Distributed Systems and Networking Group at Microsoft Research. Elson's team ran the MinuteSort test on 250 machines configured with 1,033 disk drives, ripping through and sorting 1,401 gigabytes of data in 60 seconds – handily beating a 2009 Yahoo! Hadoop configuration with 1,406 nodes and 5,624 disks that could process 500GB in a minute.

It is not clear that Yahoo! could have fielded a better result using a more modern version of the Apache Hadoop software and shiny new x86 iron. (By the way, this sort was on the "Daytona" version of the benchmark, which is based on using stock code, not the "Indy" version of the test, which allows for more exotic algorithms. A team at the University of California San Diego won the Indy MinuteSort race last year with a 52-node cluster of HP ProLiant DL360 G6 servers and a Cisco Systems Nexus 5096 switch. This TritonSort machine at UCSD (PDF) was able to sort 1,353GB of data in 60 seconds.)

To get its speed, the Flat Datacenter Storage team grabbed another technology from Microsoft Research, called full bisection bandwidth networks, over which each node in the cluster could transmit data at 2Gb/sec and receive data at 2Gb/sec without interruption. "That’s 20 times as much bandwidth as most computers in data centers have today," Elson explained in the blog, "and harnessing it required novel techniques".

And the Daytona car beat the Indy car this time around.

Microsoft used an unnamed remote file system that was linked to that full bisection bandwidth network to feed data to all of the nodes in the cluster to run the MinuteSort test, which is the way such sorting benchmarks were done before the MapReduce method came along. MapReduce is great for certain kinds of data-munching, such as when a set of data can fit inside a single server node. But, as Elson points out, what happens when you have two very large data sets that you want to merge and then chew on? How do you do that with MapReduce? The data has to move, and it has to move to somewhere that the systems doing the sort can access at very high speed.
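The merge problem Elson raises is a classic streaming one: if each big input arrives as a sorted stream, a k-way merge can chew through them without any node holding a whole dataset in memory. FDS's contribution is feeding such streams over a full-bisection network; this toy sketch (function name mine) shows only the merge step, using Python's standard-library `heapq.merge`:

```python
import heapq

def merge_sorted_streams(*streams):
    # heapq.merge consumes its inputs lazily, pulling one record at a
    # time from each stream, so memory use stays flat no matter how
    # large the inputs are.
    return heapq.merge(*streams)

# Stand-ins for two sorted record streams arriving from remote storage.
dataset_a = iter([1, 4, 7, 10])
dataset_b = iter([2, 3, 8, 9])

merged = list(merge_sorted_streams(dataset_a, dataset_b))
print(merged)  # [1, 2, 3, 4, 7, 8, 9, 10]
```

The catch, and the point of the article, is that those streams have to be delivered to the merging nodes fast enough to keep the disks and CPUs busy – which is what the 2Gb/sec-per-node network is for.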

The Microsoft Research team that developed the Flat Datacenter Storage algorithm is presenting its results at the 2012 SIGMOD/PODS Conference in Scottsdale, Arizona this week.

The research behind the new sorting method was sponsored in part by Microsoft's Bing team because it can be applied to search engine results as well as to gene sequencing and stitching together aerial photographs. The company is pretty keyed up that it can get a factor of 16 improvement in the efficiency of sorting per server using a remote (and unnamed) file system compared to Hadoop and its HDFS. ®
