Microsoft's FDS data-sorter crushes Hadoop

Getting Bing to sit up and bark

Techies at Microsoft Research, the big brain arm of the software goliath, have taken the crown in the sorting benchmark world. The researchers are thinking about how to implement new sorting algorithms in the Bing search engine to give Microsoft a leg up on the MapReduce algorithms that underpin Google's search engine and other big data-munching applications.

In the data world, the benchmarks against which you measure your performance are collectively known as the Sort Benchmarks. The two original benchmarks, proposed in 1994 by techies at Digital Equipment led by Jim Gray, who administered the Sort Benchmarks until he was lost at sea in 2007, are called MinuteSort and PennySort, and they are still used today. (Gray originally worked on the VAX and AlphaServer lines at DEC, but eventually moved to Microsoft.)

The MinuteSort test counts how many bytes of data you can sort using the benchmark code in 60 seconds, and the PennySort test measures the amount of data you can sort for a penny's worth of system time on a machine or cluster set up to run the test. There used to be other tests, such as the TeraByte Sort, which measured the amount of time it took to sort 1TB of data, but as servers, storage, and networks have progressed, the MinuteSort has become the new TeraByte Sort. Or at least now it has, with the Microsoft team breaking through the terabyte barrier with their Flat Datacenter Storage scheme.
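To make the MinuteSort metric concrete, here is a minimal Python sketch of the idea: keep growing the input until a sort no longer fits in the time budget, and report the largest number of bytes that did. It is not the benchmark's real harness. The 100-byte record and 10-byte key sizes are assumed here as the customary Sort Benchmark layout, and the function names, the doubling strategy, and the record cap are all invented for illustration.

```python
import os
import time

RECORD_SIZE = 100  # assumed: Sort Benchmark-style 100-byte records
KEY_SIZE = 10      # assumed: the first 10 bytes of each record are the key


def make_records(count):
    """Generate `count` random fixed-size records (hypothetical helper)."""
    return [os.urandom(RECORD_SIZE) for _ in range(count)]


def sort_records(records):
    """Sort records by their leading key bytes."""
    return sorted(records, key=lambda rec: rec[:KEY_SIZE])


def bytes_sorted_within(budget_seconds, batch_records=10_000,
                        max_records=2_000_000):
    """Double the input size until a sort no longer fits in the budget.

    Returns the largest number of bytes sorted within `budget_seconds`.
    `max_records` is a safety cap so the toy never exhausts memory.
    """
    best = 0
    count = batch_records
    while count <= max_records:
        records = make_records(count)
        start = time.perf_counter()
        sort_records(records)
        elapsed = time.perf_counter() - start
        if elapsed > budget_seconds:
            break
        best = count * RECORD_SIZE
        count *= 2
    return best
```

Calling `bytes_sorted_within(60)` on one machine would give a single-node figure in the spirit of MinuteSort; the real test runs across a whole cluster and includes reading the input from disk and writing the output back.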

Microsoft has just stomped the living daylights out of a Hadoop cluster that was the previous record-holder on the MinuteSort test, and did so by substantially beefing up the network capacity between server nodes and storage and essentially chucking the whole MapReduce approach to data munching out the window.

With Hadoop and its MapReduce approach, you have data scattered and replicated around the cluster for both availability and algorithmic reasons, and you dispatch the computing to the server nodes that hold the data you need to process, instead of trying to move data from a particular piece of storage to a server. This approach is what allows search engines like those developed by Google, Yahoo!, and Microsoft (the latter two used to be distinct) to mine massive amounts of clickstream data to serve you the most appropriate web pages and advertisements. But Hadoop scales to only about 4,000 nodes, and it is a batch-oriented program, not something that looks and feels like real time.
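A toy, single-process model of a MapReduce-style sort may help picture this. The sketch below is not Hadoop's API: the function names, the integer keys in the range 0 to 255, and the four "nodes" are all assumptions made for illustration. The map phase routes each record to a node by key range, each node then sorts only its own share, and concatenating the shares in node order gives a globally sorted result.

```python
from collections import defaultdict


def map_phase(records, num_nodes, key_space=256):
    """Route each (key, value) record to a node by key range, so that
    node i only ever holds keys smaller than those on node i + 1."""
    buckets = defaultdict(list)
    for key, value in records:
        node = key * num_nodes // key_space
        buckets[node].append((key, value))
    return buckets


def reduce_phase(buckets, num_nodes):
    """Each node sorts its own bucket; concatenating the buckets in
    node order yields a globally sorted list."""
    output = []
    for node in range(num_nodes):
        output.extend(sorted(buckets.get(node, []), key=lambda kv: kv[0]))
    return output


def toy_mapreduce_sort(records, num_nodes=4):
    """Run both phases in one process (a real cluster runs them on
    separate machines and shuffles data between them)."""
    return reduce_phase(map_phase(records, num_nodes), num_nodes)
```

For example, `toy_mapreduce_sort([(200, "d"), (3, "a"), (130, "c"), (64, "b")])` returns the four records in key order. In a real cluster the step between the two phases is a network shuffle, which is where bandwidth between nodes starts to matter.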

Microsoft is not releasing the details of its Flat Datacenter Storage approach yet, and it may never do so because the approach gives the company a competitive advantage. But the company did provide some clues as to how it was able to beat a Hadoop cluster configured by Yahoo!, which cloned Google's 2004 MapReduce methodology and file system to create Hadoop and the Hadoop Distributed File System. Microsoft beat that cluster by nearly a factor of three in terms of performance and, according to a blog post, did so using one-sixth the number of servers.

The Flat Datacenter Storage effort is headed up by Jeremy Elson, who works in the Distributed Systems and Networking Group at Microsoft Research. Elson's team ran MinuteSort on 250 machines configured with 1,033 disk drives and was able to rip through and sort 1,401 gigabytes of data in 60 seconds, handily beating a Yahoo! Hadoop configuration from 2009 that had 1,406 nodes and 5,624 disks and could process 500GB in a minute.
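As a rough check on those figures, the short Python calculation below works out the ratios between the two runs. It uses only the numbers quoted in this article; the dictionary layout and variable names are just for illustration.

```python
# Figures as quoted in this article.
fds = {"servers": 250, "disks": 1033, "gb_sorted": 1401}
hadoop = {"servers": 1406, "disks": 5624, "gb_sorted": 500}


def gb_per_server(run):
    """Gigabytes sorted per server in the 60-second window."""
    return run["gb_sorted"] / run["servers"]


# Raw throughput: about 2.8, i.e. "nearly a factor of three".
speedup = fds["gb_sorted"] / hadoop["gb_sorted"]

# Server count: about 5.6, i.e. roughly one-sixth the servers.
server_ratio = hadoop["servers"] / fds["servers"]

# Per-server efficiency: about 15.8, i.e. roughly 16 times.
per_server_gain = gb_per_server(fds) / gb_per_server(hadoop)

print(f"speedup: {speedup:.1f}x")
print(f"server ratio: {server_ratio:.1f}x")
print(f"per-server gain: {per_server_gain:.1f}x")
```

The per-server gain is simply the product of the first two ratios, which is where Microsoft's factor of 16 improvement per server comes from.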

It is not clear whether Yahoo! could have fielded a better result using a more modern version of the Apache Hadoop software and shiny new x86 iron. (By the way, this sort was on the "Daytona" version of the benchmark, which is based on using stock code, not the "Indy" version of the test, which allows for more exotic algorithms. A team at the University of California San Diego won the Indy MinuteSort race last year with a 52 node cluster of HP ProLiant DL360 G6 servers and a Cisco Systems Nexus 5096 switch. This TritonSort machine at UCSD was able to sort 1,353GB of data in 60 seconds.)

To get its speed, the Flat Datacenter Storage team grabbed another technology from Microsoft Research, called full bisection bandwidth networks. Specifically, each node in the cluster could transmit data at 2GB/sec and receive data at 2GB/sec without interruption. "That’s 20 times as much bandwidth as most computers in data centers have today," Elson explained in the blog, "and harnessing it required novel techniques".

And with 1,401GB sorted against TritonSort's 1,353GB, the Daytona car beat the Indy car this time around.

Microsoft used an unnamed remote file system that was linked to that full bisection bandwidth network to feed data to all of the nodes in the cluster to run the MinuteSort test, which is the way such sorting benchmarks were done before the MapReduce method came along. MapReduce is great for certain kinds of data-munching, like when a set of data can fit inside of a single server node. But, as Elson points out, what happens when you have two very large data sets that you want to merge and then chew on? How do you do that on MapReduce? The data has to move, and it has to move to somewhere that the systems doing the sort can get access to it at very high speeds.
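The merge problem can be pictured with a small Python sketch. Microsoft has not published how Flat Datacenter Storage does this, so what follows is only a generic illustration: `read_sorted_stream` is a hypothetical stand-in for pulling an already-sorted data set from remote storage, and the standard library's `heapq.merge` combines the two streams lazily.

```python
import heapq


def read_sorted_stream(dataset):
    """Stand-in for streaming a sorted data set from remote storage
    (hypothetical; a real system would fetch blocks over the network)."""
    for record in dataset:
        yield record


def merge_datasets(left, right):
    """Lazily merge two already-sorted streams into one sorted stream,
    without holding either input fully in memory."""
    return heapq.merge(read_sorted_stream(left), read_sorted_stream(right))
```

For example, `list(merge_datasets([1, 4, 9], [2, 3, 10]))` gives `[1, 2, 3, 4, 9, 10]`. Every record from both inputs has to reach the machine doing the merge, so the merge can go no faster than the network feeding it.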

The Microsoft Research team that developed the Flat Datacenter Storage scheme is presenting its results at the 2012 SIGMOD/PODS Conference in Scottsdale, Arizona this week.

The research behind the new sorting method was sponsored in part by Microsoft's Bing team because it can be applied to search engine results as well as to gene sequencing and stitching together aerial photographs. The company is pretty keyed up that it can get a factor of 16 improvement in the efficiency of sorting per server using a remote (and unnamed) file system compared to Hadoop and its HDFS. ®
