Feeds

Google open sources MapReduce compression

In the name of speed

Business security measures using SSL

Google has open sourced the compression library used across its backend infrastructure, including MapReduce, its distributed number-crunching platform, and BigTable, its distributed database.

Available at Google Code under an Apache 2.0 license, the library is called Snappy, but Google says this is the same library that was previously referred to as Zippy in some public presentations. As the names imply, the library's primary aim is speed. "It does not aim for maximum compression, or compatibility with any other compression library," Google says. "Instead, it aims for very high speeds and reasonable compression."

Compared to the fastest mode of the popular zlib compression library, Google says, the C++-based Snappy is an order of magnitude faster in most cases (roughly ten times faster), but the compressed files are between 20 and 100 per cent larger. Running in 64-bit mode on a single core of a 2.26Ghz "Westmere" Intel Core i7 processor, according to the company, Snappy compresses at roughly 250MB/sec and decompresses at 500MB/sec.

Google says that the typical compression ratios are about 1.5x to 1.7x for plain text and about 2x to 4x for HTML. zlib in its fastest mode gives you 2.6x to 2.8x for plain text and 3x to 7x for HTML. " So if you want to save space, or want to compress once and decompress lots of times, use zlib (or bzip2, or…). But if you just want to cut down on your I/O, be it network or disk I/O, Snappy might be for you," says Google engineer Steinar Gunderson.

According to Gunderson, Snappy removes the "entropy reduction" step that characterizes zlib and other LZ-style compression libraries. "Most LZ-style compressors (including zlib) consist of two parts: A matching algorithm (recognizing repetitions from data earlier in the stream, as well as things like 'abcabcabcabc') and then an entropy reduction step (almost invariably Huffman or some version of arithmetic encoding)," he says. "Snappy skips the entropy reduction and instead uses a fixed, hand-tuned packing format."

This format, Gunderson says, affords "much less" CPU usage, and he says that Google has spent years fine tuning it. Virtually all of Google's online service run atop a uniform distributed infrastructure based on the proprietary Google File System (GFS), MapReduce, BigTable, and other platforms. This have been mimicked in the open source world by the Apache Hadoop project. ®

Choosing a cloud hosting partner with confidence

More from The Register

next story
'Windows 9' LEAK: Microsoft's playing catchup with Linux
Multiple desktops and live tiles in restored Start button star in new vids
Not appy with your Chromebook? Well now it can run Android apps
Google offers beta of tricky OS-inside-OS tech
New 'Cosmos' browser surfs the net by TXT alone
No data plan? No WiFi? No worries ... except sluggish download speed
iOS 8 release: WebGL now runs everywhere. Hurrah for 3D graphics!
HTML 5's pretty neat ... when your browser supports it
Greater dev access to iOS 8 will put us AT RISK from HACKERS
Knocking holes in Apple's walled garden could backfire, says securo-chap
NHS grows a NoSQL backbone and rips out its Oracle Spine
Open source? In the government? Ha ha! What, wait ...?
Google extends app refund window to two hours
You now have 120 minutes to finish that game instead of 15
Intel: Hey, enterprises, drop everything and DO HADOOP
Big Data analytics projected to run on more servers than any other app
SUSE Linux owner Attachmate gobbled by Micro Focus for $2.3bn
Merger will lead to mainframe and COBOL powerhouse
prev story

Whitepapers

Providing a secure and efficient Helpdesk
A single remote control platform for user support is be key to providing an efficient helpdesk. Retain full control over the way in which screen and keystroke data is transmitted.
WIN a very cool portable ZX Spectrum
Win a one-off portable Spectrum built by legendary hardware hacker Ben Heck
Storage capacity and performance optimization at Mizuno USA
Mizuno USA turn to Tegile storage technology to solve both their SAN and backup issues.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?
Security and trust: The backbone of doing business over the internet
Explores the current state of website security and the contributions Symantec is making to help organizations protect critical data and build trust with customers.