Google open sources MapReduce compression

In the name of speed

Google has open sourced the compression library used across its backend infrastructure, including MapReduce, its distributed number-crunching platform, and BigTable, its distributed database.

Available at Google Code under an Apache 2.0 license, the library is called Snappy, but Google says this is the same library that was previously referred to as Zippy in some public presentations. As the names imply, the library's primary aim is speed. "It does not aim for maximum compression, or compatibility with any other compression library," Google says. "Instead, it aims for very high speeds and reasonable compression."

Compared to the fastest mode of the popular zlib compression library, Google says, the C++-based Snappy is an order of magnitude faster in most cases, but its compressed files are between 20 and 100 per cent larger. Running in 64-bit mode on a single core of a 2.26GHz "Westmere" Intel Core i7 processor, according to the company, Snappy compresses at roughly 250MB/sec and decompresses at 500MB/sec.
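Figures like these depend heavily on the data and the hardware, so they are worth checking against your own workload. The following Python sketch times the zlib side of the comparison only, using the standard library's `zlib` module at level 1 (its fastest mode). The `measure_zlib` helper and the sample text are illustrative assumptions rather than anything from Google's benchmark, and the numbers it prints will vary from machine to machine.

```python
import time
import zlib


def measure_zlib(data: bytes, level: int = 1) -> dict:
    """Compress and decompress `data` once; report ratio and throughput."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = zlib.decompress(compressed)
    decompress_seconds = time.perf_counter() - start

    if restored != data:
        raise ValueError("round trip failed")

    megabytes = len(data) / 1e6
    return {
        # Ratio is original size over compressed size, e.g. 2.6 means "2.6x".
        "ratio": len(data) / len(compressed),
        "compress_mb_per_s": megabytes / max(compress_seconds, 1e-9),
        "decompress_mb_per_s": megabytes / max(decompress_seconds, 1e-9),
    }


if __name__ == "__main__":
    # Hypothetical sample: highly repetitive HTML, so the ratio will be
    # far better than anything you would see on real pages.
    sample = b"<p>The quick brown fox jumps over the lazy dog.</p>\n" * 200_000
    for name, value in measure_zlib(sample, level=1).items():
        print(f"{name}: {value:.1f}")
```

A single timed run is noisy; for anything more than a rough impression, repeat the measurement and feed it data that resembles what you actually store or send.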

Google says that Snappy's typical compression ratios are about 1.5x to 1.7x for plain text and about 2x to 4x for HTML, whereas zlib in its fastest mode gives you 2.6x to 2.8x for plain text and 3x to 7x for HTML. "So if you want to save space, or want to compress once and decompress lots of times, use zlib (or bzip2, or…). But if you just want to cut down on your I/O, be it network or disk I/O, Snappy might be for you," says Google engineer Steinar Gunderson.
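Gunderson's point about I/O can be made concrete with a back-of-the-envelope model. The Python sketch below assumes that compression and transmission overlap, so the slower of the two stages sets the pace, and it ignores decompression at the receiving end. The 100MB/sec link is hypothetical, and the 25MB/sec figure for zlib is only an assumption derived from the "order of magnitude" claim; the ratios are the low end of the plain-text ranges quoted above.

```python
def effective_throughput(compress_mb_s: float, ratio: float, link_mb_s: float) -> float:
    """Uncompressed MB/sec moved through a compress-then-send pipeline.

    The link carries link_mb_s of compressed data, which corresponds to
    link_mb_s * ratio of original data. The compressor can produce at most
    compress_mb_s of original data. The slower stage is the bottleneck.
    """
    return min(compress_mb_s, link_mb_s * ratio)


LINK_MB_S = 100.0  # hypothetical network or disk bandwidth

# Snappy: 250MB/sec compression (from Google's figures), 1.5x on plain text.
snappy = effective_throughput(250.0, 1.5, LINK_MB_S)

# zlib fastest mode: assumed 25MB/sec (one tenth of Snappy), 2.6x on plain text.
zlib_fast = effective_throughput(25.0, 2.6, LINK_MB_S)

print(f"uncompressed: {LINK_MB_S} MB/sec")  # uncompressed: 100.0 MB/sec
print(f"snappy:       {snappy} MB/sec")     # snappy:       150.0 MB/sec
print(f"zlib fast:    {zlib_fast} MB/sec")  # zlib fast:    25.0 MB/sec
```

Under these assumptions Snappy is limited by the link and still beats sending the data raw, while zlib is limited by the CPU and ends up slower than not compressing at all. On a much slower link the balance tips back toward zlib's better ratio.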

According to Gunderson, Snappy removes the "entropy reduction" step that characterizes zlib and other LZ-style compression libraries. "Most LZ-style compressors (including zlib) consist of two parts: A matching algorithm (recognizing repetitions from data earlier in the stream, as well as things like 'abcabcabcabc') and then an entropy reduction step (almost invariably Huffman or some version of arithmetic encoding)," he says. "Snappy skips the entropy reduction and instead uses a fixed, hand-tuned packing format."
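To illustrate the distinction Gunderson draws, here is a toy Python compressor that performs only the matching step and writes its output in a fixed byte-oriented layout, with no Huffman or arithmetic coding stage. It is a sketch of the general idea only: the tag layout, the constants, and the function names are invented for this example and are not Snappy's actual format.

```python
MIN_MATCH = 4                    # shortest repetition worth encoding
MAX_MATCH = MIN_MATCH + 127      # copy length must fit in 7 bits
MAX_LITERALS = 128               # literal run length must fit in 7 bits
MAX_OFFSET = 0xFFFF              # copy offset is stored in 2 bytes


def _flush_literals(out: bytearray, literals: bytes) -> None:
    """Emit unmatched bytes as runs: tag 0..127 means (tag + 1) raw bytes."""
    for start in range(0, len(literals), MAX_LITERALS):
        chunk = literals[start:start + MAX_LITERALS]
        out.append(len(chunk) - 1)
        out += chunk


def toy_compress(data: bytes) -> bytes:
    out = bytearray()
    last_seen = {}               # 4-byte sequence -> most recent position
    pos = 0
    literal_start = 0
    while pos + MIN_MATCH <= len(data):
        key = data[pos:pos + MIN_MATCH]
        candidate = last_seen.get(key)
        last_seen[key] = pos
        if candidate is not None and pos - candidate <= MAX_OFFSET:
            # Extend the match greedily; it may overlap the current position.
            length = MIN_MATCH
            while (length < MAX_MATCH and pos + length < len(data)
                   and data[candidate + length] == data[pos + length]):
                length += 1
            _flush_literals(out, data[literal_start:pos])
            # Copy token: tag 128..255 holds the length, then a 2-byte offset.
            out.append(0x80 | (length - MIN_MATCH))
            out += (pos - candidate).to_bytes(2, "little")
            pos += length
            literal_start = pos
        else:
            pos += 1
    _flush_literals(out, data[literal_start:])
    return bytes(out)


def toy_decompress(blob: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(blob):
        tag = blob[i]
        i += 1
        if tag < 0x80:           # literal run
            run = tag + 1
            out += blob[i:i + run]
            i += run
        else:                    # copy from earlier output
            length = (tag & 0x7F) + MIN_MATCH
            offset = int.from_bytes(blob[i:i + 2], "little")
            i += 2
            for _ in range(length):   # byte by byte, so overlaps work
                out.append(out[-offset])
    return bytes(out)


packed = toy_compress(b"abcabcabcabc")
assert toy_decompress(packed) == b"abcabcabcabc"
```

Because every token is a whole number of bytes, the decoder needs no bit-level state or code tables, which is the kind of saving in CPU work that skipping entropy coding buys. The cost is that frequent symbols are not given shorter codes, so the output is larger than what zlib would produce from the same matches.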

This format, Gunderson says, affords "much less" CPU usage, and he says that Google has spent years fine-tuning it. Virtually all of Google's online services run atop a uniform distributed infrastructure based on the proprietary Google File System (GFS), MapReduce, BigTable, and other platforms. These have been mimicked in the open source world by the Apache Hadoop project. ®
