Feeds

The Brangelina of Big Data: Cassandra mates with Hadoop

Open source celebrity supercouple

Securing Web Applications Made Simple and Scalable

Think of it as the Brangelina of Big Data.

DataStax, an open-source startup based in Northern California, has combined Cassandra, the distributed database developed at Facebook, with Hadoop, the epic-number-crunching platform based on Google's backend infrastructure. Known as "Brisk", this Big Data mashup was unveiled on Wednesday, with DataStax promising to open source the platform under an Apache licence within 45 days.

The idea is to offer a single platform that provides both a low-latency database for "realtime" web-scale applications and the sort of heavy data analysis you get with Hadoop. "One thing we're hearing from [enterprises] is that they need the complete Big Data picture, from realtime low-latency applications through to tools that analyze data – and the ability to use those tools to actually feed data back into applications," DataStax vice president of products Ben Werther tells The Register.

"We're giving them a platform that offers both low-latency and analytics in one."

The Hadoop project mimics GFS, Google's distributed file system, and MapReduce, the company's distributed number-crunching platform. In 2004, Google published a pair of research papers describing the two technologies, and they were soon in the hands of independent developer Doug Cutting, who used the descriptions to build a backend platform for Nutch, his open-source web crawler. He called the platform Hadoop after his son's yellow stuffed elephant, and it was open-sourced at Apache.

The platform is now used by many of the web's biggest names, from Facebook to Yahoo! to Twitter, and there's a growing market in the enterprise, with companies including Cloudera, another Silicon Valley startup, providing support and services.

Brangelina

Brisk

Hadoop is not a realtime technology. Web giants such as Facebook use in-house Hadoop clusters to crunch epic amounts of data that can later be applied to live web services. Before offloading its core search duties to Microsoft, for instance, Yahoo! used Hadoop to build its Search Webmap, a collection of all known webpages and the metadata needed to understand those pages. Once the webmap was built, it was plugged into the company's search engine, where it could be queried by net users.

At Facebook, engineers use Hadoop clusters to analyze site data behind the firewall. But the platform is also used to crunch data for use in public Facebook services such as "Insight", the site's traffic tracker, and "People you know", the tool that recommends "friends".

Facebook engineers and analysts tap the company's Hadoop clusters via a SQL-like query language known as Hive. This was built at Facebook during one of the company's now-famous hackathons, and it's designed for use by programming novices. "We realized that although Hadoop would meet our scale needs, it was very hard to use for the average engineer, much less than average analyst in the company," Facebook open-source man David Recordon once told us.

"That was the genius of Hive. We decided to do hive on top of Hadoop, so we could use Hadoop to solve our data scalability needs but at the same time provide a very easy interface."

Separate from its Hadoop work, Facebook built Cassandra, a distributed database also based on a piece of Google's backend. Google uses a proprietary distributed database known as BigTable that runs atop the Google File System (GFS) system, and it published a paper on the technology in 2006. In echo of the Hadoop project, Facebook leaned on the paper in building Cassandra.

But Cassandra isn't a pure BigTable mimic. Facebook applied BigTable's data model to the Dynamo distributed storage system developed by Amazon for its S3 storage service, part of the retailer's increasingly popular Web Services cloud. Cassandra's authors included Avinash Lakshman, who helped build Dynamo at Amazon.

Facebook used Cassandra to drive the inbox search tool on its social networking site/worldwide phenomenon. And in 2008, it open-sourced the database under an Apache license.

What DataStax has done is put Hive on top of Cassandra, marrying one Facebook open source technology to another. Brisk includes both Hadoop MapReduce and Hive, letting you run epic-number-crunching jobs across commodity-hardware clusters. But it swaps out the Hadoop HDFS file system in favor of a compatible storage layer powered by Cassandra. And at the same time, you can use Cassandra as it was intended: as a database for realtime applications.

The Essential Guide to IT Transformation

More from The Register

next story
Manic malware Mayhem spreads through Linux, FreeBSD web servers
And how Google could cripple infection rate in a second
EU's top data cops to meet Google, Microsoft et al over 'right to be forgotten'
Plan to hammer out 'coherent' guidelines. Good luck chaps!
US judge: YES, cops or feds so can slurp an ENTIRE Gmail account
Crooks don't have folders labelled 'drug records', opines NY beak
FLAPE – the next BIG THING in storage
Find cold data with flash, transmit it from tape
Seagate chances ARM with NAS boxes for the SOHO crowd
There's an Atom-powered offering, too
Gartner: To the right, to the right – biz sync firms who've won in a box to the right...
Magic quadrant: Top marks for, er, completeness of vision, EMC
prev story

Whitepapers

Top three mobile application threats
Prevent sensitive data leakage over insecure channels or stolen mobile devices.
The Essential Guide to IT Transformation
ServiceNow discusses three IT transformations that can help CIO's automate IT services to transform IT and the enterprise.
Mobile application security vulnerability report
The alarming realities regarding the sheer number of applications vulnerable to attack, and the most common and easily addressable vulnerability errors.
How modern custom applications can spur business growth
Learn how to create, deploy and manage custom applications without consuming or expanding the need for scarce, expensive IT resources.
Consolidation: the foundation for IT and business transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.