The Brangelina of Big Data: Cassandra mates with Hadoop
Open source celebrity supercouple
Think of it as the Brangelina of Big Data.
DataStax, an open-source startup based in Northern California, has combined Cassandra, the distributed database developed at Facebook, with Hadoop, the epic-number-crunching platform based on Google's backend infrastructure. Known as "Brisk", this Big Data mashup was unveiled on Wednesday, with DataStax promising to open source the platform under an Apache license within 45 days.
The idea is to offer a single platform that provides both a low-latency database for "realtime" web-scale applications and the sort of heavy data analysis you get with Hadoop. "One thing we're hearing from [enterprises] is that they need the complete Big Data picture, from realtime low-latency applications through to tools that analyze data – and the ability to use those tools to actually feed data back into applications," DataStax vice president of products Ben Werther tells The Register.
"We're giving them a platform that offers both low-latency and analytics in one."
The Hadoop project mimics GFS, Google's distributed file system, and MapReduce, the company's distributed number-crunching platform. In 2003 and 2004, Google published research papers describing the two technologies, and they were soon in the hands of independent developer Doug Cutting, who used the descriptions to build a backend platform for Nutch, his open-source web crawler. He called the platform Hadoop after his son's yellow stuffed elephant, and it was open-sourced at Apache.
The platform is now used by many of the web's biggest names, from Facebook to Yahoo! to Twitter, and there's a growing market in the enterprise, with companies including Cloudera, another Silicon Valley startup, providing support and services.
Hadoop is not a realtime technology. Web giants such as Facebook use in-house Hadoop clusters to crunch epic amounts of data that can later be applied to live web services. Before offloading its core search duties to Microsoft, for instance, Yahoo! used Hadoop to build its Search Webmap, a collection of all known webpages and the metadata needed to understand those pages. Once the webmap was built, it was plugged into the company's search engine, where it could be queried by net users.
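The batch-then-serve pattern described above can be sketched in a few lines of Python. This is only a toy, single-process illustration of the map, shuffle, and reduce steps: it does not use the Hadoop API, it is not how Yahoo! built its webmap, and the page data and function names (`map_phase`, `reduce_phase`, `run_batch`, `lookup`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy corpus standing in for crawled pages (not real data).
PAGES = {
    "a.example": "hadoop crunches big data",
    "b.example": "cassandra serves data fast",
}

def map_phase(url, text):
    """Emit (word, url) pairs for one page, as a mapper would."""
    for word in text.split():
        yield word, url

def reduce_phase(word, urls):
    """Collapse all URLs seen for one word into a sorted, de-duplicated list."""
    return word, sorted(set(urls))

def run_batch(pages):
    """Run map, shuffle (group by key), then reduce, all in one process."""
    grouped = defaultdict(list)
    for url, text in pages.items():
        for word, page_url in map_phase(url, text):
            grouped[word].append(page_url)
    return dict(reduce_phase(word, urls) for word, urls in grouped.items())

# Batch step: the expensive part, run offline ahead of time.
INDEX = run_batch(PAGES)

def lookup(word):
    """Serving step: a cheap read against the precomputed result."""
    return INDEX.get(word, [])
```

The point of the split is that `run_batch` can take as long as it needs, while `lookup` only reads a result that already exists, which is roughly the relationship between a Hadoop job and the live service it feeds.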
At Facebook, engineers use Hadoop clusters to analyze site data behind the firewall. But the platform is also used to crunch data for use in public Facebook services such as "Insights", the site's traffic tracker, and "People You May Know", the tool that recommends "friends".
Facebook engineers and analysts tap the company's Hadoop clusters via a SQL-like query language known as Hive. This was built at Facebook during one of the company's now-famous hackathons, and it's designed for use by programming novices. "We realized that although Hadoop would meet our scale needs, it was very hard to use for the average engineer, much less the average analyst in the company," Facebook open-source man David Recordon once told us.
"That was the genius of Hive. We decided to do Hive on top of Hadoop, so we could use Hadoop to solve our data scalability needs but at the same time provide a very easy interface."
Separate from its Hadoop work, Facebook built Cassandra, a distributed database also based on a piece of Google's backend. Google uses a proprietary distributed database known as BigTable that runs atop the Google File System (GFS), and it published a paper on the technology in 2006. In an echo of the Hadoop project, Facebook leaned on the paper in building Cassandra.
But Cassandra isn't a pure BigTable mimic. Facebook applied BigTable's data model to the Dynamo distributed storage system developed by Amazon for its S3 storage service, part of the retailer's increasingly popular Web Services cloud. Cassandra's authors included Avinash Lakshman, who helped build Dynamo at Amazon.
Facebook used Cassandra to drive the inbox search tool on its social networking site/worldwide phenomenon. And in 2008, it open-sourced the database under an Apache license.
What DataStax has done is put Hive on top of Cassandra, marrying one Facebook open source technology to another. Brisk includes both Hadoop MapReduce and Hive, letting you run epic-number-crunching jobs across commodity-hardware clusters. But it swaps out the Hadoop Distributed File System (HDFS) in favor of a compatible storage layer powered by Cassandra. And at the same time, you can use Cassandra as it was intended: as a database for realtime applications.