
The Brangelina of Big Data: Cassandra mates with Hadoop

Open source celebrity supercouple

Think of it as the Brangelina of Big Data.

DataStax, an open-source startup based in Northern California, has combined Cassandra, the distributed database developed at Facebook, with Hadoop, the epic-number-crunching platform based on Google's backend infrastructure. Known as "Brisk", this Big Data mashup was unveiled on Wednesday, with DataStax promising to open source the platform under an Apache licence within 45 days.

The idea is to offer a single platform that provides both a low-latency database for "realtime" web-scale applications and the sort of heavy data analysis you get with Hadoop. "One thing we're hearing from [enterprises] is that they need the complete Big Data picture, from realtime low-latency applications through to tools that analyze data – and the ability to use those tools to actually feed data back into applications," DataStax vice president of products Ben Werther tells The Register.

"We're giving them a platform that offers both low-latency and analytics in one."

The Hadoop project mimics GFS, Google's distributed file system, and MapReduce, the company's distributed number-crunching platform. Google published research papers describing the two technologies in 2003 and 2004, and they were soon in the hands of independent developer Doug Cutting, who used the descriptions to build a backend platform for Nutch, his open-source web crawler. He called the platform Hadoop after his son's yellow stuffed elephant, and it was open-sourced at Apache.

The platform is now used by many of the web's biggest names, from Facebook to Yahoo! to Twitter, and there's a growing market in the enterprise, with companies including Cloudera, another Silicon Valley startup, providing support and services.

Hadoop is not a realtime technology. Web giants such as Facebook use in-house Hadoop clusters to crunch epic amounts of data that can later be applied to live web services. Before offloading its core search duties to Microsoft, for instance, Yahoo! used Hadoop to build its Search Webmap, a collection of all known webpages and the metadata needed to understand those pages. Once the webmap was built, it was plugged into the company's search engine, where it could be queried by net users.

At Facebook, engineers use Hadoop clusters to analyze site data behind the firewall. But the platform is also used to crunch data for use in public Facebook services such as "Insights", the site's traffic tracker, and "People You May Know", the tool that recommends "friends".

Facebook engineers and analysts tap the company's Hadoop clusters via a SQL-like query language known as Hive. This was built at Facebook during one of the company's now-famous hackathons, and it's designed for use by programming novices. "We realized that although Hadoop would meet our scale needs, it was very hard to use for the average engineer, much less the average analyst in the company," Facebook open-source man David Recordon once told us.

"That was the genius of Hive. We decided to do Hive on top of Hadoop, so we could use Hadoop to solve our data scalability needs but at the same time provide a very easy interface."
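
To make that "easy interface" concrete, here is a minimal sketch, in plain Python rather than Hive or Hadoop's real APIs, of how a SQL-like aggregate along the lines of SELECT page, COUNT(*) FROM visits GROUP BY page might break down into map, shuffle and reduce steps. The visits table, its columns and every function name below are hypothetical, chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical rows from a "visits" table: (user, page).
VISITS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/photos"),
    ("carol", "/home"),
    ("bob", "/photos"),
]


def map_phase(rows):
    """Emit one (page, 1) pair per row, as a mapper would."""
    for _user, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework runs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(rows):
    """Toy equivalent of: SELECT page, COUNT(*) FROM visits GROUP BY page."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(VISITS))
```

The analyst writes only the one-line query; the three phases are what a tool of this kind would have to generate and schedule across a cluster on their behalf.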

Separate from its Hadoop work, Facebook built Cassandra, a distributed database also based on a piece of Google's backend. Google uses a proprietary distributed database known as BigTable that runs atop the Google File System (GFS), and it published a paper on the technology in 2006. In an echo of the Hadoop project, Facebook leaned on the paper in building Cassandra.

But Cassandra isn't a pure BigTable mimic. Facebook applied BigTable's data model to the design of Dynamo, the distributed storage system developed at Amazon, the retailer behind the increasingly popular Amazon Web Services cloud. Cassandra's authors included Avinash Lakshman, who helped build Dynamo at Amazon.

Facebook used Cassandra to drive the inbox search tool on its social networking site/worldwide phenomenon. And in 2008, it open-sourced the database under an Apache license.

What DataStax has done is put Hive on top of Cassandra, marrying one Facebook open source technology to another. Brisk includes both Hadoop MapReduce and Hive, letting you run epic-number-crunching jobs across commodity-hardware clusters. But it swaps out the Hadoop Distributed File System (HDFS) in favor of a compatible storage layer powered by Cassandra. And at the same time, you can use Cassandra as it was intended: as a database for realtime applications.
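
As a rough illustration of that arrangement, the sketch below uses a single in-memory store for both jobs: single-key reads and writes for the live application, and a full scan for a batch job that writes its results back into the same store. It is a toy model under stated assumptions, not Brisk's or Cassandra's API; ToyStore, run_batch_job and the key naming scheme are all hypothetical.

```python
from collections import Counter


class ToyStore:
    """In-memory stand-in for a shared storage layer (hypothetical, not Cassandra)."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        # "Realtime" path: single-key write.
        self._rows[key] = value

    def get(self, key, default=None):
        # "Realtime" path: single-key read.
        return self._rows.get(key, default)

    def scan(self, prefix):
        # "Analytics" path: return every row whose key starts with the prefix.
        return [(k, v) for k, v in self._rows.items() if k.startswith(prefix)]


def run_batch_job(store):
    """Count clicks per page and write the totals back into the same store."""
    totals = Counter(page for _key, page in store.scan("click:"))
    for page, count in totals.items():
        store.put(f"total:{page}", count)
    return totals


if __name__ == "__main__":
    store = ToyStore()
    # The live application records clicks as they happen.
    for i, page in enumerate(["/home", "/photos", "/home"]):
        store.put(f"click:{i}", page)
    # The batch job analyzes them and feeds the results back.
    run_batch_job(store)
    # The live application picks up a result with a single-key lookup.
    print(store.get("total:/home"))
```

The point of the toy is only that the analytics results land where the application already reads, with no export and import step between two separate systems.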
