The Brangelina of Big Data: Cassandra mates with Hadoop

Open source celebrity supercouple

Think of it as the Brangelina of Big Data.

DataStax, an open-source startup based in Northern California, has combined Cassandra, the distributed database developed at Facebook, with Hadoop, the epic-number-crunching platform based on Google's backend infrastructure. Known as "Brisk", this Big Data mashup was unveiled on Wednesday, with DataStax promising to open source the platform under an Apache licence within 45 days.

The idea is to offer a single platform that provides both a low-latency database for "realtime" web-scale applications and the sort of heavy data analysis you get with Hadoop. "One thing we're hearing from [enterprises] is that they need the complete Big Data picture, from realtime low-latency applications through to tools that analyze data – and the ability to use those tools to actually feed data back into applications," DataStax vice president of products Ben Werther tells The Register.

"We're giving them a platform that offers both low-latency and analytics in one."

The Hadoop project mimics GFS, Google's distributed file system, and MapReduce, the company's distributed number-crunching platform. Google published a pair of research papers describing the two technologies, the GFS paper in 2003 and the MapReduce paper in 2004, and they were soon in the hands of independent developer Doug Cutting, who used the descriptions to build a backend platform for Nutch, his open-source web crawler. He called the platform Hadoop after his son's yellow stuffed elephant, and it was open-sourced at Apache.

The platform is now used by many of the web's biggest names, from Facebook to Yahoo! to Twitter, and there's a growing market in the enterprise, with companies including Cloudera, another Silicon Valley startup, providing support and services.

Hadoop is not a realtime technology. Web giants such as Facebook use in-house Hadoop clusters to crunch epic amounts of data that can later be applied to live web services. Before offloading its core search duties to Microsoft, for instance, Yahoo! used Hadoop to build its Search Webmap, a collection of all known webpages and the metadata needed to understand those pages. Once the webmap was built, it was plugged into the company's search engine, where it could be queried by net users.

At Facebook, engineers use Hadoop clusters to analyze site data behind the firewall. But the platform is also used to crunch data for use in public Facebook services such as "Insight", the site's traffic tracker, and "People you know", the tool that recommends "friends".

Facebook engineers and analysts tap the company's Hadoop clusters via a SQL-like query language known as Hive. This was built at Facebook during one of the company's now-famous hackathons, and it's designed for use by programming novices. "We realized that although Hadoop would meet our scale needs, it was very hard to use for the average engineer, much less the average analyst in the company," Facebook open-source man David Recordon once told us.

"That was the genius of Hive. We decided to do Hive on top of Hadoop, so we could use Hadoop to solve our data scalability needs but at the same time provide a very easy interface."

Separate from its Hadoop work, Facebook built Cassandra, a distributed database also based on a piece of Google's backend. Google uses a proprietary distributed database known as BigTable that runs atop the Google File System (GFS), and it published a paper on the technology in 2006. In an echo of the Hadoop project, Facebook leaned on the paper in building Cassandra.

But Cassandra isn't a pure BigTable mimic. Facebook applied BigTable's data model to the design of Dynamo, the distributed key-value storage system Amazon built to back its own online services. Cassandra's authors included Avinash Lakshman, who helped build Dynamo at Amazon.

Facebook used Cassandra to drive the inbox search tool on its social networking site/worldwide phenomenon. And in 2008, it open-sourced the database under an Apache license.

What DataStax has done is put Hive on top of Cassandra, marrying one Facebook open source technology to another. Brisk includes both Hadoop MapReduce and Hive, letting you run epic-number-crunching jobs across commodity-hardware clusters. But it swaps out the Hadoop HDFS file system in favor of a compatible storage layer powered by Cassandra. And at the same time, you can use Cassandra as it was intended: as a database for realtime applications.
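
For readers who want to see the shape of the idea, here is a minimal, single-process sketch in Python. It is not Brisk's code and uses no Cassandra or Hadoop API: `KeyValueStore`, `run_batch_job`, and the `event:` and `views:` key prefixes are all hypothetical names invented for illustration. It shows one store serving two roles: a MapReduce-style batch job reads raw events from it and writes aggregates back, and an application then reads those aggregates with a simple key lookup.

```python
from collections import defaultdict


class KeyValueStore:
    """Hypothetical in-memory stand-in for a distributed database."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key, default=None):
        # Low-latency path: a single key lookup.
        return self._rows.get(key, default)

    def scan(self):
        # Batch path: return a snapshot of every row for analysis.
        return list(self._rows.items())


def map_phase(records, mapper):
    """Apply the mapper to each record, yielding (key, value) pairs."""
    for key, value in records:
        yield from mapper(key, value)


def reduce_phase(pairs, reducer):
    """Group mapped pairs by key, then reduce each group to one value."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


def page_view_mapper(event_id, event):
    # Emit one count per page view event.
    yield event["page"], 1


def sum_reducer(page, counts):
    return sum(counts)


def run_batch_job(store):
    """Count page views from raw events and write the totals back."""
    events = [(k, v) for k, v in store.scan() if k.startswith("event:")]
    totals = reduce_phase(map_phase(events, page_view_mapper), sum_reducer)
    for page, total in totals.items():
        store.put(f"views:{page}", total)
    return totals


store = KeyValueStore()
store.put("event:1", {"page": "/home"})
store.put("event:2", {"page": "/home"})
store.put("event:3", {"page": "/about"})

run_batch_job(store)
print(store.get("views:/home"))  # the application reads the aggregate by key
```

In a real cluster the map and reduce phases would run in parallel across many machines and the store would be replicated. The sketch only illustrates the feedback loop Werther describes, in which analysis results land back where the live application can read them.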
