
The Brangelina of Big Data: Cassandra mates with Hadoop

Open source celebrity supercouple


Think of it as the Brangelina of Big Data.

DataStax, an open-source startup based in Northern California, has combined Cassandra, the distributed database developed at Facebook, with Hadoop, the epic-number-crunching platform based on Google's backend infrastructure. Known as "Brisk", this Big Data mashup was unveiled on Wednesday, with DataStax promising to open source the platform under an Apache licence within 45 days.

The idea is to offer a single platform that provides both a low-latency database for "realtime" web-scale applications and the sort of heavy data analysis you get with Hadoop. "One thing we're hearing from [enterprises] is that they need the complete Big Data picture, from realtime low-latency applications through to tools that analyze data – and the ability to use those tools to actually feed data back into applications," DataStax vice president of products Ben Werther tells The Register.

"We're giving them a platform that offers both low-latency and analytics in one."

The Hadoop project mimics GFS, Google's distributed file system, and MapReduce, the company's distributed number-crunching platform. In 2003 and 2004, Google published a pair of research papers describing the two technologies, and they were soon in the hands of independent developer Doug Cutting, who used the descriptions to build a backend platform for Nutch, his open-source web crawler. He called the platform Hadoop after his son's yellow stuffed elephant, and it was open-sourced at Apache.
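For readers who haven't met MapReduce, the programming model can be sketched in a few lines. What follows is a toy, single-machine illustration in Python, not Hadoop's actual Java API: the function names are ours, and a real cluster would spread the map and reduce work across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the job a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big elephant"]))
```

The appeal of the model is that each map call and each reduce call is independent of the others, which is what lets a framework farm them out across a cluster.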

The platform is now used by many of the web's biggest names, from Facebook to Yahoo! to Twitter, and there's a growing market in the enterprise, with companies including Cloudera, another Silicon Valley startup, providing support and services.


Hadoop is not a realtime technology. Web giants such as Facebook use in-house Hadoop clusters to crunch epic amounts of data that can later be applied to live web services. Before offloading its core search duties to Microsoft, for instance, Yahoo! used Hadoop to build its Search Webmap, a collection of all known webpages and the metadata needed to understand those pages. Once the webmap was built, it was plugged into the company's search engine, where it could be queried by net users.

At Facebook, engineers use Hadoop clusters to analyze site data behind the firewall. But the platform is also used to crunch data for use in public Facebook services such as "Insights", the site's traffic tracker, and "People You May Know", the tool that recommends "friends".

Facebook engineers and analysts tap the company's Hadoop clusters via a SQL-like query language known as Hive. This was built at Facebook during one of the company's now-famous hackathons, and it's designed for use by programming novices. "We realized that although Hadoop would meet our scale needs, it was very hard to use for the average engineer, much less the average analyst in the company," Facebook open-source man David Recordon once told us.

"That was the genius of Hive. We decided to do Hive on top of Hadoop, so we could use Hadoop to solve our data scalability needs but at the same time provide a very easy interface."

Separate from its Hadoop work, Facebook built Cassandra, a distributed database also based on a piece of Google's backend. Google uses a proprietary distributed database known as BigTable that runs atop the Google File System (GFS), and it published a paper on the technology in 2006. In an echo of the Hadoop project, Facebook leaned on the paper in building Cassandra.

But Cassandra isn't a pure BigTable mimic. Facebook applied BigTable's data model to the design of Dynamo, the distributed storage system Amazon developed to back its own online services. Cassandra's authors included Avinash Lakshman, who helped build Dynamo at Amazon.

Facebook used Cassandra to drive the inbox search tool on its social networking site/worldwide phenomenon. And in 2008, it open-sourced the database under an Apache license.

What DataStax has done is put Hive on top of Cassandra, marrying one Facebook open source technology to another. Brisk includes both Hadoop MapReduce and Hive, letting you run epic-number-crunching jobs across commodity-hardware clusters. But it swaps out the Hadoop HDFS file system in favor of a compatible storage layer powered by Cassandra. And at the same time, you can use Cassandra as it was intended: as a database for realtime applications.
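To see why a "compatible" storage layer matters, consider this minimal Python sketch. It is not Brisk's code and makes no claim about how DataStax built its layer: every class and function name here is hypothetical, and both stores are in-memory stand-ins. The point is only that a job written against a common file interface needn't care what sits underneath.

```python
from abc import ABC, abstractmethod


class FileStore(ABC):
    """Hypothetical file interface that analysis jobs are written against."""

    @abstractmethod
    def write(self, path, data):
        ...

    @abstractmethod
    def read(self, path):
        ...


class WholeFileStore(FileStore):
    """Stand-in for a conventional file system: one entry per file."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        if path not in self._files:
            raise FileNotFoundError(path)
        return self._files[path]


class RowBackedStore(FileStore):
    """Stand-in for a database-backed layer: files are split into blocks,
    and each block is stored as a row keyed by (path, block index)."""

    BLOCK_SIZE = 4  # tiny on purpose, so the example spans several rows

    def __init__(self):
        self._rows = {}
        self._block_counts = {}

    def write(self, path, data):
        # Drop any blocks left over from an earlier version of the file.
        for index in range(self._block_counts.get(path, 0)):
            del self._rows[(path, index)]
        blocks = [data[i:i + self.BLOCK_SIZE]
                  for i in range(0, len(data), self.BLOCK_SIZE)]
        for index, block in enumerate(blocks):
            self._rows[(path, index)] = block
        self._block_counts[path] = len(blocks)

    def read(self, path):
        if path not in self._block_counts:
            raise FileNotFoundError(path)
        return b"".join(self._rows[(path, index)]
                        for index in range(self._block_counts[path]))


def count_lines(store, path):
    """A 'job' that only knows about the FileStore interface."""
    return len(store.read(path).splitlines())


if __name__ == "__main__":
    for store in (WholeFileStore(), RowBackedStore()):
        store.write("/logs/clicks", b"a\nb\nc\n")
        print(type(store).__name__, count_lines(store, "/logs/clicks"))
```

The same `count_lines` job runs unchanged against either store, which is the property a drop-in replacement for HDFS would need to preserve.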
