Feeds

HBase: Shops swap MySQL for open source Google mimic

Microsoft doesn't want it. But everyone else does

Boost IT visibility and business value

Facebook isn't the only one swapping MySQL for HBase, the open source distributed database platform based on Google's BigTable. The Hadoopian HBase is now in play at several of the web's most recognizable names – including Adobe, Yahoo!, Mozilla, and StumbleUpon – as well as smaller operations looking to climb their way to such online prominence.

HBase can't replace all MySQL installations, says HBase project chair Michael Stack, but unlike MySQL, it's specifically designed to operate at ginormous scale.

HBase is part of the Apache Hadoop project, a sweeping effort to mimic Google's proprietary infrastructure with open source code. It dovetails with HDFS, the Hadoop distributed file system, and Hadoop MapReduce, the distributed number-crunching platform. HBase is essentially a low-latency layer that sits atop HDFS, letting you rapidly store and retrieve data. It's fashioned after Google's BigTable platform, which Mountain View publicly described in a 2006 research paper.

HBase project chair Michael Stack is on staff at StumbleUpon, which has long used HBase for the real-time public counters that track users and pageviews across its service. StumbleUpon still employs MySQL in many areas and will continue to do so. But the idea is to swap in HBase wherever scale is an issue.

"I don't foresee StumbleUpon ever giving up on all of its MySQL instances. RDBMSs are just too useful," Stack tells The Reg. "The plan, though, is to shrink what MySQL does over time, let MySQL do what its good at and have HBase take over where MySQL is running up against limits handling ever-growing write rates, table sizes, etc."

In similar fashion, Canadian startup Tynt is moving from MySQL to HBase and Hadoop so it can readily scale its service, which lets websites distribute URLs whenever netizens cut-and-paste content. The service is meant to generate extra traffic for sites, but it also provides sites with data describing all the traffic – and cutting-and-pasting – it sees. Tynt is now used by over 600,000 online publishers, with the company logging over 20,000 events per second, and according to company CTO Cameron Befus, Tynt's MySQL infrastructure couldn't keep up with the service's growth.

The company is now using HDFS and MapReduce to store and analyze all that data, and this month it will begin to use HBase to serve up the data in real time. "We were growing at an exponential rate. The volume of data we were called on to produce was more than doubling every month," Befus says. "We knew that MySQL couldn't really handle effectively what we had, let alone what we expected. ... We're exceeding 20,000 events per second, and you've got to spread that across a large number of MySQL servers, and as you do that, it becomes very inefficient."

What's more, says Amr Awadallah, vice president of engineering and CTO at Cloudera, the commercial Hadoop outfit that helped erect the company's Hadoop platform, simply adding MySQL servers is more difficult. "The headache is that every time you want to add a new MySQL server, it doesn't just assimilate into the collective easily," Awadallah explains.

"You have to repartition your data and rebalance your hashing technique across the new server and [specify] which range of keys now fall on that server and so on. With HBase, this happens transparently. You add nodes and you tell HBase you've added nodes and you join the collective."

Cloudera is what you might call a Red Hat for Hadoop. It offers its support and services for its own Hadoop distros. Tynt received consulting help from Cloudera when setting up a back-end platform based on the completely open source Cloudera Distribution of Hadoop, and it now pays Cloudera for support and updates.

At Tynt, HBase will initially be used to provide realtime API access to the service's analytics data, and it will eventually be used for other real-time tools as well. "HBase will also provide analytics, but much faster [than just MapReduce]." Befus says. MapReduce does batch processing; it doesn't provide real-time access to data.

Meanwhile, Yahoo! – which bootstrapped the Hadoop's core HDFS and MapReduce projects – is using HBase as part of its COKE system (Content Optimization Knowledge Engine), a means of automatically selecting news stories for its front page. Mozilla has moved its Socorro crash-reporting system HBase. And Adobe is using the platform to drive services across the company.

Like these outfits, Facebook is a longtime MySQL house. But its new messaging system – unveiled this past fall – uses HBase to juggle email, chat, and SMS as well as traditional on-site Facebook messages. HBase stores the text and metadata for messages as well as the indices needed to search them. The previous system needed about 75TB to store a month's worth of messages, and that figure will only grow with the new setup.

"The email workload is a write-dominated workload. We need to make a lot of writes very quickly," Facebook infrastructure guru Karthik Ranganathan said in a recent Facebook webcast. "We used HBase for the data that grows very fast, which is essentially the metadata."

But for all its success, HBase has lost one big-name user.

HBase was founded by Powerset, a San Francisco-based semantic search startup. Michael Stack was among the Powerset developers who helped get the project off the ground. In the summer of 2008, Microsoft acquired Powerset, and it eventually gave Stack and fellow committer Jim Kellerman the go-ahead to continue their contributions to the project.

"This is the first time we have acquired a company with committers to a key open source project who have been able to continue to commit to that project in their old capacity as part of their new role," Sam Ramji, Microsoft's then senior director of platform strategy told us at the time.

The HBase-based Powerset was folded into Bing, making the search engine one of the first "shipping" Microsoft product to actually include open source code. But a year an a half on, Powerset is no longer running on Hadoop. "As far as I know, there is no Hadoop or HBase in operation at Powerset these days," Stack says. And Microsoft has confirmed this with The Reg.

Hadoop, you see, doesn't really run on Windows. As much as things change at Microsoft, they stay the same. It was 13 years ago that Redmond purchased Hotmail, ripped out its FreeBSD servers, and replaced them with Windows 2000. ®

Build a business case: developing custom apps

More from The Register

next story
KDE releases ice-cream coloured Plasma 5 just in time for summer
Melty but refreshing - popular rival to Mint's Cinnamon's still a work in progress
Leaked Windows Phone 8.1 Update specs tease details of Nokia's next mobes
New screen sizes, dual SIMs, voice over LTE, and more
Mozilla keeps its Beard, hopes anti-gay marriage troubles are now over
Plenty on new CEO's todo list – starting with Firefox's slipping grasp
Apple: We'll unleash OS X Yosemite beta on the MASSES on 24 July
Starting today, regular fanbois will be guinea pigs, it tells Reg
Another day, another Firefox: Version 31 is upon us ALREADY
Web devs, Mozilla really wants you to like this one
Secure microkernel that uses maths to be 'bug free' goes open source
Hacker-repelling, drone-protecting code will soon be yours to tweak as you see fit
Cloudy CoreOS Linux distro declares itself production-ready
Lightweight, container-happy Linux gets first Stable release
prev story

Whitepapers

Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Boost IT visibility and business value
How building a great service catalog relieves pressure points and demonstrates the value of IT service management.
Why and how to choose the right cloud vendor
The benefits of cloud-based storage in your processes. Eliminate onsite, disk-based backup and archiving in favor of cloud-based data protection.
The Essential Guide to IT Transformation
ServiceNow discusses three IT transformations that can help CIO's automate IT services to transform IT and the enterprise.
Maximize storage efficiency across the enterprise
The HP StoreOnce backup solution offers highly flexible, centrally managed, and highly efficient data protection for any enterprise.