HBase: Shops swap MySQL for open source Google mimic

Microsoft doesn't want it. But everyone else does

Facebook isn't the only one swapping MySQL for HBase, the open source distributed database platform based on Google's BigTable. The Hadoopian HBase is now in play at several of the web's most recognizable names – including Adobe, Yahoo!, Mozilla, and StumbleUpon – as well as smaller operations looking to climb their way to such online prominence.

HBase can't replace all MySQL installations, says HBase project chair Michael Stack, but unlike MySQL, it's specifically designed to operate at ginormous scale.

HBase is part of the Apache Hadoop project, a sweeping effort to mimic Google's proprietary infrastructure with open source code. It dovetails with HDFS, the Hadoop distributed file system, and Hadoop MapReduce, the distributed number-crunching platform. HBase is essentially a low-latency layer that sits atop HDFS, letting you rapidly store and retrieve data. It's fashioned after Google's BigTable platform, which Mountain View publicly described in a 2006 research paper.
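For readers who want a feel for that data model, here is a toy, in-memory sketch of the idea the BigTable paper describes: a sorted map from row key to column to timestamped value. It is not HBase's API. The class name `ToyTable`, its methods, and the row and column names are all hypothetical, chosen only for illustration.

```python
import bisect


class ToyTable:
    """Illustrative only: a single-process stand-in for a BigTable-style table."""

    def __init__(self):
        self._row_keys = []  # row keys kept in sorted order
        self._rows = {}      # row key -> {column -> [(timestamp, value), ...]}

    def put(self, row, column, value, ts):
        """Store a new version of a cell; older versions are retained."""
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((ts, value))
        # Keep the newest version first so reads are a simple lookup.
        versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, row, column):
        """Return the newest value for a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start, stop):
        """Yield (row, newest cells) for row keys in [start, stop), in key order."""
        lo = bisect.bisect_left(self._row_keys, start)
        hi = bisect.bisect_left(self._row_keys, stop)
        for key in self._row_keys[lo:hi]:
            cells = {col: vers[0][1] for col, vers in self._rows[key].items()}
            yield key, cells


table = ToyTable()
table.put("page:002", "stats:views", 7, ts=1)
table.put("page:001", "stats:views", 10, ts=1)
table.put("page:001", "stats:views", 11, ts=2)  # newer version of the same cell
table.put("user:001", "info:name", "alice", ts=1)

print(table.get("page:001", "stats:views"))
for row, cells in table.scan("page:", "user:"):
    print(row, cells)
```

Because rows are kept sorted by key, a range scan reads neighbouring rows together. In the real thing, that sorted key space is what gets carved up and spread across machines.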

HBase project chair Michael Stack is on staff at StumbleUpon, which has long used HBase for the real-time public counters that track users and pageviews across its service. StumbleUpon still employs MySQL in many areas and will continue to do so. But the idea is to swap in HBase wherever scale is an issue.

"I don't foresee StumbleUpon ever giving up on all of its MySQL instances. RDBMSs are just too useful," Stack tells The Reg. "The plan, though, is to shrink what MySQL does over time, let MySQL do what its good at and have HBase take over where MySQL is running up against limits handling ever-growing write rates, table sizes, etc."

In similar fashion, Canadian startup Tynt is moving from MySQL to HBase and Hadoop so it can readily scale its service, which lets websites distribute URLs whenever netizens cut-and-paste content. The service is meant to generate extra traffic for sites, but it also provides sites with data describing all the traffic – and cutting-and-pasting – it sees. Tynt is now used by over 600,000 online publishers, with the company logging over 20,000 events per second, and according to company CTO Cameron Befus, Tynt's MySQL infrastructure couldn't keep up with the service's growth.

The company is now using HDFS and MapReduce to store and analyze all that data, and this month it will begin to use HBase to serve up the data in real time. "We were growing at an exponential rate. The volume of data we were called on to produce was more than doubling every month," Befus says. "We knew that MySQL couldn't really handle effectively what we had, let alone what we expected. ... We're exceeding 20,000 events per second, and you've got to spread that across a large number of MySQL servers, and as you do that, it becomes very inefficient."

What's more, says Amr Awadallah, vice president of engineering and CTO at Cloudera, the commercial Hadoop outfit that helped erect the company's Hadoop platform, simply adding MySQL servers is more difficult. "The headache is that every time you want to add a new MySQL server, it doesn't just assimilate into the collective easily," Awadallah explains.

"You have to repartition your data and rebalance your hashing technique across the new server and [specify] which range of keys now fall on that server and so on. With HBase, this happens transparently. You add nodes and you tell HBase you've added nodes and you join the collective."

Cloudera is what you might call a Red Hat for Hadoop. It offers its support and services for its own Hadoop distros. Tynt received consulting help from Cloudera when setting up a back-end platform based on the completely open source Cloudera Distribution of Hadoop, and it now pays Cloudera for support and updates.

At Tynt, HBase will initially be used to provide real-time API access to the service's analytics data, and it will eventually be used for other real-time tools as well. "HBase will also provide analytics, but much faster [than just MapReduce]," Befus says. MapReduce does batch processing; it doesn't provide real-time access to data.

Meanwhile, Yahoo! – which bootstrapped Hadoop's core HDFS and MapReduce projects – is using HBase as part of its COKE system (Content Optimization Knowledge Engine), a means of automatically selecting news stories for its front page. Mozilla has moved its Socorro crash-reporting system to HBase. And Adobe is using the platform to drive services across the company.

Like these outfits, Facebook is a longtime MySQL house. But its new messaging system – unveiled this past fall – uses HBase to juggle email, chat, and SMS as well as traditional on-site Facebook messages. HBase stores the text and metadata for messages as well as the indices needed to search them. The previous system needed about 75TB to store a month's worth of messages, and that figure will only grow with the new setup.

"The email workload is a write-dominated workload. We need to make a lot of writes very quickly," Facebook infrastructure guru Karthik Ranganathan said in a recent Facebook webcast. "We used HBase for the data that grows very fast, which is essentially the metadata."

But for all its success, HBase has lost one big-name user.

HBase was founded by Powerset, a San Francisco-based semantic search startup. Michael Stack was among the Powerset developers who helped get the project off the ground. In the summer of 2008, Microsoft acquired Powerset, and it eventually gave Stack and fellow committer Jim Kellerman the go-ahead to continue their contributions to the project.

"This is the first time we have acquired a company with committers to a key open source project who have been able to continue to commit to that project in their old capacity as part of their new role," Sam Ramji, Microsoft's then senior director of platform strategy told us at the time.

The HBase-based Powerset was folded into Bing, making the search engine one of the first "shipping" Microsoft products to actually include open source code. But a year and a half on, Powerset is no longer running on Hadoop. "As far as I know, there is no Hadoop or HBase in operation at Powerset these days," Stack says. And Microsoft has confirmed this with The Reg.

Hadoop, you see, doesn't really run on Windows. As much as things change at Microsoft, they stay the same. It was 13 years ago that Redmond purchased Hotmail, ripped out its FreeBSD servers, and replaced them with Windows 2000. ®
