HBase: Shops swap MySQL for open source Google mimic

Microsoft doesn't want it. But everyone else does

Facebook isn't the only one swapping MySQL for HBase, the open source distributed database platform based on Google's BigTable. The Hadoopian HBase is now in play at several of the web's most recognizable names – including Adobe, Yahoo!, Mozilla, and StumbleUpon – as well as smaller operations looking to climb their way to such online prominence.

HBase can't replace all MySQL installations, says HBase project chair Michael Stack, but unlike MySQL, it's specifically designed to operate at ginormous scale.

HBase is part of the Apache Hadoop project, a sweeping effort to mimic Google's proprietary infrastructure with open source code. It dovetails with HDFS, the Hadoop distributed file system, and Hadoop MapReduce, the distributed number-crunching platform. HBase is essentially a low-latency layer that sits atop HDFS, letting you rapidly store and retrieve data. It's fashioned after Google's BigTable platform, which Mountain View publicly described in a 2006 research paper.

HBase project chair Michael Stack is on staff at StumbleUpon, which has long used HBase for the real-time public counters that track users and pageviews across its service. StumbleUpon still employs MySQL in many areas and will continue to do so. But the idea is to swap in HBase wherever scale is an issue.

"I don't foresee StumbleUpon ever giving up on all of its MySQL instances. RDBMSs are just too useful," Stack tells The Reg. "The plan, though, is to shrink what MySQL does over time, let MySQL do what its good at and have HBase take over where MySQL is running up against limits handling ever-growing write rates, table sizes, etc."

In similar fashion, Canadian startup Tynt is moving from MySQL to HBase and Hadoop so it can readily scale its service, which lets websites distribute URLs whenever netizens cut-and-paste content. The service is meant to generate extra traffic for sites, but it also provides sites with data describing all the traffic – and cutting-and-pasting – it sees. Tynt is now used by over 600,000 online publishers, with the company logging over 20,000 events per second, and according to company CTO Cameron Befus, Tynt's MySQL infrastructure couldn't keep up with the service's growth.

The company is now using HDFS and MapReduce to store and analyze all that data, and this month it will begin to use HBase to serve up the data in real time. "We were growing at an exponential rate. The volume of data we were called on to produce was more than doubling every month," Befus says. "We knew that MySQL couldn't really handle effectively what we had, let alone what we expected. ... We're exceeding 20,000 events per second, and you've got to spread that across a large number of MySQL servers, and as you do that, it becomes very inefficient."

What's more, says Amr Awadallah, vice president of engineering and CTO at Cloudera, the commercial Hadoop outfit that helped erect Tynt's Hadoop platform, adding MySQL servers is more difficult than adding HBase nodes. "The headache is that every time you want to add a new MySQL server, it doesn't just assimilate into the collective easily," Awadallah explains.

"You have to repartition your data and rebalance your hashing technique across the new server and [specify] which range of keys now fall on that server and so on. With HBase, this happens transparently. You add nodes and you tell HBase you've added nodes and you join the collective."

Cloudera is what you might call a Red Hat for Hadoop. It offers its support and services for its own Hadoop distros. Tynt received consulting help from Cloudera when setting up a back-end platform based on the completely open source Cloudera Distribution of Hadoop, and it now pays Cloudera for support and updates.

At Tynt, HBase will initially be used to provide real-time API access to the service's analytics data, and it will eventually be used for other real-time tools as well. "HBase will also provide analytics, but much faster [than just MapReduce]," Befus says. MapReduce does batch processing; it doesn't provide real-time access to data.

Meanwhile, Yahoo! – which bootstrapped Hadoop's core HDFS and MapReduce projects – is using HBase as part of its COKE system (Content Optimization Knowledge Engine), a means of automatically selecting news stories for its front page. Mozilla has moved its Socorro crash-reporting system to HBase. And Adobe is using the platform to drive services across the company.

Like these outfits, Facebook is a longtime MySQL house. But its new messaging system – unveiled this past fall – uses HBase to juggle email, chat, and SMS as well as traditional on-site Facebook messages. HBase stores the text and metadata for messages as well as the indices needed to search them. The previous system needed about 75TB to store a month's worth of messages, and that figure will only grow with the new setup.

"The email workload is a write-dominated workload. We need to make a lot of writes very quickly," Facebook infrastructure guru Karthik Ranganathan said in a recent Facebook webcast. "We used HBase for the data that grows very fast, which is essentially the metadata."

But for all its success, HBase has lost one big-name user.

HBase was founded by Powerset, a San Francisco-based semantic search startup. Michael Stack was among the Powerset developers who helped get the project off the ground. In the summer of 2008, Microsoft acquired Powerset, and it eventually gave Stack and fellow committer Jim Kellerman the go-ahead to continue their contributions to the project.

"This is the first time we have acquired a company with committers to a key open source project who have been able to continue to commit to that project in their old capacity as part of their new role," Sam Ramji, Microsoft's then senior director of platform strategy told us at the time.

The HBase-based Powerset was folded into Bing, making the search engine one of the first "shipping" Microsoft products to actually include open source code. But a year and a half on, Powerset is no longer running on Hadoop. "As far as I know, there is no Hadoop or HBase in operation at Powerset these days," Stack says. And Microsoft has confirmed this with The Reg.

Hadoop, you see, doesn't really run on Windows. As much as things change at Microsoft, they stay the same. It was 13 years ago that Redmond purchased Hotmail, ripped out its FreeBSD servers, and replaced them with Windows 2000. ®
