Twitter to open source MySQL-to-Hadoop tool

Data Crane

Hadoop Summit: Twitter intends to open source an additional piece of the Hadoop-happy infrastructure it uses for internal data analysis. Known as Crane, this is a tool for moving data from MySQL into Hadoop, the open source data-crunching platform modelled on Google's proprietary MapReduce and Google File System infrastructure.

Twitter uses Hadoop for ad hoc analysis of data collected from its famous microblogging service, but the platform also crunches data for use by live tools on the site, including Twitter's name-search function.

Speaking today at the Yahoo!-sponsored Hadoop Summit in Santa Clara, California, Twitter analytics man Kevin Weil explained that the company handles Hadoop data input in essentially two ways. It does log collection with the open source Scribe developed at Facebook, logging seven terabytes of data into the Hadoop Distributed File System (HDFS) each day, and it handles tabular data with Crane.

Most of Twitter's tabular data is stored in MySQL, though "a little" is stored in the Cassandra open source distributed database and Twitter's open source "social graph" data store, Flock. "Other than that," Weil said, "everything you do on Twitter ends up in a MySQL table somewhere."

Crane was developed to move data from MySQL to HDFS or to the Hadoop-friendly distributed database known as HBase, but also to other MySQL databases. "We needed to have a flexible data-moving tool, so we built Crane, which is a configuration-driven ETL [extract, transform, and load] tool," Weil said.

The tool moves data not only into MySQL, HDFS, and HBase, but also into Flock, Google Analytics, and Facebook Insights.
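To make "configuration-driven ETL" concrete, here is a minimal Python sketch of the general idea, not Crane itself. Crane's configuration format has not been published, so the `JOB` keys, the transform names, and the sink registry are all hypothetical. An in-memory SQLite database stands in for MySQL, and a plain Python list stands in for a destination such as HDFS or HBase.

```python
import sqlite3

# Hypothetical job description. None of these keys come from Crane;
# they only illustrate a job defined by configuration rather than code.
JOB = {
    "source_query": "SELECT id, screen_name FROM users ORDER BY id",
    "transform": "lowercase_name",
    "sink": "memory",
}


def lowercase_name(row):
    """Transform step: normalise the screen name to lower case."""
    user_id, name = row
    return (user_id, name.lower())


# Registry mapping transform names used in the config to functions.
TRANSFORMS = {
    "identity": lambda row: row,
    "lowercase_name": lowercase_name,
}


def make_memory_sink(store):
    """Build a sink that appends rows to a list (stand-in for HDFS/HBase)."""
    def sink(rows):
        store.extend(rows)
    return sink


def run_job(conn, job, sinks):
    """Extract rows with the configured query, transform them, load them."""
    rows = conn.execute(job["source_query"])
    transform = TRANSFORMS[job["transform"]]
    sinks[job["sink"]](transform(row) for row in rows)


def demo():
    # SQLite stands in for MySQL so the sketch runs without a server.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, screen_name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Alice"), (2, "BOB")])
    loaded = []
    run_job(conn, JOB, {"memory": make_memory_sink(loaded)})
    conn.close()
    return loaded


if __name__ == "__main__":
    print(demo())
```

In a design like this, pointing the same job at a different destination means changing the `sink` entry and registering another sink function, which is presumably the kind of flexibility a configuration-driven tool is after.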

Like Yahoo! — and unlike Facebook — Twitter does its Hadoop programming in Pig. Developed by Yahoo!, the open source Pig is a lower-level language than the Facebook-developed Hive. But it operates at a significantly higher level than raw Hadoop MapReduce code.

According to Weil, Pig requires about five per cent of the code and five per cent of the coding effort of raw Hadoop MapReduce, and it comes within 30 per cent of MapReduce's execution time.

Twitter employees access Hadoop via a dashboard known as BirdBrain, much as Facebookers use a Hive GUI known as HiPal.

A more general Hadoop interface was just open sourced by all-star startup Cloudera. Formerly known as the Cloudera Desktop, HUE (short for Hadoop User Experience) provides a web-based graphical user interface for creating and submitting jobs on a Hadoop cluster, monitoring the cluster's health, and browsing stored data. ®
