Hey, Presto! Facebook spills petabyte-munching SQL brain sauce online

Zuck on that, Hortonworks and Cloudera

Facebook has come through on its promise to publish its Hive-beating "Presto" analysis software as open source.

The code was made available by the social network today under the Apache v2 license, giving developers access to an ANSI-SQL compatible data query and analysis engine that is faster than Apache Hive, and competes with Cloudera's Impala and Hortonworks's Stinger technologies.

Facebook uses the tool for graph analytics, machine learning, and short-turnaround queries. Its CPU efficiency is four to seven times better than that of Hive, the batch Hadoop cruncher, and it returns query results eight to ten times faster.

Though it is designed to process data meant for Hive (or general Hadoop), it has "pluggable backends" that let it ingest info from other sources.

Facebook reckons the system could be relevant for people with 750GB or more of data needing analysis.

"It has also allowed us to provide a uniform SQL interface over multiple data backends such as HDFS, Hbase, Scribe, and an internal in-memory data store," a spokesperson told El Reg via email.
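The idea of one query interface sitting over several pluggable backends can be sketched roughly as follows. This is an illustrative Python toy, not Presto's actual connector API (Presto itself is written in Java): the `Backend` protocol, the `ListBackend` class, the `select` helper, and the sample rows are all hypothetical names and data invented for the example.

```python
from typing import Callable, Dict, Iterator, List, Protocol

Row = Dict[str, object]


class Backend(Protocol):
    """Hypothetical connector interface: any data source that can yield rows."""

    def scan(self, table: str) -> Iterator[Row]: ...


class ListBackend:
    """Toy in-memory backend standing in for HDFS, HBase, and so on."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterator[Row]:
        # Yield rows lazily so the caller never needs the whole table at once.
        yield from self._tables.get(table, [])


def select(backends: Dict[str, Backend], source: str, table: str,
           columns: List[str], where: Callable[[Row], bool]) -> List[Row]:
    """Run the same filter-and-project query against whichever backend is named."""
    rows = backends[source].scan(table)
    return [{c: r[c] for c in columns} for r in rows if where(r)]


backends: Dict[str, Backend] = {
    "hdfs": ListBackend({"events": [{"user": "a", "clicks": 3},
                                     {"user": "b", "clicks": 9}]}),
    "memory": ListBackend({"events": [{"user": "c", "clicks": 7}]}),
}

# The query is identical; only the source name changes.
heavy_hdfs = select(backends, "hdfs", "events", ["user"], lambda r: r["clicks"] > 5)
heavy_mem = select(backends, "memory", "events", ["user"], lambda r: r["clicks"] > 5)
```

The point of the design is that the query layer only ever talks to `scan`, so adding a new data source means writing one more backend class rather than touching the query code.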

Unlike Hive, the Presto system does not depend on an underlying MapReduce compute framework, which Facebook says has led to improved scheduling. This has helped it work as a quick-turnaround system for interactive queries, rather than the batch processing jobs Hive is designed for.

The Java-based system works by parsing an ANSI-SQL query into a distributed query plan. It then spins up dedicated workers for multiple slices of data which it pulls from the underlying Hadoop Distributed File System (HDFS). Each worker runs a process that contains custom bytecode designed to increase execution efficiency. Data is stored and processed in-memory, and pipelined across the network between stages.
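The slice-per-worker, pipelined flow described above can be illustrated with a small sketch. This is a single-process Python toy built on generators, not Presto's planner or its generated bytecode; the function names (`scan_split`, `partial_count`, `final_count`, `run_query`) and the sample data are assumptions made up for the example. It mimics a plan for a `SELECT country, COUNT(*) ... GROUP BY country` style of query: each slice is scanned and partially aggregated, and the partial results stream into a final stage without touching disk in between.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, str]


def scan_split(split: List[Row]) -> Iterator[Row]:
    """Stage 1: a worker streams rows from its slice of the data."""
    for row in split:
        yield row


def partial_count(rows: Iterable[Row], key: str) -> Iterator[Tuple[str, int]]:
    """Stage 2: the same worker aggregates its slice in memory."""
    counts = Counter(row[key] for row in rows)
    yield from counts.items()


def final_count(partials: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Stage 3: merge the partial counts arriving from every worker."""
    totals: Counter = Counter()
    for value, count in partials:
        totals[value] += count
    return dict(totals)


def run_query(splits: List[List[Row]], key: str) -> Dict[str, int]:
    # Chain the stages as generators: each stage pulls from the previous one,
    # so intermediate results are handed over in memory, never written to disk.
    partials = (pair for split in splits
                for pair in partial_count(scan_split(split), key))
    return final_count(partials)


splits = [
    [{"country": "US"}, {"country": "UK"}],
    [{"country": "US"}, {"country": "IN"}, {"country": "US"}],
]
result = run_query(splits, "country")
```

In a real cluster the stages would run in parallel on different machines; here they run one after another in a single process purely to show how data moves between them.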

'Still a somewhat manual process to unpack and install'

"Through careful use of memory and data structures, Presto avoids typical issues of Java code related to memory allocation and garbage collection. (In a later post, we will share some tips and tricks for writing high-performance Java system code and the lessons learned while," the company wrote in a blog post announcing the publication of Presto as open source.

Facebook put Presto into production in early 2013, and the system now has over 1,000 users performing 30,000 queries that handle at least a petabyte of data per day, the company said. This is up from the 850 users and 27,000 daily queries the company claimed in June when it first told El Reg about Presto. Since then, Facebook's data warehouse has ballooned from 250PB to 300PB in size, and Presto is being used to query all of it.

One developer who has used the software told us that the technology is usable, and that "it is still a somewhat manual process to unpack and install, but I was able to do so within 12 minutes on a few boxes, but this is the sort of thing that runs on clusters of tens, hundreds, or thousands."

"Presto works better at Facebook scale and for our use cases," a spokesperson told El Reg via email. Other web companies have had a chance to play with it as well, and we were sent canned quotes from companies like Airbnb and Dropbox.

"It's an order of magnitude faster than Hive in most of our use cases," Airbnb data scientist Chris Gutierrez said. "It reads directly from HDFS, so unlike [Amazon Web Services] Redshift, there isn't a lot of ETL [extract, transform, and load] before you can use it. It just works." ®
