Hey, Presto! Facebook spills petabyte-munching SQL brain sauce online
Zuck on that, Hortonworks and Cloudera
Facebook has come through on its promise to publish its Hive-beating "Presto" analysis software as open source.
The code was made available by the social network today under the Apache v2 license, giving developers access to an ANSI-SQL compatible data query and analysis engine that is faster than Apache Hive, and competes with Cloudera's Impala and Hortonworks's Stinger technologies.
Facebook uses the tool for graph analytics, machine learning, and short-turnaround queries. The system is four to seven times more CPU-efficient than Hive, the batch-oriented Hadoop cruncher, and returns query results eight to ten times faster.
Though it is designed to process data meant for Hive (or general Hadoop), it has "pluggable backends" that let it ingest info from other sources.
Facebook reckons the system could be relevant for people with 750GB or more of data needing analysis.
"It has also allowed us to provide a uniform SQL interface over multiple data backends such as HDFS, Hbase, Scribe, and an internal in-memory data store," a spokesperson told El Reg via email.
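To make the "pluggable backends" idea concrete, here is a minimal Python sketch of one query layer sitting over several data stores. It is not Presto's actual connector API (Presto itself is written in Java). The names `Connector`, `InMemoryConnector`, and `Catalog` are hypothetical and invented purely for illustration, as is the `backend.table` naming scheme.

```python
from typing import Any, Dict, Iterator, List, Protocol

Row = Dict[str, Any]


class Connector(Protocol):
    """Hypothetical backend interface: anything that can stream rows of a table."""

    def scan(self, table: str) -> Iterator[Row]:
        ...


class InMemoryConnector:
    """Toy backend that serves rows from plain Python lists."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterator[Row]:
        return iter(self._tables[table])


class Catalog:
    """Routes a 'backend.table' name to whichever connector is registered for it."""

    def __init__(self) -> None:
        self._connectors: Dict[str, Connector] = {}

    def register(self, name: str, connector: Connector) -> None:
        self._connectors[name] = connector

    def scan(self, qualified_name: str) -> Iterator[Row]:
        # Split only on the first dot so table names may themselves contain dots.
        backend, table = qualified_name.split(".", 1)
        return self._connectors[backend].scan(table)
```

The query layer only ever talks to `Catalog.scan`, so in this sketch a new data source would be supported by registering another object with a `scan` method rather than by changing the engine.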
Unlike Hive, the Presto system does not depend on an underlying MapReduce compute framework, which Facebook says has led to improved scheduling. This has helped it work as a quick-turnaround system for interactive queries, rather than the batch processing jobs Hive is designed for.
The Java-based system works by parsing an ANSI-SQL query into a distributed query plan. It then spins up dedicated workers for multiple slices of data, which it pulls from the underlying Hadoop Distributed File System (HDFS). Each worker runs a process that contains custom bytecode designed to increase execution efficiency. Data is stored and processed in memory, and pipelined across the network between stages.
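The staged, pipelined flow described above can be sketched in a few lines of Python. This is a single-process toy, not Presto's implementation: it leaves out SQL parsing, bytecode generation, and the network entirely, and the function names and sample query (`SELECT country, COUNT(*) FROM visits WHERE age >= 18 GROUP BY country`) are assumptions made up for the example. Generators stand in for pipelining, since each row flows through every stage without an intermediate result being written out.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def scan(split: Iterable[Row]) -> Iterator[Row]:
    """Stage 1: stream rows out of one slice (split) of the data."""
    for row in split:
        yield row


def filter_adults(rows: Iterable[Row]) -> Iterator[Row]:
    """Stage 2: apply the WHERE clause to rows as they arrive."""
    for row in rows:
        if row["age"] >= 18:
            yield row


def partial_count(rows: Iterable[Row]) -> Counter:
    """Stage 3: per-split partial aggregate for the GROUP BY."""
    counts: Counter = Counter()
    for row in rows:
        counts[row["country"]] += 1
    return counts


def run_query(splits: List[List[Row]]) -> Dict[str, int]:
    """Final stage: merge the partial aggregates from every split."""
    total: Counter = Counter()
    for split in splits:
        # In a real cluster each split would be handled by a separate worker;
        # here the splits are simply processed one after another.
        total.update(partial_count(filter_adults(scan(split))))
    return dict(total)
```

Calling `run_query` with two splits of visit rows returns one merged count per country. A MapReduce-style batch job would instead materialize the output of each phase before the next one starts.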
'Still a somewhat manual process to unpack and install'
"Through careful use of memory and data structures, Presto avoids typical issues of Java code related to memory allocation and garbage collection. (In a later post, we will share some tips and tricks for writing high-performance Java system code and the lessons learned while [...])," the company wrote in a blog post announcing the publication of Presto as open source.
Facebook put Presto into production in early 2013, and the system now has over 1,000 users running 30,000 queries a day that together process at least a petabyte of data, the company said. This is up from the 850 users and 27,000 daily queries the company claimed in June when it first told El Reg about Presto. Since then, Facebook's data warehouse has ballooned from 250PB to 300PB in size, and Presto is being used to query all of it.
One developer who has used the software told us that the technology is usable, and that "it is still a somewhat manual process to unpack and install, but I was able to do so within 12 minutes on a few boxes, but this is the sort of thing that runs on clusters of tens, hundreds, or thousands."
"Presto works better at Facebook scale and for our use cases," a spokesperson told El Reg via email. Other web companies have had a chance to play with it as well, and we were sent canned quotes from companies like Airbnb and Dropbox.
"It's an order of magnitude faster than Hive in most of our use cases," Airbnb data scientist Chris Gutierrez said. "It reads directly from HDFS, so unlike [Amazon Web Services] Redshift, there isn't a lot of ETL [extract, transform, and load] before you can use it. It just works." ®