Hey, Presto! Facebook spills petabyte-munching SQL brain sauce online

Zuck on that, Hortonworks and Cloudera

Facebook has come through on its promise to publish its Hive-beating "Presto" analysis software as open source.

The code was made available by the social network today under the Apache v2 license, giving developers access to an ANSI-SQL compatible data query and analysis engine that is faster than Apache Hive, and competes with Cloudera's Impala and Hortonworks's Stinger technologies.

Facebook uses the tool for graph analytics, machine learning, and short-turnaround queries. The system has CPU performance four to seven times better than the Hive batch Hadoop cruncher, and returns query results eight to ten times faster.

Though it is designed to process data meant for Hive (or general Hadoop), it has "pluggable backends" that let it ingest info from other sources.
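
For readers who want to picture what a pluggable backend looks like, here is a minimal Python sketch of the general idea: one query front end, several interchangeable data sources behind it. Nothing here is Presto's actual connector API. `Backend`, `InMemoryBackend`, `CsvBackend` and `Engine` are hypothetical names made up for illustration, and a Python predicate stands in for a SQL `WHERE` clause.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List, Protocol

Row = Dict[str, object]


class Backend(Protocol):
    """Hypothetical connector contract: yield the rows of a named table."""

    def scan(self, table: str) -> Iterator[Row]:
        ...


class InMemoryBackend:
    """Stand-in for an in-memory data store."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterator[Row]:
        return iter(self._tables[table])


class CsvBackend:
    """Stand-in for a file-based store; each table is a blob of CSV text."""

    def __init__(self, files: Dict[str, str]) -> None:
        self._files = files

    def scan(self, table: str) -> Iterator[Row]:
        for row in csv.DictReader(io.StringIO(self._files[table])):
            yield dict(row)


class Engine:
    """Routes 'backend.table' names to whichever backend was registered."""

    def __init__(self) -> None:
        self._backends: Dict[str, Backend] = {}

    def register(self, name: str, backend: Backend) -> None:
        self._backends[name] = backend

    def select(
        self,
        qualified_table: str,
        where: Callable[[Row], bool] = lambda row: True,
    ) -> List[Row]:
        backend_name, table = qualified_table.split(".", 1)
        rows = self._backends[backend_name].scan(table)
        return [row for row in rows if where(row)]
```

The caller of `Engine.select` never needs to know whether rows came from memory or from CSV text, which is the point of the exercise. A real engine would also push filters down into the backend rather than scanning everything, which this sketch does not attempt.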

Facebook reckons the system could be relevant for people with 750GB or more of data needing analysis.

"It has also allowed us to provide a uniform SQL interface over multiple data backends such as HDFS, Hbase, Scribe, and an internal in-memory data store," a spokesperson told El Reg via email.

Unlike Hive, the Presto system does not depend on an underlying MapReduce compute framework, which Facebook says has led to improved scheduling. This has helped it work as a quick-turnaround system for interactive queries, rather than the batch processing jobs Hive is designed for.

The Java-based system works by parsing an ANSI-SQL query into a distributed query plan. It then spins up dedicated workers for multiple slices of data, which it pulls from the underlying Hadoop Distributed File System (HDFS). Each worker runs a process that contains custom bytecode designed to increase execution efficiency. Data is stored and processed in-memory, and pipelined across the network between stages.
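
The slice-and-pipeline pattern described above can be sketched in a few lines of Python. This is a toy under loud assumptions, not Facebook's code: threads stand in for worker machines, generators stand in for pipelined stages, and there is no bytecode generation, HDFS or network involved. All function names (`make_splits`, `run_worker`, `run_query` and friends) are hypothetical. The query being mimicked is roughly `SELECT page, COUNT(*) FROM visits WHERE country = ? GROUP BY page`.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def make_splits(rows: List[Row], n_splits: int) -> List[List[Row]]:
    """Cut the input into roughly equal slices, one per worker task."""
    return [rows[i::n_splits] for i in range(n_splits)]


def scan(split: Iterable[Row]) -> Iterator[Row]:
    """Stage 1: read rows from one slice of the data."""
    for row in split:
        yield row


def filter_stage(rows: Iterable[Row], country: str) -> Iterator[Row]:
    """Stage 2: keep only rows matching the WHERE clause."""
    return (row for row in rows if row["country"] == country)


def partial_count(rows: Iterable[Row], key: str) -> Counter:
    """Stage 3: per-worker partial aggregation for the GROUP BY."""
    counts: Counter = Counter()
    for row in rows:
        counts[row[key]] += 1
    return counts


def run_worker(split: List[Row], country: str) -> Counter:
    # Stages are chained as generators, so each row streams through
    # scan -> filter -> aggregate without intermediate lists being built.
    return partial_count(filter_stage(scan(split), country), "page")


def run_query(rows: List[Row], country: str, n_workers: int = 4) -> Dict[str, int]:
    """Fan slices out to workers, then merge their partial results."""
    splits = make_splits(rows, n_workers)
    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(lambda s: run_worker(s, country), splits):
            total.update(partial)  # final stage: sum the partial counts
    return dict(total)
```

The contrast with a MapReduce-style job is that nothing is written out between stages: each worker keeps its rows in memory and hands only a small partial result to the final merge.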

'Still a somewhat manual process to unpack and install'

"Through careful use of memory and data structures, Presto avoids typical issues of Java code related to memory allocation and garbage collection. (In a later post, we will share some tips and tricks for writing high-performance Java system code and the lessons learned while," the company wrote in a blog post announcing the publication of Presto as open source.

Facebook put Presto into production in early 2013, and the system now has over 1,000 users performing 30,000 queries that handle at least a petabyte of data per day, the company said. This is up from the 850 users and 27,000 daily queries the company claimed in June when it first told El Reg about Presto. Since then, Facebook's data warehouse has ballooned from 250PB to 300PB in size, and Presto is being used to query all of it.

One developer who has used the software told us that the technology is usable, and that "it is still a somewhat manual process to unpack and install, but I was able to do so within 12 minutes on a few boxes, but this is the sort of thing that runs on clusters of tens, hundreds, or thousands."

"Presto works better at Facebook scale and for our use cases," a spokesperson told El Reg via email. Other web companies have had a chance to play with it as well, and we were sent canned quotes from companies like Airbnb and Dropbox.

"It's an order of magnitude faster than Hive in most of our use cases," Airbnb data scientist Chris Gutierrez said. "It reads directly from HDFS, so unlike [Amazon Web Services] Redshift, there isn't a lot of ETL [extract, transform, and load] before you can use it. It just works." ®
