Hey, Presto! Facebook spills petabyte-munching SQL brain sauce online

Zuck on that, Hortonworks and Cloudera

Facebook has come through on its promise to publish its Hive-beating "Presto" analysis software as open source.

The social network made the code available today under the Apache v2 license, giving developers access to an ANSI SQL-compatible query and analysis engine that is faster than Apache Hive and competes with Cloudera's Impala and Hortonworks's Stinger technologies.

Facebook uses the tool for graph analytics, machine learning, and short-turnaround queries. The system is four to seven times more CPU-efficient than Hive, the batch-oriented Hadoop cruncher, and returns query results eight to ten times faster.

Though it is designed to process data meant for Hive (or general Hadoop), it has "pluggable backends" that let it ingest info from other sources.

Facebook reckons the system could be relevant for people with 750GB or more of data needing analysis.

"It has also allowed us to provide a uniform SQL interface over multiple data backends such as HDFS, HBase, Scribe, and an internal in-memory data store," a spokesperson told El Reg via email.
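
To make the "pluggable backends" idea concrete, here is a minimal sketch of how one query function can sit on top of interchangeable data sources. Everything in it is a hypothetical illustration: the `Backend` interface, the two toy backends, and the `select` helper are invented for this example and are not Presto's actual connector API.

```python
from typing import Callable, Dict, Iterator, List, Protocol

Row = Dict[str, object]


class Backend(Protocol):
    """Hypothetical plug-in interface; not Presto's real connector API."""

    def scan(self, table: str) -> Iterator[Row]: ...


class InMemoryBackend:
    """Toy stand-in for an in-memory store: tables are lists of dicts."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterator[Row]:
        return iter(self._tables[table])


class CsvLinesBackend:
    """Toy stand-in for a file store: tables are CSV lines with a header row."""

    def __init__(self, tables: Dict[str, List[str]]) -> None:
        self._tables = tables

    def scan(self, table: str) -> Iterator[Row]:
        header, *lines = self._tables[table]
        columns = header.split(",")
        for line in lines:
            yield dict(zip(columns, line.split(",")))


def select(
    backend: Backend,
    table: str,
    columns: List[str],
    where: Callable[[Row], bool] = lambda row: True,
) -> List[Row]:
    """Roughly: SELECT <columns> FROM <table> WHERE <where>, on any backend."""
    return [
        {c: row[c] for c in columns}
        for row in backend.scan(table)
        if where(row)
    ]
```

The query layer only ever calls `scan`, so supporting a new data source means writing one more class with that method, while the query code stays the same.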

Unlike Hive, the Presto system does not depend on an underlying MapReduce compute framework, which Facebook says has led to improved scheduling. This has helped it work as a quick-turnaround system for interactive queries, rather than the batch processing jobs Hive is designed for.

The Java-based system works by parsing an ANSI SQL query into a distributed query plan. It then spins up dedicated workers for multiple slices of data, which it pulls from the underlying Hadoop Distributed File System (HDFS). Each worker runs a process that contains custom bytecode designed to increase execution efficiency. Data is stored and processed in-memory, and pipelined across the network between stages.
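
The staged, in-memory flow described above can be sketched in a few lines. This is a simplified, single-process illustration under stated assumptions, not Presto's implementation: the function names are hypothetical, the "workers" run one after another rather than in parallel across a network, and the bytecode generation step is left out entirely. The query it mimics is roughly `SELECT browser, COUNT(*) FROM visits WHERE country = 'US' GROUP BY browser`.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, str]


def split(rows: List[Row], n_slices: int) -> List[List[Row]]:
    """Divide the input into slices, one per simulated worker."""
    return [rows[i::n_slices] for i in range(n_slices)]


def scan_and_filter(slice_rows: Iterable[Row], country: str) -> Iterator[Row]:
    """Stage 1: stream matching rows onward one at a time (a generator),
    so the next stage starts before the whole slice has been read."""
    for row in slice_rows:
        if row["country"] == country:
            yield row


def partial_count(rows: Iterable[Row], key: str) -> Counter:
    """Stage 2: each worker aggregates its own slice in memory."""
    counts: Counter = Counter()
    for row in rows:
        counts[row[key]] += 1
    return counts


def run_query(
    rows: List[Row], country: str, key: str, n_slices: int = 3
) -> Dict[str, int]:
    """Final stage: merge the partial results from every worker."""
    total: Counter = Counter()
    for slice_rows in split(rows, n_slices):
        total += partial_count(scan_and_filter(slice_rows, country), key)
    return dict(total)
```

Because stage 1 is a generator, rows flow into the aggregation as they are produced, and nothing is written to disk between stages. That is the property that separates this style from a MapReduce job, which materializes intermediate results between phases.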

'Still a somewhat manual process to unpack and install'

"Through careful use of memory and data structures, Presto avoids typical issues of Java code related to memory allocation and garbage collection. (In a later post, we will share some tips and tricks for writing high-performance Java system code and the lessons learned while [...])," the company wrote in a blog post announcing the publication of Presto as open source.

Facebook put Presto into production in early 2013, and the system now has over 1,000 users performing 30,000 queries that handle at least a petabyte of data per day, the company said. This is up from the 850 users and 27,000 daily queries the company claimed in June when it first told El Reg about Presto. Since then, Facebook's data warehouse has ballooned from 250PB to 300PB in size, and Presto is being used to query all of it.

One developer who has used the software told us that the technology is usable, and that "it is still a somewhat manual process to unpack and install, but I was able to do so within 12 minutes on a few boxes, but this is the sort of thing that runs on clusters of tens, hundreds, or thousands."

"Presto works better at Facebook scale and for our use cases," a spokesperson told El Reg via email. Other web companies have had a chance to play with it as well, and we were sent canned quotes from companies like Airbnb and Dropbox.

"It's an order of magnitude faster than Hive in most of our use cases," Airbnb data scientist Chris Gutierrez said. "It reads directly from HDFS, so unlike [Amazon Web Services] Redshift, there isn't a lot of ETL [extract, transform, and load] before you can use it. It just works." ®
