Facebook Open Sources Presto for Munching Petabytes of Data

Facebook unveiled Presto, a SQL-on-Hadoop engine that it developed in-house, back in June this year. The SQL engine is capable of doing fast interactive analysis on the social networking site’s enormous 250-petabyte-and-growing data warehouse, with processing speed 10 times faster than Hive.

Today the company has open sourced Presto and the code was made available today under the Apache v2 license. According to Facebook, Presto is “ten times better” than alternatives such as Hive when it comes to CPU efficiency and latency for a large number of queries.

“It currently supports a large subset of ANSI SQL, including joins, left/right outer joins, sub-queries, and most of the common aggregate and scalar functions, including approximate distinct counts (using HyperLogLog) and approximate percentiles (based on quantile digest),” Martin Traverso, a software engineer at Facebook said.


Facebook initially relied on Hadoop MapReduce along with Hive, however, as users increased and its data kept multiplying, the approached seemed very slow. To overcome this issue, Facebook started the development of Presto in the fall of 2012 and was released to Facebook employees last spring. Facebook says that the engine is used by over 1000 employees, running over 30,000 queries on a daily basis.

So, who can use Presto? Well, if you’re a business with 750GB or more data, Presto could be the right choice for you, and Facebook estimates that the system could be relevant for such businesses.

Presto, unlike Hive, does not depend on MapReduce computing framework, which in fact has led to improved scheduling, says Facebook. The software is already being tested by a number of other large Internet services, namely AirBnB and Dropbox.

You can get the source code here.