EMC lets go of Greenplum Community Edition
Uncrippled data warehouse development
EMC's Greenplum data warehousing appliance and database division has a new Community Edition of its eponymous parallel database. The Community Edition replaces the single-node edition of the database, which was not as useful for companies trying to create parallel databases for warehouses and business analytics.
It also has some new features that will eventually make their way into the commercial version.
Greenplum Community Edition is based on the code used in the Greenplum Database 4.0 release, which is itself a heavily customized version of the PostgreSQL database that has been parallelized to run across multiple server nodes. It is optimized to crank through ad-hoc queries and other unnatural acts that companies want to do with the information that would otherwise be safely sequestered in their production ERP systems. Greenplum started out pairing its database with Sun Microsystems' Sun Fire X4500 Opteron-based servers, but in the wake of Sun's acquisition by Oracle last year, Greenplum added Dell PowerEdge servers to its certified iron list.
EMC acquired Greenplum in July 2010 to get into the data warehousing business, for much the same reason that Oracle bought Sun: these days, you need to tune the hardware to the software and the software to the hardware to get the best performance.
Dell's PowerEdge servers are used in the EMC Data Computing Appliance, which was announced last October, although there is no reason why the software stack cannot be sold on the Vblock servers (based on Cisco Systems UCS blade servers) that are sold by the VCE partnership between EMC, its virtualization minion VMware, and Cisco.
The Community Edition, which is available for download, is certified to run on Dell and Sun x64-based servers (the exact list of compatible machines was not available as El Reg went to press). Like the commercial-grade Greenplum 4.0 database, the Community Edition is not open source. It is, however, available in a prepackaged VMware virtual machine container if you want to run it on your laptop or desktop in a single-node configuration. Like the commercial-grade Greenplum parallel database, the Community Edition is supported on Oracle's 64-bit Solaris 10, Red Hat's Enterprise Linux, and Novell's SUSE Linux Enterprise Server operating systems. A variety of HP, Dell, Sun, and IBM x64 boxes have been supported on various Greenplum database releases.
Up until now, developers had to make do with a single-node version of the database to develop their data warehouses and applications. To help seed its market of potential customers, Greenplum (and then parent EMC) made the database and analytic tools available as a set of binaries that would only run on a single node. This single-node edition has had tens of thousands of downloads, Steven Hillion, data scientist at Greenplum, told El Reg.
This, of course, defeated the whole purpose of the Greenplum database, which is to parallelize PostgreSQL to radically speed up query performance. With the Community Edition, developers can throw the database across a cluster and actually get a sense of how it will perform before shelling out the cash for the commercial edition. Moreover, they can test on the complete data set, not on a representative subset of their data.
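To see why a single node misses the point, here is a minimal sketch of the general scatter-gather pattern that shared-nothing parallel databases rely on. It is not Greenplum's actual executor, and the row schema and function names are hypothetical. Each node aggregates only the rows it holds, and a coordinating node merges the small partial results to finish the query.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical schema for illustration: (region, sale_amount).
Row = Tuple[str, float]
# Partial result from one node: region -> (row_count, running_sum).
Partial = Dict[str, Tuple[int, float]]


def scan_segment(rows: Iterable[Row]) -> Partial:
    """Work done on one node: aggregate only the locally stored rows."""
    counts: Dict[str, int] = defaultdict(int)
    sums: Dict[str, float] = defaultdict(float)
    for region, amount in rows:
        counts[region] += 1
        sums[region] += amount
    return {region: (counts[region], sums[region]) for region in counts}


def gather(partials: Iterable[Partial]) -> Dict[str, float]:
    """Work done on the coordinator: merge partials, then finish the average."""
    counts: Dict[str, int] = defaultdict(int)
    sums: Dict[str, float] = defaultdict(float)
    for partial in partials:
        for region, (n, total) in partial.items():
            counts[region] += n
            sums[region] += total
    return {region: sums[region] / counts[region] for region in counts}


# Three lists stand in for three nodes. In a real cluster the scans run
# at the same time on separate machines; here they run one after another.
segments: List[List[Row]] = [
    [("east", 10.0), ("west", 4.0)],
    [("east", 20.0)],
    [("west", 6.0), ("west", 8.0)],
]

averages = gather(scan_segment(segment) for segment in segments)
print(averages)
```

Only the per-region counts and sums cross the network, not the rows themselves, which is why adding nodes can shorten a scan-heavy query and why a one-node build gives little sense of cluster performance.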
Hillion was hired by EMC to run the Greenplum analytics lab back in May 2010, and was charged with bringing new analytics tools to the database and data warehousing appliances. (Hillion was director of engineering at Siebel, Kana Software, QRS, and M-Factor and is a mathematician from the University of California at Berkeley.) The first new tool being bundled with the Community Edition is MADlib, an open source library of analytic algorithms that Greenplum has cobbled together so customers can do predictive modeling and interpretive statistics on their data. At the moment, Greenplum's customers use SAS, Matlab, or the open source R tools to perform these functions.
"As the scale of the problems increases, while they still want to use these tools, they also want to do their models right where the data is," explains Hillion. The data sets that customers are wrestling with are so large that you can't easily move them from machine (or cluster) to machine (or cluster). Moreover, the R tool can only do data analysis on a data set that fits inside main memory, which is why a company called Revolution Analytics launched parallelized (and closed source) extensions to R last May and continues to enhance this tool and gain customers. Fairly savvy database analysts are the intended users for the MADlib extensions; they are not for the faint of heart.
But a new graphical tool called Alpine Miner, which comes from a bunch of Greenplum expats who formed a company called Alpine Solution (no S), is aimed at those who are new to data analytics and who need some help setting up workflows so they can set tools loose chewing on data.
The MADlib tools are free to Community Edition users, and EMC is happy to sell support contracts for the tools for those who want to pay for them. Alpine Solution is selling the GUI add-on on a per user basis, but pricing was not available at press time.
Customers are not supposed to use Greenplum Community Edition on any machine that is put into production, but there is nothing technical preventing you from doing so. (The software is not crippled like the single-node version, which had governors on it preventing it from scaling across multiple nodes.) The Community Edition license says that a "production license is required when used for internal data processing or any commercial or production purposes on servers larger than a single physical server with up to two (2) CPU sockets or a single virtual machine with up to eight (8) virtual CPU cores."
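The thresholds in that clause can be written down as a small check. This is a sketch of one literal reading of the quoted sentence only. The `Deployment` fields and the function name are hypothetical, and the full license text, which is not reproduced here, governs any real decision.

```python
from dataclasses import dataclass

# Limits taken from the license sentence quoted above.
MAX_SOCKETS_PHYSICAL = 2
MAX_VCPUS_VIRTUAL = 8


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of an installation, for illustration only."""
    production_use: bool  # internal data processing, commercial, or production
    machines: int         # number of physical servers or virtual machines
    virtual: bool         # True if the machines are virtual machines
    cpu_units: int        # sockets per physical server, or vCPU cores per VM


def needs_production_license(d: Deployment) -> bool:
    """Apply one literal reading of the quoted clause (an assumption)."""
    if not d.production_use:
        return False  # development and test use is not covered by the clause
    if d.machines > 1:
        return True   # anything beyond a single server or single VM
    limit = MAX_VCPUS_VIRTUAL if d.virtual else MAX_SOCKETS_PHYSICAL
    return d.cpu_units > limit


dev_cluster = Deployment(production_use=False, machines=16, virtual=False, cpu_units=2)
small_prod_box = Deployment(production_use=True, machines=1, virtual=False, cpu_units=2)
prod_cluster = Deployment(production_use=True, machines=4, virtual=False, cpu_units=2)
big_prod_vm = Deployment(production_use=True, machines=1, virtual=True, cpu_units=16)

for name, d in [("dev cluster", dev_cluster), ("small prod box", small_prod_box),
                ("prod cluster", prod_cluster), ("big prod VM", big_prod_vm)]:
    print(name, needs_production_license(d))
```

Read this way, a 16-node development cluster falls outside the clause, while a four-node production cluster does not.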
It will be interesting to see how many people try to put Community Edition into production. ®