EMC lets go of Greenplum Community Edition

Uncrippled data warehouse development

Application security programs and practises

EMC's Greenplum data warehousing appliance and database division has a new Community Edition of its eponymous parallel database. The Community Edition replaces the single-node edition of the database, which was not as useful for companies trying to create parallel databases for warehouses and business analytics.

It also has some new features that will eventually make their way into the commercial version.

Greenplum Community Edition is based on the code used in the Greenplum Database 4.0 release, which is itself a heavily customized version of the PostgreSQL database, which has been parallelized to run across multiple server nodes, and optimized for crank through ad-hoc queries and other unnatural acts that companies want to do with the information that would otherwise be safely sequestered in their production ERP systems. Greenplum started out pairing its database with Sun Microsystems' Sun Fire X4500 Opteron-based servers, but in the wake of Sun's acquisition by Oracle last year, Greenplum added Dell PowerEdge servers to its certified iron list.

EMC acquired Greenplum in July 2010 to get into the database warehousing business, for much the same reason that Oracle bought Sun: these days, you need to tune the hardware to the software and the software to the hardware to get the best performance.

Dell's PowerEdge servers are used in the EMC Data Computing Appliance, which was announced last October, although there is no reason why the software stack cannot be sold on the Vblock servers (based on Cisco Systems UCS blade servers) that are sold by the VCE partnership between EMC, its virtualization minion VMware, and Cisco.

The Community Edition, which you can download here, is certified to run on Dell and Sun x64-based servers (the exact list of compatible machines was not available as El Reg went to press). The code for the database is not available as an open source product, just like the commercial-grade Greenplum 4.0 database itself. It is, however, available in a prepackaged VMware virtual machine container if you want to run it on your laptop or desktop in a single-node configuration. Like the commercial-grade Greenplum parallel database, the Community Edition is supported on Oracle's 64-bit Solaris 10, Red Hat's Enterprise Linux, and Novell's SUSE Linux Enterprise Server operating systems. A variety of HP, Dell, Sun, and IBM x64 boxes have been supported on various Greenplum database releases.

Up until now, developers had to make do with a single-node version of the database to develop their data warehouses and applications. To help seed its market of potential customers, Greenplum (and then parent EMC) made this single-node binary edition of the database and analytic tools available as a set of binaries that would only run on a single node. This single-node setup has had tens of thousands of downloads, Steven Hillion, data scientist at Greenplum, told El Reg.

This, of course, defeated the whole purpose of the Greenplum database, which is to parallelize PostgreSQL to radically speed up query performance. With the Community Edition, developers can throw the database across a cluster and actually get a sense of how it will perform before shelling out the cash for the commercial edition. Moreover, they can test on the complete data set, not on a representative subset of their data.

Hillion was hired by EMC to run the Greenplum analytics lab back in May 2010, and was charged with bringing new analytics tools to the database and data warehousing appliances. (Hillion was director of engineering at Siebel, Kana Software, QRS, and M-Factor and is a mathematician from the University of California at Berkeley.) The first new tool that is being bundled with is MADlib, an open source library of analytic algorithms that Greenplum has cobbled together to do predictive modeling and interpretive statistics on their data. At the moment, Greenplum's customers use SAS, Matlab, or the open source R tools to perform these functions.

"As the scale of the problems increases, while they still want to use these tools, they also want to do their models right where the data is," explains Hillion. The data sets that customers are wrestling with are so large that you can't easily move them from machine (or cluster) to machine (or cluster). Moreover, the R tool can only do data analysis on a data set that fits inside main memory, which is why a company called Revolution Analytics launched parallelized (and closed source) extensions to R last May and continues to enhance this tool and gain customers. Fair savvy database analysts are the intended users for the MADlib extensions; they are not for the faint of heart.

But a new graphical tool called Alpine Miner, which comes from a bunch of Greenplum expats who formed a company called Alpine Solution (no S), is aimed at those who are new to data analytics and who need some help setting up workflows so they can set tools loose chewing on data.

The MADlib tools are free to Community Edition users, and EMC is happy to sell support contracts for the tools for those who want to pay for them. Alpine Solution is selling the GUI add-on on a per user basis, but pricing was not available at press time.

Customers are not supposed to use Greenplum Community Edition on any machine that is put into production, but there is nothing technical preventing you from doing so. (The software is not crippled like the single-node version, which had governors on it preventing it from scaling across multiple nodes.) The Community Edition license says that a "production license is required when used for internal data processing or any commercial or production purposes on servers larger than a single physical server with up to two (2) CPU sockets or a single virtual machine with up to eight (8) virtual CPU cores."

It will be interesting to see how many people try to put Community Edition into production. ®

Eight steps to building an HP BladeSystem

More from The Register

next story
Sysadmin Day 2014: Quick, there's still time to get the beers in
He walked over the broken glass, killed the thugs... and er... reconnected the cables*
SHOCK and AWS: The fall of Amazon's deflationary cloud
Just as Jeff Bezos did to books and CDs, Amazon's rivals are now doing to it
Amazon Reveals One Weird Trick: A Loss On Almost $20bn In Sales
Investors really hate it: Share price plunge as growth SLOWS in key AWS division
US judge: YES, cops or feds so can slurp an ENTIRE Gmail account
Crooks don't have folders labelled 'drug records', opines NY beak
Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage
Still doing 'forensics' on the caching layer – Beeb digi wonk
Manic malware Mayhem spreads through Linux, FreeBSD web servers
And how Google could cripple infection rate in a second
BlackBerry: Toss the server, mate... BES is in the CLOUD now
BlackBerry Enterprise Services takes aim at SMEs - but there's a catch
The triumph of VVOL: Everyone's jumping into bed with VMware
'Bandwagon'? Yes, we're on it and so what, say big dogs
prev story


Top three mobile application threats
Prevent sensitive data leakage over insecure channels or stolen mobile devices.
Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Boost IT visibility and business value
How building a great service catalog relieves pressure points and demonstrates the value of IT service management.
Designing a Defense for Mobile Applications
Learn about the various considerations for defending mobile applications - from the application architecture itself to the myriad testing technologies.
Build a business case: developing custom apps
Learn how to maximize the value of custom applications by accelerating and simplifying their development.