EMC lets go of Greenplum Community Edition

Uncrippled data warehouse development

EMC's Greenplum data warehousing appliance and database division has a new Community Edition of its eponymous parallel database. The Community Edition replaces the single-node edition of the database, which was not as useful for companies trying to create parallel databases for warehouses and business analytics.

It also has some new features that will eventually make their way into the commercial version.

Greenplum Community Edition is based on the code used in the Greenplum Database 4.0 release, which is itself a heavily customized version of the PostgreSQL database that has been parallelized to run across multiple server nodes and optimized to crank through ad-hoc queries and other unnatural acts that companies want to do with the information that would otherwise be safely sequestered in their production ERP systems. Greenplum started out pairing its database with Sun Microsystems' Sun Fire X4500 Opteron-based servers, but in the wake of Sun's acquisition by Oracle last year, Greenplum added Dell PowerEdge servers to its certified iron list.

EMC acquired Greenplum in July 2010 to get into the data warehousing business, for much the same reason that Oracle bought Sun: these days, you need to tune the hardware to the software and the software to the hardware to get the best performance.

Dell's PowerEdge servers are used in the EMC Data Computing Appliance, which was announced last October, although there is no reason why the software stack cannot be sold on the Vblock servers (based on Cisco Systems UCS blade servers) that are sold by the VCE partnership between EMC, its virtualization minion VMware, and Cisco.

The Community Edition, which you can download here, is certified to run on Dell and Sun x64-based servers (the exact list of compatible machines was not available as El Reg went to press). Like the commercial-grade Greenplum 4.0 database itself, the Community Edition is not open source, so the code for the database is not available. It is, however, available in a prepackaged VMware virtual machine container if you want to run it on your laptop or desktop in a single-node configuration. Like the commercial-grade Greenplum parallel database, the Community Edition is supported on Oracle's 64-bit Solaris 10, Red Hat's Enterprise Linux, and Novell's SUSE Linux Enterprise Server operating systems. A variety of HP, Dell, Sun, and IBM x64 boxes have been supported on various Greenplum database releases.

Up until now, developers had to make do with a single-node version of the database to develop their data warehouses and applications. To help seed its market of potential customers, Greenplum (and then parent EMC) made the database and its analytic tools available as a set of binaries that would only run on a single node. This single-node setup has had tens of thousands of downloads, Steven Hillion, data scientist at Greenplum, told El Reg.

This, of course, defeated the whole purpose of the Greenplum database, which is to parallelize PostgreSQL to radically speed up query performance. With the Community Edition, developers can throw the database across a cluster and actually get a sense of how it will perform before shelling out the cash for the commercial edition. Moreover, they can test on the complete data set, not on a representative subset of their data.
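For readers who want a feel for why spreading a query across nodes helps, here is a minimal scatter/gather sketch in Python. It is not Greenplum's code or anything like its actual planner: the function names, the round-robin distribution, and the use of threads to stand in for server nodes are all assumptions made for illustration. Each pretend node computes a partial sum and row count over its own slice of the data, and a coordinator combines the partials into an average.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, segments):
    """Deal rows round-robin across a number of hypothetical segments."""
    return [rows[i::segments] for i in range(segments)]


def partial_aggregate(rows):
    """Work done on one segment: a local sum and a local row count."""
    return sum(rows), len(rows)


def parallel_mean(rows, segments=4):
    """Scatter the rows, aggregate each slice, then gather the partials."""
    parts = partition(rows, segments)
    # Threads stand in for server nodes here; they show the shape of the
    # computation, not a real speedup.
    with ThreadPoolExecutor(max_workers=segments) as pool:
        partials = list(pool.map(partial_aggregate, parts))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(parallel_mean(list(range(1, 101)), segments=4))
```

The point of the sketch is that only the small partial results travel back to the coordinator, which is also why a single-node build cannot show how a query will behave on a cluster.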

Hillion was hired by EMC to run the Greenplum analytics lab back in May 2010, and was charged with bringing new analytics tools to the database and data warehousing appliances. (Hillion was director of engineering at Siebel, Kana Software, QRS, and M-Factor and is a mathematician from the University of California at Berkeley.) The first new tool that is being bundled with the Community Edition is MADlib, an open source library of analytic algorithms that Greenplum has cobbled together so customers can do predictive modeling and interpretive statistics on their data. At the moment, Greenplum's customers use SAS, Matlab, or the open source R tools to perform these functions.

"As the scale of the problems increases, while they still want to use these tools, they also want to do their models right where the data is," explains Hillion. The data sets that customers are wrestling with are so large that you can't easily move them from machine (or cluster) to machine (or cluster). Moreover, the R tool can only do data analysis on a data set that fits inside main memory, which is why a company called Revolution Analytics launched parallelized (and closed source) extensions to R last May and continues to enhance this tool and gain customers. Fairly savvy database analysts are the intended users for the MADlib extensions; they are not for the faint of heart.

But a new graphical tool called Alpine Miner, which comes from a bunch of Greenplum expats who formed a company called Alpine Solution (no S), is aimed at those who are new to data analytics and who need some help setting up workflows so they can set tools loose chewing on data.

The MADlib tools are free to Community Edition users, and EMC is happy to sell support contracts for the tools for those who want to pay for them. Alpine Solution is selling the GUI add-on on a per user basis, but pricing was not available at press time.

Customers are not supposed to use Greenplum Community Edition on any machine that is put into production, but there is nothing technical preventing you from doing so. (The software is not crippled like the single-node version, which had governors on it preventing it from scaling across multiple nodes.) The Community Edition license says that a "production license is required when used for internal data processing or any commercial or production purposes on servers larger than a single physical server with up to two (2) CPU sockets or a single virtual machine with up to eight (8) virtual CPU cores."
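One way to read the quoted license clause is as a simple threshold test. The sketch below is our own interpretation of that one sentence, not EMC's wording or any official tool: the `Deployment` fields, the constant names, and the assumption that anything spanning more than one machine is over the limit are all hypothetical.

```python
from dataclasses import dataclass

MAX_SOCKETS = 2  # "up to two (2) CPU sockets" on a single physical server
MAX_VCPUS = 8    # "up to eight (8) virtual CPU cores" on a single virtual machine


@dataclass
class Deployment:
    production_use: bool  # internal data processing, commercial or production use
    machines: int         # physical servers or virtual machines in the setup
    virtual: bool         # True if the machine is a virtual machine
    cpu_units: int        # sockets if physical, virtual CPU cores if virtual


def production_license_required(d: Deployment) -> bool:
    """Apply one reading of the quoted clause (an assumption, not legal advice)."""
    if not d.production_use:
        return False
    if d.machines > 1:
        # Anything larger than a single machine is over the stated limit.
        return True
    limit = MAX_VCPUS if d.virtual else MAX_SOCKETS
    return d.cpu_units > limit


if __name__ == "__main__":
    cluster = Deployment(production_use=True, machines=4, virtual=False, cpu_units=2)
    print(production_license_required(cluster))
```

Under this reading, a test cluster of any size needs no production license, while a production workload trips the requirement as soon as it outgrows one two-socket server or one eight-core virtual machine.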

It will be interesting to see how many people try to put Community Edition into production. ®