EMC lets go of Greenplum Community Edition

Uncrippled data warehouse development

EMC's Greenplum data warehousing appliance and database division has a new Community Edition of its eponymous parallel database. The Community Edition replaces the single-node edition of the database, which was not as useful for companies trying to create parallel databases for warehouses and business analytics.

It also has some new features that will eventually make their way into the commercial version.

Greenplum Community Edition is based on the code used in the Greenplum Database 4.0 release, which is itself a heavily customized version of the PostgreSQL database that has been parallelized to run across multiple server nodes and optimized to crank through ad-hoc queries and other unnatural acts that companies want to perform on information that would otherwise be safely sequestered in their production ERP systems. Greenplum started out pairing its database with Sun Microsystems' Sun Fire X4500 Opteron-based servers, but in the wake of Sun's acquisition by Oracle last year, Greenplum added Dell PowerEdge servers to its certified iron list.

EMC acquired Greenplum in July 2010 to get into the data warehousing business, for much the same reason that Oracle bought Sun: these days, you need to tune the hardware to the software and the software to the hardware to get the best performance.

Dell's PowerEdge servers are used in the EMC Data Computing Appliance, which was announced last October, although there is no reason why the software stack cannot be sold on the Vblock servers (based on Cisco Systems UCS blade servers) that are sold by the VCE partnership between EMC, its virtualization minion VMware, and Cisco.

The Community Edition, which you can download here, is certified to run on Dell and Sun x64-based servers (the exact list of compatible machines was not available as El Reg went to press). The code for the database is not available as an open source product, just like the commercial-grade Greenplum 4.0 database itself. It is, however, available in a prepackaged VMware virtual machine container if you want to run it on your laptop or desktop in a single-node configuration. Like the commercial-grade Greenplum parallel database, the Community Edition is supported on Oracle's 64-bit Solaris 10, Red Hat's Enterprise Linux, and Novell's SUSE Linux Enterprise Server operating systems. A variety of HP, Dell, Sun, and IBM x64 boxes have been supported on various Greenplum database releases.

Up until now, developers had to make do with a single-node version of the database to develop their data warehouses and applications. To help seed its market of potential customers, Greenplum (and then parent EMC) made the database and analytic tools available as a set of binaries that would only run on a single node. This single-node setup has had tens of thousands of downloads, Steven Hillion, data scientist at Greenplum, told El Reg.

This, of course, defeated the whole purpose of the Greenplum database, which is to parallelize PostgreSQL to radically speed up query performance. With the Community Edition, developers can throw the database across a cluster and actually get a sense of how it will perform before shelling out the cash for the commercial edition. Moreover, they can test on the complete data set, not on a representative subset of their data.
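The gap between a single node and a cluster is easier to see with a toy scatter-gather aggregation. The sketch below is not Greenplum's code: the function names, the hash-based row distribution, and the thread pool standing in for server nodes are all assumptions made for illustration. Each "segment" sums its own share of the rows, and a "master" step merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_segments):
    """Spread (key, amount) rows across segments by hashing the key.

    Hypothetical distribution scheme, for illustration only.
    """
    segments = [[] for _ in range(n_segments)]
    for key, amount in rows:
        segments[hash(key) % n_segments].append((key, amount))
    return segments


def local_aggregate(segment_rows):
    """Work done on one segment: sum the amounts per key for local rows."""
    totals = {}
    for key, amount in segment_rows:
        totals[key] = totals.get(key, 0) + amount
    return totals


def parallel_sum_by_key(rows, n_segments=4):
    """Scatter rows to segments, aggregate locally, then merge the partials."""
    segments = partition(rows, n_segments)
    # Threads stand in for server nodes here; they only show the shape of
    # the computation and do not deliver real parallel speedup in Python.
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        partials = list(pool.map(local_aggregate, segments))
    merged = {}
    for partial in partials:
        for key, subtotal in partial.items():
            merged[key] = merged.get(key, 0) + subtotal
    return merged
```

With `n_segments=1` the same query still returns the right answer, but every row is processed in one place, which is roughly the position developers were in with the single-node edition.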

Hillion was hired by EMC to run the Greenplum analytics lab back in May 2010, and was charged with bringing new analytics tools to the database and data warehousing appliances. (Hillion was director of engineering at Siebel, Kana Software, QRS, and M-Factor and is a mathematician from the University of California at Berkeley.) The first new tool being bundled with the Community Edition is MADlib, an open source library of analytic algorithms that Greenplum has cobbled together so customers can do predictive modeling and interpretive statistics on their data. At the moment, Greenplum's customers use SAS, Matlab, or the open source R tools to perform these functions.

"As the scale of the problems increases, while they still want to use these tools, they also want to do their models right where the data is," explains Hillion. The data sets that customers are wrestling with are so large that you can't easily move them from machine (or cluster) to machine (or cluster). Moreover, the R tool can only do data analysis on a data set that fits inside main memory, which is why a company called Revolution Analytics launched parallelized (and closed source) extensions to R last May and continues to enhance this tool and gain customers. Fairly savvy database analysts are the intended users for the MADlib extensions; they are not for the faint of heart.
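To make "doing the model where the data is" concrete, here is a minimal sketch that fits a straight line using only SQL aggregates, so that five summary numbers leave the database instead of every row. It uses Python's built-in sqlite3 module as a stand-in database; it does not use MADlib or Greenplum, and the table name `measurements`, its columns, and the function name are hypothetical.

```python
import sqlite3


def fit_line_in_database(conn):
    """Least-squares fit of y = slope * x + intercept.

    The database computes the sums; only five numbers are fetched,
    however many rows the (hypothetical) measurements table holds.
    """
    n, sx, sy, sxx, sxy = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) "
        "FROM measurements"
    ).fetchone()
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Tiny in-memory example: the points lie exactly on y = 2x + 1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(1, 3), (2, 5), (3, 7), (4, 9)],
)
slope, intercept = fit_line_in_database(conn)
```

The alternative, pulling every row into a client tool and fitting there, works only while the data set fits in that tool's memory, which is the limit the article describes for R.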

But a new graphical tool called Alpine Miner, which comes from a bunch of Greenplum expats who formed a company called Alpine Solution (no S), is aimed at those who are new to data analytics and who need some help setting up workflows so they can set tools loose chewing on data.

The MADlib tools are free to Community Edition users, and EMC is happy to sell support contracts for the tools for those who want to pay for them. Alpine Solution is selling the GUI add-on on a per user basis, but pricing was not available at press time.

Customers are not supposed to use Greenplum Community Edition on any machine that is put into production, but there is nothing technical preventing you from doing so. (The software is not crippled like the single-node version, which had governors on it preventing it from scaling across multiple nodes.) The Community Edition license says that a "production license is required when used for internal data processing or any commercial or production purposes on servers larger than a single physical server with up to two (2) CPU sockets or a single virtual machine with up to eight (8) virtual CPU cores."

It will be interesting to see how many people try to put Community Edition into production. ®
