EMC lets go of Greenplum Community Edition

Uncrippled data warehouse development

EMC's Greenplum data warehousing appliance and database division has a new Community Edition of its eponymous parallel database. The Community Edition replaces the single-node edition of the database, which was not as useful for companies trying to create parallel databases for warehouses and business analytics.

It also has some new features that will eventually make their way into the commercial version.

Greenplum Community Edition is based on the code used in the Greenplum Database 4.0 release, which is itself a heavily customized version of the PostgreSQL database that has been parallelized to run across multiple server nodes and optimized to crank through ad-hoc queries and other unnatural acts that companies want to do with the information that would otherwise be safely sequestered in their production ERP systems. Greenplum started out pairing its database with Sun Microsystems' Sun Fire X4500 Opteron-based servers, but in the wake of Sun's acquisition by Oracle last year, Greenplum added Dell PowerEdge servers to its certified iron list.

EMC acquired Greenplum in July 2010 to get into the data warehousing business, for much the same reason that Oracle bought Sun: these days, you need to tune the hardware to the software and the software to the hardware to get the best performance.

Dell's PowerEdge servers are used in the EMC Data Computing Appliance, which was announced last October, although there is no reason why the software stack cannot be sold on the Vblock servers (based on Cisco Systems UCS blade servers) that are sold by the VCE partnership between EMC, its virtualization minion VMware, and Cisco.

The Community Edition, which is available for download, is certified to run on Dell and Sun x64-based servers (the exact list of compatible machines was not available as El Reg went to press). Like the commercial-grade Greenplum 4.0 database itself, the Community Edition is not open source. It is, however, available in a prepackaged VMware virtual machine container if you want to run it on your laptop or desktop in a single-node configuration. Like the commercial-grade Greenplum parallel database, the Community Edition is supported on Oracle's 64-bit Solaris 10, Red Hat's Enterprise Linux, and Novell's SUSE Linux Enterprise Server operating systems. A variety of HP, Dell, Sun, and IBM x64 boxes have been supported on various Greenplum database releases.

Up until now, developers had to make do with a single-node version of the database to develop their data warehouses and applications. To help seed its market of potential customers, Greenplum (and then parent EMC) made the database and analytic tools available as a set of binaries that would only run on a single node. This single-node setup has had tens of thousands of downloads, Steven Hillion, data scientist at Greenplum, told El Reg.

This, of course, defeated the whole purpose of the Greenplum database, which is to parallelize PostgreSQL to radically speed up query performance. With the Community Edition, developers can throw the database across a cluster and actually get a sense of how it will perform before shelling out the cash for the commercial edition. Moreover, they can test on the complete data set, not on a representative subset of their data.
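To see why a single node misses the point, here is a minimal sketch of the scatter-and-gather pattern that a parallel database relies on: rows are spread across segments, each segment computes a partial aggregate on its own slice, and a coordinator merges the partials. This is an illustration only, not Greenplum's implementation. The function names, the round-robin distribution, and the sample rows are all assumptions made for the example, and the segments run one after another here rather than on separate servers.

```python
from collections import defaultdict

def distribute(rows, n_segments):
    """Spread rows across segments round-robin (an assumed policy for this sketch)."""
    segments = [[] for _ in range(n_segments)]
    for i, row in enumerate(rows):
        segments[i % n_segments].append(row)
    return segments

def partial_aggregate(segment_rows):
    """Work done locally on one segment: running sum and count per key."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, amount in segment_rows:
        acc[key][0] += amount
        acc[key][1] += 1
    return acc

def merge(partials):
    """Work done by the coordinator: combine partial sums and counts into averages."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            total[key][0] += s
            total[key][1] += c
    return {key: s / c for key, (s, c) in total.items()}

def parallel_avg(rows, n_segments):
    """Average amount per key, computed segment by segment and then merged."""
    return merge(partial_aggregate(seg) for seg in distribute(rows, n_segments))

# Hypothetical sample data: (region, sale amount)
rows = [("east", 10.0), ("west", 20.0), ("east", 30.0),
        ("west", 40.0), ("east", 50.0)]
result = parallel_avg(rows, n_segments=3)
```

The answer is the same whether there is one segment or thirty; only the amount of data each segment has to chew through changes, which is exactly what a single-node edition cannot demonstrate.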

Hillion was hired by EMC to run the Greenplum analytics lab back in May 2010, and was charged with bringing new analytics tools to the database and data warehousing appliances. (Hillion was director of engineering at Siebel, Kana Software, QRS, and M-Factor and is a mathematician from the University of California at Berkeley.) The first new tool that is being bundled with the Community Edition is MADlib, an open source library of analytic algorithms that Greenplum has cobbled together so customers can do predictive modeling and interpretive statistics on their data. At the moment, Greenplum's customers use SAS, Matlab, or the open source R tools to perform these functions.

"As the scale of the problems increases, while they still want to use these tools, they also want to do their models right where the data is," explains Hillion. The data sets that customers are wrestling with are so large that you can't easily move them from machine (or cluster) to machine (or cluster). Moreover, the R tool can only do data analysis on a data set that fits inside main memory, which is why a company called Revolution Analytics launched parallelized (and closed source) extensions to R last May and continues to enhance this tool and gain customers. Fairly savvy database analysts are the intended users for the MADlib extensions; they are not for the faint of heart.

But a new graphical tool called Alpine Miner, which comes from a bunch of Greenplum expats who formed a company called Alpine Solution (no S), is aimed at those who are new to data analytics and who need some help setting up workflows so they can set tools loose chewing on data.

The MADlib tools are free to Community Edition users, and EMC is happy to sell support contracts for the tools for those who want to pay for them. Alpine Solution is selling the GUI add-on on a per user basis, but pricing was not available at press time.

Customers are not supposed to use Greenplum Community Edition on any machine that is put into production, but there is nothing technical preventing you from doing so. (The software is not crippled like the single-node version, which had governors on it preventing it from scaling across multiple nodes.) The Community Edition license says that a "production license is required when used for internal data processing or any commercial or production purposes on servers larger than a single physical server with up to two (2) CPU sockets or a single virtual machine with up to eight (8) virtual CPU cores."
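The quoted clause boils down to a small decision rule, sketched below under one plain reading of the text: production use needs a license once the deployment grows past a single physical server with two sockets or a single virtual machine with eight virtual cores. The `Deployment` fields and the function name are hypothetical and exist only for this illustration. The actual license text governs.

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    nodes: int                    # number of servers or VMs in the cluster
    virtual: bool                 # True if the nodes are virtual machines
    sockets_per_node: int = 0     # physical CPU sockets (physical nodes only)
    vcpus_per_node: int = 0       # virtual CPU cores (virtual nodes only)
    production_use: bool = False  # internal data processing, commercial, or production

def needs_production_license(d: Deployment) -> bool:
    """One reading of the quoted clause; thresholds come from the license excerpt."""
    if not d.production_use:
        return False              # development and testing are not covered by the clause
    if d.nodes > 1:
        return True               # larger than a single server or a single VM
    if d.virtual:
        return d.vcpus_per_node > 8
    return d.sockets_per_node > 2

# A four-node physical cluster used in production
cluster = Deployment(nodes=4, virtual=False, sockets_per_node=2, production_use=True)
# A single eight-vCPU virtual machine used in production
small_vm = Deployment(nodes=1, virtual=True, vcpus_per_node=8, production_use=True)
```

Under this reading, `cluster` needs a production license and `small_vm` does not; the same cluster used purely for development would not trip the rule either.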

It will be interesting to see how many people try to put Community Edition into production. ®
