Greenplum previews unified Hadoop biz-intel stack
Time to buy MapR and dust off the Aviion brand
Upstart data-warehousing appliance maker Greenplum said on Wednesday that it is getting set to mash up its data-warehousing software with its OEMed distribution of the Hadoop big data muncher to create a single über data platform that it is logically calling the Unified Analytics Platform.
But what the company did not announce - and could do if it is smart - may be more interesting in the long run.
In an ideal world, the same clusters of x86 servers that run a data warehouse would be able to also run Hadoop alternately through rapid reprovisioning, or even concurrently on the same clusters as workloads demanded it. The resulting cluster would all be managed by a single set of tools and provide a means of rapidly moving data from the relational database into the Hadoop Distributed File System or another data service like the column-oriented HBase add-on for HDFS. And while you're at it, the same iron would be able to run R statistical analysis in a distributed fashion or any other parallel calculation engine.
Like everyone else in the data analytics racket – OK, IBM, Teradata, and Oracle – Greenplum has been trying to convince customers that it can already do this. So in May the Greenplum division said it would be rolling up its own open source and beefed-up versions of Hadoop.
The Greenplum HD Community Edition took open source Hadoop – the MapReduce code and the HDFS distributed file system, as well as the companion Apache Hive (SQL-like query), HBase (a column-oriented distributed data store modeled after Google's BigTable), and ZooKeeper (a configuration server for clusters) – and rolled it all up and offered support for it.
The Greenplum HD Enterprise Edition added in proprietary features for snapshotting and replicating data on Hadoop clusters and accessing Hadoop data through an NFS interface, as well as various Hadoop management features that are compatible with Hadoop interfaces but not open source.
These smelled suspiciously like the M3 open source and M5 commercial Hadoop releases from Hadoop distie MapR Technologies, which came out of stealth in May and which rolled out its distros in June. (These MapR Hadoop stacks were just updated yesterday.)
While Greenplum was being cagey about it at the time, the company had indeed inked a distribution agreement with MapR, and is reselling the code under its Greenplum brand.
In September, Greenplum took the next logical step, launching the Modular Data Computing Appliance, a cluster that could swing both ways – PostgreSQL-derived data warehousing or Hadoop-driven big data munching – provisioning nodes on each cluster as needed to run either workload.
With the Greenplum Unified Analytics Platform, EMC will be more tightly coupling these two products and selling them as freestanding software running on EMC's OEMed servers (which come from Dell, HP, and Huawei) as well as any other x86 machines that customers want to plunk it down on – and that includes private and hybrid clouds if that is what customers want, Mike Maxey, director of product marketing at Greenplum, tells El Reg.
The secret sauce in this Unified Analytics Platform – aside from the OEM'd iron and the OEM'd Hadoop from MapR, which were sorta secrets – is Greenplum's own Chorus front-end and management tool for managing these hybrid analytics clusters, and making it easier for line-of-business managers and data scientists to ask questions and get answers.
Block diagram of the Greenplum Unified Analytics Platform
Like the juicier elements of the MapR Hadoop distribution – high availability for Hadoop NameNode and JobTracker nodes and NFS mounting of the HDFS file system – the Chorus tool created by Greenplum for its data warehousing management and now Hadoop management will remain closed source.
Chorus adds a Facebook-style collaboration interface to the analytics so that people can share data, and it also has a full metadata search so researchers can do data exploration in either structured or unstructured data.
The Chorus 2.0 management tool also knows how to spin up a sandbox inside of a data warehouse or Hadoop cluster, or spin up a data mart inside of a VMware virtual machine if you want to isolate it to do queries. The prior Chorus 1.2 release did not know how to speak Hadoop or deal with VMware virtual machines, but Chorus 2.0 will be able to do both.
The converged products will also allow Greenplum's parallel data-loading technology, called gNET, which was originally conceived as a means of moving data into and out of a data warehouse quickly or of doing queries on live subsets of data, to work with Hadoop data sets as well.
To try to get everyone speaking a simpler language, the relational side of the analytics platform will simply now be called the Greenplum Database, and it is referred to as a real-time query engine, while the other key part is Greenplum Hadoop, now called a batch processing engine.
Greenplum is trying to convince everyone in this block diagram that there is a unified data access and query layer, which supports SQL, MapReduce, SAS, MADlib, Mahout, R, and other query and data access methods, but clearly only certain access methods work on either database or HDFS data stores. The idea is to mask these as much as possible with Chorus – but how successful Greenplum will be in this remains to be seen.
And of course, with the Unified Analytics Platform, Greenplum is letting go completely of the idea that you have to run its database on hardware supplied by Greenplum. The Database, HD, and Chorus modules will be available with perpetual and subscription licenses as well as sold bundled on Data Computing Appliance iron.
Time to control the stack
EMC doesn't want to admit it wants to be in – or even needs to be in – the physical server business, particularly as the majority stakeholder in server virtualization juggernaut VMware and a company that is trying to play Switzerland in the storage space relative to other server makers.
But if EMC wants to sell clusters running Greenplum Database and HD, it needs to peddle actual physical boxes. More importantly, it has to peddle boxes that can do either job reasonably well and provide it some profits. Even though EMC is a key partner with Cisco Systems peddling the virtualization layer and storage arrays for the Vblock stacks of the "California" Unified Computing System blade servers, this iron is not used in Greenplum appliances because it is not well suited to either data warehouses or Hadoop, which require more storage on a node than the UCS blades can deliver and a much lower price tag, too.
With former Intel bigwig Pat Gelsinger as EMC's chief operating officer, it should be a no-brainer to get an OEM contract for rack-based servers directly from Intel. (Gelsinger used to run Intel's PC and server chip businesses before coming to EMC.) This might irk Cisco, but EMC has to decide if it wants to get more money or not. And thanks to the acquisition of Data General, EMC already has a well-known server brand: Aviion. Take out the extra i and we are good to go.
How ironic would it be to have Hadoop, named after a stuffed elephant, running on a server called Avion? You could call the solution Dumbo and peddle it to upper management...
And while EMC is at it – since it is dependent on MapR for its Hadoop distribution – it makes very good sense to acquire MapR before someone else does – think Oracle, Dell, IBM, or HP.
EMC, which ate Greenplum a year and a half ago, has high hopes for its analytics division. Greenplum doesn't just want to sell data appliances for relational data, it also wants to sell software for all kinds of data stores and queries – and more importantly, it wants to sell the idea that EMC is a player that can stand up against IBM, Oracle, and Teradata.
The company now has 330 customers, Maxey tells El Reg, up from 150 at the time the deal was done, and the division now has 450 employees, more than three times the number it had before Joe Tucci made Greenplum an offer it could not refuse.
"We're really starting to see the scale and heft from EMC, so it has been really good for us," says Maxey. ®