Greenplum appliances swing both ways
Spinning up data warehouses and Hadoop
The Greenplum data analytics unit of disk storage giant EMC is tweaking its Data Computing Appliances, not only offering a more modular architecture and pricing scheme that lets companies start small and grow their analytics, but also allowing the Greenplum parallel database to be mixed with Hadoop nodes within the same infrastructure, firing up either type of node as workloads demand.
The new Modular Data Computing Appliance, launched today at the Strata Summit in New York, is based on the same two-socket Xeon iron that was used in the product refresh that EMC did back in April for its DCA lineup.
The DCA uses a massively parallel implementation of the open source PostgreSQL database, which is plunked on a shared-nothing disk setup that stores data across dozens to hundreds of server nodes. Depending on your preferences, EMC is happy to sell you two-socket Xeon 5600 servers made by Dell, Hewlett-Packard, and Huawei.
The plain-vanilla DCA uses 600GB SAS disks in the servers to deliver 36TB of data warehousing space (144TB with compression), and the High Capacity DCA uses 2TB SATA drives for 124TB of space (496TB with compression). The High Performance DCA loads up 336 cores and 336 SSDs to deliver 44TB of usable data warehousing capacity (176TB compressed), but it has a database scan rate of 72GB/sec, three times that of the regular DCA, and a data load rate of 20TB/hour, twice that of the vanilla DCA cluster. EMC uses 10 Gigabit Ethernet switches from Brocade Communications to lash all the nodes together, by the way.
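For the arithmetically inclined, here is a back-of-envelope sketch in Python of what those figures imply. It assumes the "three times" and "twice" multipliers are exact, which EMC has not confirmed, and the variable and function names are purely illustrative.

```python
# Capacities quoted above, in TB: (usable, with compression).
capacities_tb = {
    "DCA": (36, 144),
    "High Capacity DCA": (124, 496),
    "High Performance DCA": (44, 176),
}


def compression_ratio(usable_tb, compressed_tb):
    """Return the compression ratio implied by the two quoted capacities."""
    return compressed_tb / usable_tb


# Implied compression ratio for each configuration.
ratios = {
    name: compression_ratio(usable, compressed)
    for name, (usable, compressed) in capacities_tb.items()
}

# High Performance DCA rates as quoted.
hp_scan_gb_per_sec = 72
hp_load_tb_per_hour = 20

# Assumption: the quoted multipliers (3x scan, 2x load) are exact.
implied_dca_scan_gb_per_sec = hp_scan_gb_per_sec / 3
implied_dca_load_tb_per_hour = hp_load_tb_per_hour / 2

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}:1 compression")
print(f"Implied regular DCA scan rate: {implied_dca_scan_gb_per_sec:.0f}GB/sec")
print(f"Implied regular DCA load rate: {implied_dca_load_tb_per_hour:.0f}TB/hour")
```

On those assumptions, all three configurations work out to the same 4:1 compression ratio, and the regular DCA would scan at roughly 24GB/sec and load at roughly 10TB/hour.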
Back in May, EMC rolled out its own supported versions of the Hadoop MapReduce data crunching software created by Yahoo! and now controlled by the Apache Software Foundation. The company put out two different versions of the software, ran it on the High Capacity DCA, and slapped the Greenplum brand on it. Greenplum HD Community Edition takes the Apache code related to Hadoop – the MapReduce code and the HDFS distributed file system, as well as the companion Apache Hive (SQL-like query), HBase (a column-oriented distributed data store modeled after Google's BigTable), and ZooKeeper (a configuration server for clusters) – and rolls it all up and offers for-fee tech support.
EMC's Greenplum HD Enterprise Edition adds proprietary features for snapshotting and replicating data on Hadoop clusters and for accessing Hadoop data through an NFS interface, as well as various Hadoop management features that are compatible with Hadoop interfaces but are not open source just the same.
Back in May, EMC was asking customers to buy two separate appliances and essentially keep its Greenplum data warehouses and Hadoop clusters on separate iron. But starting today, customers can buy the Modular DCA and run Greenplum and Hadoop on the same iron.
"This is the first true mix and match of data warehousing and Hadoop in the industry," claims Mike Maxey, director of product marketing at the Greenplum unit of EMC. Teradata would probably argue with that, and maybe IBM, too.
The DCA management software can provision a node for Hadoop or for the Greenplum MPP database on the modules, as needed. The Modular DCA is also more, er, modular, in that it comes in quarter rack units now instead of full rack units. As you want to expand, you add database, Hadoop, or data integration modules, the latter of which run third-party extract/transform/load (ETL) or analytics software. EMC has also integrated its Data Domain deduplication and archiving software with the Greenplum DCAs, as well as its TimeFinder/Snap snapshotting and Symmetrix Remote Data Facility for remote, asynchronous replication over a wide area network.
While Greenplum is not providing pricing on the new Modular DCA machinery, Maxey said that another thing that customers wanted was a capacity-on-demand pricing model. So now Greenplum will sell you a DCA setup where you buy everything in the rack and another where you can turn processors, memory, or other features of the system on as you need them without having to pay for all the capacity in the rack. The intent here is to provide pricing that makes sense for smaller data warehouses that need more processing oomph or memory capacity, but not necessarily both at the same time.
Giving back to the Apache Hadoop community
EMC has also teamed up with Intel, VMware, Micron, Seagate Technology, Super Micro, and Mellanox Technologies to put together a 1,000-node Hadoop cluster that is being hosted at the monstrous Switch Communications data center in Las Vegas.
While conceding that Yahoo! has done a "fantastic job" at creating and fostering the Apache Hadoop community, Maxey says that Hadoop developers need access to bigger iron.
At Yahoo!, which has around 40,000 servers dedicated to running Hadoop, the biggest cluster is running on 4,000 nodes. Think of that as the upper limit of scale at the moment for Hadoop, for all practical purposes. But most developers who are monkeying around with Hadoop are doing so on 20 or 25 nodes, says Maxey, and for a lot of Hadoop jobs, issues don't crop up until you push them to a higher scale.
And so EMC and friends are making this 1,000-node cluster, to be called the Greenplum Analytics Workbench, of course, available to the open source community for testing future releases of the code. The cluster should be live in January next year. ®
At least one El Reg reader is interested in Greenplum...
..."last all the nodes together"...I suspect you mean "lash". We used GigE on our demo system. Not sure how much lashing was involved.
..."extract/test/load (ETL)"...try extract/transform/load, but you knew that ;-)
I have heard that Greenplum use ZFS, because Greenplum has such a high performance that they move very many bits. Statistically, you face data corruption every 15 minutes at those high speeds. That is the reason they use ZFS, because ZFS protects against bit rot and data corruption.
Can anyone confirm?