Greenplum appliances swing both ways

Spinning up data warehouses and Hadoop

The Greenplum data analytics unit of disk storage giant EMC is tweaking its Data Computing Appliances, not only offering a more modular architecture and pricing scheme that lets companies start small and grow their analytics, but also allowing the mixing of the Greenplum parallel database with Hadoop nodes within the same infrastructure, firing up either kind of node as workloads demand.

The new Modular Data Computing Appliance launched today at the Strata Summit in New York is based on the same two-socket Xeon iron that was used in the product refresh that EMC did back in April for its DCA lineup.

The DCA uses a massively parallel implementation of the open source PostgreSQL database, which is plunked on a shared-nothing disks setup that stores data on dozens to hundreds of server nodes. Depending on your preferences, EMC is happy to sell you two-socket Xeon 5600 servers made by Dell, Hewlett-Packard, and Huawei.
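
To illustrate the shared-nothing idea, here is a minimal Python sketch of how rows can be hash-distributed across nodes so each row lives on exactly one node's local disks. The function name and the MD5-based hashing are illustrative assumptions, not Greenplum's actual internal algorithm:

```python
import hashlib

def node_for_row(distribution_key: str, num_nodes: int) -> int:
    """Map a row's distribution key to one of the shared-nothing nodes.

    Illustrative stand-in for an MPP database's internal hashing: each
    row is owned by exactly one node, so table scans fan out across all
    nodes in parallel with no shared disks.
    """
    digest = hashlib.md5(distribution_key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Rows with the same key always land on the same node, so joins on
# the distribution key need no cross-node data movement.
placements = [node_for_row(f"customer-{i}", 16) for i in range(1000)]
assert all(0 <= n < 16 for n in placements)
assert node_for_row("customer-42", 16) == node_for_row("customer-42", 16)
```

The same principle is why a well-chosen distribution key matters in any MPP database: a skewed key piles data onto a few nodes and wastes the parallelism.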

The plain-vanilla DCA uses 600GB SAS disks in the servers to deliver 36TB of data warehousing space (144TB with compression), while the High Capacity DCA uses 2TB SATA drives for 124TB of space (496TB with compression). The High Performance DCA loads up 336 cores and 336 SSDs to deliver 44TB of usable data warehousing capacity (176TB compressed), but boasts a database scan rate of 72GB/sec, three times that of the regular DCA, and a data load rate of 20TB/hour, twice that of the vanilla DCA cluster. EMC uses 10 Gigabit Ethernet switches from Brocade Communications to lash all the nodes together, by the way.
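
As a quick sanity check on those figures, all three quoted capacities work out to the same 4:1 compression ratio, and the stated multiples pin down the plain DCA's throughput:

```python
# Raw vs compressed capacity (TB) for the three DCA configurations,
# as quoted above.
configs = {
    "plain-vanilla": (36, 144),
    "High Capacity": (124, 496),
    "High Performance": (44, 176),
}
for name, (raw_tb, compressed_tb) in configs.items():
    ratio = compressed_tb / raw_tb
    print(f"{name}: {raw_tb}TB raw -> {compressed_tb}TB ({ratio:.0f}:1)")

# Working back from the stated multiples: the plain DCA scans at
# 72/3 = 24GB/sec and loads at 20/2 = 10TB/hour.
assert 72 / 3 == 24 and 20 / 2 == 10
```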

Back in May, EMC rolled out its own supported versions of the Hadoop MapReduce data crunching software created by Yahoo! and now controlled by the Apache Software Foundation. The company put out two different versions of the software, ran it on the High Capacity DCA appliance, and slapped the Greenplum brand on it. Greenplum HD Community Edition takes the Apache code related to Hadoop – the MapReduce code and the HDFS distributed file system, as well as the companion Apache Hive (SQL-like query), HBase (a column-oriented distributed data store modeled after Google's BigTable), and ZooKeeper (a configuration server for clusters) – rolls it all up, and offers for-fee tech support.
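
For readers new to the model, the map/shuffle/reduce pattern that Hadoop distributes across a cluster can be sketched in a few lines of single-process Python. This is a toy illustration of the programming model, not Hadoop's actual API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    # Map: emit a (key, value) pair per word in the input record.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: the framework groups all values by key; in Hadoop this
    # step moves data between cluster nodes.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate each key's values into a final result.
    return {key: sum(values) for key, values in grouped.items()}

lines = ["hadoop scales out", "hadoop stores data in hdfs"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"])  # 2
```

The point of the framework is that the map and reduce functions stay this simple while HDFS and the job scheduler handle partitioning, shuffling, and node failures across thousands of machines.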

EMC's Greenplum HD Enterprise Edition adds in proprietary features for snapshotting and replicating data on Hadoop clusters and accessing Hadoop data through an NFS interface, as well as various Hadoop management features that are compatible with Hadoop interfaces but are not themselves open source.

Back in May, EMC was asking customers to buy two separate appliances and essentially keep its Greenplum data warehouses and Hadoop clusters on separate iron. But starting today, customers can buy the Modular DCA and run Greenplum and Hadoop on the same iron.

"This is the first true mix and match of data warehousing and Hadoop in the industry," claims Mike Maxey, director of product marketing at the Greenplum unit of EMC. Teradata would probably argue with that, and maybe IBM, too.

The DCA management software can provision a node for Hadoop or for the Greenplum MPP database on the modules, as needed. The Modular DCA is also more, er, modular, in that it comes in quarter-rack units now instead of full-rack units. As you want to expand, you add database, Hadoop, or data integration modules, the latter of which run third-party extract/transform/load (ETL) or analytics software. EMC has also integrated its Data Domain deduplication and archiving software with the Greenplum DCAs, as well as its TimeFinder/Snap snapshotting and Symmetrix Remote Data Facility for remote, asynchronous replication over a wide area network.

While Greenplum is not providing pricing on the new Modular DCA machinery, Maxey said that another thing that customers wanted was a capacity-on-demand pricing model. So now Greenplum will sell you a DCA setup where you buy everything in the rack and another where you can turn processors, memory, or other features of the system on as you need them without having to pay for all the capacity in the rack. The intent here is to provide pricing that makes sense for smaller data warehouses that need more processing oomph or memory capacity, but not necessarily both at the same time.

Giving back to the Apache Hadoop community

EMC has also teamed up with Intel, VMware, Micron, Seagate Technologies, Super Micro, and Mellanox Technologies to put together a 1,000-node Hadoop cluster that is being hosted at the monstrous Switch Communications data center in Las Vegas.

While conceding that Yahoo! has done a "fantastic job" at creating and fostering the Apache Hadoop community, Maxey says that Hadoop developers need access to bigger iron.

At Yahoo!, which has around 40,000 servers dedicated to running Hadoop, the biggest cluster is running on 4,000 nodes. Think of that as the upper limit of scale at the moment for Hadoop, for all practical purposes. But most developers who are monkeying around with Hadoop are doing so on 20 or 25 nodes, says Maxey, and for a lot of Hadoop jobs, issues don't crop up until you push them to a higher scale.

And so EMC and friends are creating this 1,000-node cluster, to be called the Greenplum Analytics Workbench, of course, available to the open source community to test future releases of the code upon. The cluster should be live in January next year. ®
