EMC Greenplum Hadoop elephant straddles Cisco iron
Cah. Took them long enough
Well, that took long enough. Cisco Systems and the Greenplum big data unit of server partner EMC have finally gotten together and put the Greenplum wares on Cisco's Unified Computing System servers.
In a blog posting, Raghunath Nambiar, an architect at Cisco's Server Access and Virtualization Technology Group, reveals that the two partners in the Virtual Computing Environment Company have circled back and are now offering pre-configured Hadoop stacks that marry Cisco's C-Series rack servers and Greenplum's eponymous Greenplum MR Hadoop distribution.
Greenplum doesn't like to talk about the hardware its data warehousing and Hadoop clusters run upon, mainly because EMC, as an independent disk array maker and the owner of server virtualization juggernaut VMware, has to position itself as Switzerland in the server racket. Before it was acquired by EMC in July 2010 for an undisclosed sum, Greenplum had run its heavily customized implementation of the PostgreSQL database, which was parallelized and juiced to run data warehouse clusters, on Sun Fire x86 servers from Sun Microsystems. This was a good choice at the time, given the large amount of disk capacity that Sun had crammed onto its Opteron and Xeon servers, but a bad choice in the long term because database rival Oracle ate Sun. In the wake of the Sun acquisition, Greenplum has certified its code to run on Dell, Hewlett-Packard, and Huawei Technologies x86 servers and OEMs this iron from those companies, depending on what customers want.
EMC did not, interestingly enough, plunk the Greenplum Modular Data Computing Appliance data warehouse or its Hadoop appliance, which is actually based on a rebadged Hadoop stack from MapR Technologies, on the Vblock server-storage clusters it cooked up with Cisco to chase server virtualization and private cloud business in data centers and now virtual desktops. While the B Series blade servers in the UCS family may not be suitable for Greenplum workloads, the C Series rack servers could certainly be configured in a Vblock by EMC and Cisco to run this Greenplum code, but were not.
Part of the problem was that Hadoop doesn't use external storage, so there would be no EMC iron in such a Vblock. It is very likely that EMC and Cisco were waiting for Cisco to get a little more traction in the server racket – Cisco's server business now has more than 10,000 customers and a $1bn annual revenue run rate that will probably nearly double in the next year – before committing the Greenplum wares to the UCS platform.
According to Nambiar, the fully integrated Cisco-EMC stack takes Cisco's UCS C Series rack servers and its UCS 6200 converged server-storage 10GE switches and fabric interconnects and configures up the Greenplum MR Hadoop distro to run on the boxes. (This Hadoop distro is MapR's M5 Hadoop distribution with the names changed.) The setups start at a single rack and can be expanded to cover multiple racks. The UCS 6200 switch links into UCS 2200 fabric extenders, and according to the reference architecture (PDF), the UCS C210 M2 server is the workhorse that Cisco and EMC have chosen to run Hadoop. The C210 M2 server was announced in March 2010 and is a two-socket box that uses Intel's six-core Xeon 5600 processors and will no doubt be replaced by a new machine using Intel's "Sandy Bridge-EP" Xeon E5 chip. The C210 M2 can support up to 192GB of DDR3 main memory and has room for 16 2.5-inch disk drives and one or two RAID disk controllers.
In a single-rack configuration, the Greenplum MR-UCS stack has two 48-port UCS 6248UP fabric interconnects and two 2232PP 10GE fabric extenders. These link down into 16 of the C210 M2 servers, which have 96GB of main memory and 16 1TB disk drives, an LSI MegaRAID 9261-8i disk controller, and a Cisco UCS P81F virtual interface card that presents two 10GE ports to the fabric extenders. Cisco is dropping in the six-core Xeon X5670 processors, which run at 2.93GHz. Each rack has 192 cores, 256TB of raw storage capacity, and up to 350TB of usable Hadoop capacity with three-way data replication across the nodes and data compression turned on. The nodes are configured with Red Hat Enterprise Linux Standard Edition. ®
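The single-rack figures above can be sanity-checked with a little arithmetic. The sketch below recomputes the core count and raw capacity from the per-server specs quoted in the article, then works backwards from the 350TB usable figure to the compression ratio it would imply. That ratio is an inference, not a number stated by Cisco or EMC, and the calculation assumes all 16 drives in every node hold Hadoop data and ignores filesystem and operating system overhead.

```python
# Back-of-envelope check of the single-rack Greenplum MR-UCS figures.
# Inputs are taken from the article; the compression ratio is derived
# here and is an assumption, not a vendor-quoted number.

SERVERS_PER_RACK = 16      # C210 M2 nodes per rack
SOCKETS_PER_SERVER = 2     # two-socket box
CORES_PER_SOCKET = 6       # six-core Xeon X5670
DISKS_PER_SERVER = 16      # 2.5-inch drive bays, all populated
DISK_TB = 1                # 1TB per drive
REPLICATION_FACTOR = 3     # three-way data replication
QUOTED_USABLE_TB = 350     # "up to" figure with compression turned on

cores = SERVERS_PER_RACK * SOCKETS_PER_SERVER * CORES_PER_SOCKET
raw_tb = SERVERS_PER_RACK * DISKS_PER_SERVER * DISK_TB

# Capacity left for unique data once every block is stored three times.
replicated_tb = raw_tb / REPLICATION_FACTOR

# Compression ratio needed to reach the quoted usable capacity
# (assumes no other overhead, so the real ratio would be a bit higher).
implied_compression = QUOTED_USABLE_TB / replicated_tb

print(f"cores per rack:            {cores}")
print(f"raw capacity:              {raw_tb} TB")
print(f"after 3-way replication:   {replicated_tb:.1f} TB")
print(f"implied compression ratio: {implied_compression:.1f}x")
```

The core and raw capacity numbers match the article's 192 cores and 256TB. The usable figure only works out if compression delivers roughly a fourfold reduction, which is why it is quoted as an "up to" number.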