Pivotal ships eponymous Hadoop distro to the masses

An inquisitive HAWQ rides the big data elephant

Pivotal, the spinout from VMware and EMC that's trying to create the application platform of the future, is shipping its first product based on the Hadoop big data muncher.

The new offering is called Pivotal HD (short for Hadoop distribution, of course), and the 1.0 release will go after the Hadoop incumbents with add-ons and relatively cheap support contracts.

Technically speaking, this is not the first time that disk array maker and software player EMC has peddled a Hadoop distribution. In the wake of acquiring the Greenplum parallel database and related data warehouse appliance business a few years back, EMC hooked up with MapR Technologies and rebranded that company's rendition of Hadoop as its Greenplum HD variant.

But with the Pivotal HD 1.0 distribution, the EMC and VMware spinoff has gone back to the open source Apache Hadoop well and drawn out the 2.0 code as the foundation of its own 1.0 release.

Specifically, Pivotal HD includes the core Hadoop 2.0.2 software, which has the old MapReduce 1.0 framework for spewing and chewing data on a cluster of servers as well as the new YARN framework, all riding atop the Hadoop Distributed File System (HDFS).

The YARN layer (also known as MapReduce 2.0) allows other kinds of processing engines to be loaded onto the Hadoop framework and to do different kinds of scatter-gather processing across a cluster.
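
To make that concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop 2.x Java MapReduce API. It is the stock tutorial example, not anything specific to Pivotal HD, and the input and output HDFS paths are supplied on the command line:

```java
// Classic MapReduce word count against the Hadoop 2.x mapreduce API.
// Run as: hadoop jar wordcount.jar WordCount <hdfs input dir> <hdfs output dir>
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in this node's input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts shuffled in for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```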

The Pivotal HD Community Edition, which is freely distributed and has community support, includes the HBase 0.94.2 columnar database, the Hive 0.9.1 SQL query engine, the Mahout 0.8.0 machine learning layer, and the Pig 0.10.0 scripting language, all wrapped up. The Flume 1.3.1 log collector and Sqoop 1.4.2 data exchange tool are also in the Pivotal HD Community release.

The Pivotal HD Enterprise Edition adds the Spring Java framework (contributed to Pivotal by VMware) as well as the "Project Serengeti" Hadoop virtualization extensions. The Enterprise Edition also sports a data loader to suck data into HDFS from other sources, and a unified storage service that rides atop HDFS. The Pivotal Command Center is also part of the Enterprise Edition. All of these pieces are developed by Pivotal.

The HAWQ distributed SQL query engine, which has all the smarts of the Greenplum parallel database applied to HDFS so it can speak perfectly fluent SQL, is sold as an add-on for the Enterprise Edition. (El Reg is radically oversimplifying what HAWQ is, we realize.)

The different editions and add-ons to the Pivotal HD Hadoop stack

HAWQ is a real-time query engine and is used instead of Hive, which takes SQL-like queries and turns them into MapReduce batch jobs. HAWQ also competes against the Impala database layer that is in the Cloudera CDH4 distribution and the accelerated HBase functionality that is part of the MapR M7 Edition distribution.
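
Because HAWQ inherits Greenplum's PostgreSQL lineage, a client should in principle be able to talk to it with a stock Postgres JDBC driver. The sketch below leans on that assumption; the host name, port, credentials, and the orders table are invented placeholders rather than anything Pivotal documents:

```java
// Hypothetical sketch: querying HAWQ through the standard PostgreSQL JDBC
// driver, on the assumption that HAWQ keeps Greenplum's Postgres wire
// protocol. Host, port, credentials, and table are invented placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HawqQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://hawq-master.example.com:5432/analytics";
        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "secret");
             Statement stmt = conn.createStatement();
             // An ordinary SQL aggregate: HAWQ plans and executes it in
             // parallel across the cluster rather than compiling it down to
             // a MapReduce batch job the way Hive does.
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(sales) FROM orders GROUP BY region")) {
            while (rs.next()) {
                System.out.printf("%s: %d%n", rs.getString(1), rs.getLong(2));
            }
        }
    }
}
```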

In a blog post announcing the availability of the Pivotal HD 1.0 release, Saravana Krishnamurthy, head of product strategy and vision for the Hadoop stack at Pivotal, explained that there are three packages of Hadoop code being delivered today: the Community and Enterprise editions outlined above as well as something called Pivotal Single Node.

The Community Edition is a free download, and you can put it into production and use the community web support on a cluster with up to 50 nodes. (It is not clear if you have to buy a support contract or upgrade to the Enterprise Edition after this.)

The Enterprise Edition has almost all of the goodies, and it has commercial support that costs $1,000 per server node per year. That undercuts Hortonworks, which charges $12,000 for support on a ten-node starter cluster (which works out to $1,200 per node), and it is significantly lower than the $4,000 to $5,000 per node that Cloudera and MapR, respectively, charge for their commercial releases. (Both are a bit vague on pricing, and the feature sets are not perfectly analogous across the releases.) You can also get a perpetual license for the Enterprise Edition, but Pivotal will not say what it costs.

The HAWQ SQL database query layer for HDFS is priced on a per-node basis just like Pivotal HD Enterprise, but Pivotal is not divulging the price. It is hard to guess what Pivotal thinks it can charge for this function, but with relational databases costing tens of thousands of dollars per core, there is plenty of room to charge many thousands or maybe even tens of thousands of dollars per server node and still make a buck.

If you put a gun to the head of the El Reg systems desk, we would say that if a data warehouse costs $20,000 per terabyte and a raw Hadoop cluster costs $500 per terabyte just using HDFS and HBase, then Pivotal will probably split the difference with HAWQ and try to charge around $10,000 per terabyte.

Working backwards from this, if you have an HP DL380e Gen8 server node with a dozen 3TB drives, that gives you 36TB of raw capacity. And at the $10,000 per terabyte price outlined above, that would be $360,000 per node. That may sound crazy until you look at what IBM and Oracle try to charge for the enterprise editions of their relational databases. Still, this sounds far too expensive for a two-socket server.
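
For anyone who wants to fiddle with the guesswork, the back-of-the-envelope sums go like this. Remember that the $10,000 per terabyte figure is our speculation, not a Pivotal list price:

```java
// Back-of-the-envelope sketch of the per-node cost guess above.
// The $10,000/TB figure is El Reg's speculation, not a published price.
public class HawqNodeCost {
    public static void main(String[] args) {
        int drivesPerNode = 12;          // e.g. an HP DL380e Gen8 with 12 bays
        int terabytesPerDrive = 3;       // 3TB disks
        int dollarsPerTerabyte = 10_000; // speculative split-the-difference price

        int rawCapacity = drivesPerNode * terabytesPerDrive;  // 36TB
        int costPerNode = rawCapacity * dollarsPerTerabyte;   // $360,000
        System.out.printf("Raw capacity: %dTB, cost per node: $%,d%n",
                rawCapacity, costPerNode);
    }
}
```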

The point is, what Pivotal is really selling is not so much Hadoop support but the HAWQ database layer, no matter what the price is. And the newbie company is not going to start low and then try to raise the price later.

Rather, Pivotal will try to position HAWQ against relational databases for certain kinds of analytical work and will try to charge as much as the market will bear. And it will probably find that, for mainstream customers, the market will not bear anything like even half the cost of a relational database. HAWQ will have to make it up in volume, just like Linux and MySQL did.

And if Pivotal tries to price HAWQ too high, companies will shift to Project Impala and do it themselves for a lot less (including cutting out Cloudera if it charges too much for commercial support on Impala). Whoever decides to race to the bottom in the SQL-like or SQL query layer for Hadoop is going to set the price. Period.

And no one is in a hurry to start that race just yet – except Hadoop customers. ®
