HP rolls out Hadoop AppSystem stack

Lashes elephants to Autonomy, Vertica

Top three mobile application threats

Everybody wants to get a piece of the Hadoop elephant in the room, and luckily for Hewlett-Packard the mad scramble to make sense of big data is just at the beginning of its hype cycle and there's time enough to get some choice cuts.

Hadoop ElephantAs part of its Discover customer and partner extravaganza in Las Vegas this week, HP has rolled out a new AppSystem for Apache Hadoop integrated stack, The company is also talking up the ability to link the Hadoop Distributed File System and Hadoop applications with the company's two other major (and relatively recently acquired) big data software tools.

These are the Vertica column-oriented distributed database and the Autonomy Intelligent Data Operating Layer (IDOL) 10 stack, which creates contextual information from unstructured text, audio, and video data to make it "understandable."

HP is not about to shell out a lot of money to try to buy a Hadoop distie, although a strong case could be made that perhaps it should have already done such a thing, even if the current top Hadoopers – Cloudera, Hortonworks, and MapR – would probably be too expensive relative to the revenues and profits they currently generate.

Just like HP never did buy a Linux distributor or a database maker, the company will play Switzerland and work on preconfiguring Hadoop stacks on its servers, storage, and switches and peddle add-on consulting services to IT shops and integration with its Vertica data store and Autonomy data organizer to make its dough off big data.

The AppSystem for Apache Hadoop is a reference architecture of components, just like similar AppSystems for running the Vertica data store and data analytics and data warehouse appliance software from Microsoft and SAP and email messaging setups running Windows and Exchange.

Just like HP is not picking just one data store or database for the various AppSystems launched in the last year, HP is not about to play favorites with the Hadoop distros, either. But, because HP is not silly, it is starting out with the fully open source distro from Cloudera, known as Cloudera Distribution for Hadoop, or CDH3 for short in the current release.

Specifically, the AppSystem for Hadoop loads the 64-bit edition of Red Hat Enterprise Linux 6.2 on ProLiant server nodes and then layers on the CDH3 Update 3 Hadoop toolset, which includes the core Hadoop MapReduce workload scheduler and the Hadoop Distributed File System (HDFS) as well as a slew of related open source tools to pumping data into and querying data stored in that file system.

HP's reference architecture as embodied in the AppSystem set up also encourages customers to get the Cloudera Manager 3.7.5 add-on for the CDH3u3 distribution. Steve Watt, chief technologies for Hadoop and big data at HP, tells El Reg that Cloudera Manager does a lot of Hadoop-specific, node-level configuration tasks that can make life a whole lot easier for Hadoop cluster admins. (You don't want to have to change a setting manually in 2,000 nodes, now do you?)

Some of the software in the AppSystem for Hadoop comes from HP, which is good. HP is adding in its own Insight Cluster Management Utility v7.0 software, which is a traditional HPC cluster tool that does deployment, management, and monitoring of software stacks running on nodes in a cluster. HP is also weaving in its own Insight Control v7.0 for Linux server management tools, which let admins drilled down and manage at the node level.

The AppSystem for Hadoop reference architecture is not just a recipe for software, of course, but also a very precise configuration for running a production-grade Hadoop instance. And this is what makes any reference architecture interesting because the presumption is that the IT vendor has done the capacity planning homework so you don't have to. This particular setup is designed to fit into a single rack.

There's one management node for running the HP software stack and the Cloudera Manager, one Hadoop namenode (which spreads the data over the nodes in the cluster running HDFS), one JobTracker node (which issues MapReduce jobs to the nodes for data munching), and 18 worker nodes (which is where the data replication and munching happens). Here's how it is configured:

HP's AppSystem for Hadoop reference architecture

HP's AppSystem for Hadoop reference architecture (click to enlarge)

The interesting bit is that HP is putting six-core Xeon E5-2667 processors running at 2.9GHz and slightly more than one disk spindle per core into the worker nodes. (With the modern Hadoop setups, you like to have at least one spindle per core, according to various people we have spoken to as they build their reference architectures.)

The other thing to note is that Gigabit Ethernet is perfectly fine, and the nodes don't have huge wonking chunks of memory, either. As you need to add capacity to this AppSystem for Hadoop, you slide in racks of 2U ProLiant DL380 Gen8 nodes, and you can put 19 of these in a rack with some room left over for a switch and a slideaway.

One of the problems with Hadoop clusters, says Watt, is that you can't get a visual sense of what is going on inside of the cluster as jobs are running, and people being inherently visual (well, most of us anyway), this is a drawback. So HP is bolting on its Cluster Management Utility onto the Hadoop cluster and letting admins see what the cluster is doing in terms of bottlenecks. Here's what the tool's interface shows:

Cluster Management Utility for AppSystem for Hadoop

Cluster Management Utility for AppSystem for Hadoop (click to enlarge)

Each one of those cylinders is composed of a timeslice that shows the aggregate CPU load, memory consumption, and Ethernet bandwidth both transmitting and receiving on each node of the Hadoop cluster.

As workloads expand and contract, so do the cylinders, and you can fly along the cylinders and drill down into them to do troubleshooting as you try to deal with bottlenecks. Hadoop tops out at around 4,000 nodes in a single instance (using CDH3 or any other distribution, since it is a limit of the namenode), which is well below the largest clusters that HP has in the HPC space using this tool.

At the moment, explains Paul Miller, vice president of solutions and strategic alliances for the Enterprise Servers, Storage and Networking group, HP does not have an OEM agreement with Cloudera and will be encouraging customers to buy their own licenses to Cloudera Enterprise, which includes tech support for CDH3 as well as the Cloudera Manager tool (which is not open source). And HP plans on putting together other Hadoop stacks, too.

"The Hadoop market is in its infancy right now, and different workloads will work differently on various distributions," says Miller. "Se we are going to be open and support HortonWorks, MapR, and Cloudera."

The reference architectures for all three Hadoop distributions will be available later this month, and HP expects for its Factory Express system configuration service to be able to build a complete AppSystem stack on the Cloudera stack by the fourth quarter of this year. (Why is this taking so long you ask? Good question, and while HP did not answer it, the suspicion at the Reg Elephant Desk is that it is tied to a future release of Cloudera software.)

Big Data strategy boutique

HP Technology Services is also trying to get in on the Hadoop action, with a Big Data Strategy Workshop to help IT shops try to figure out what big data they have and the best ways to chew through it. The services arm of ESSN is also running the Roadmap Service for Apache Hadoop services engagement, which is really a best practices and capacity planning session for newbies.

In both cases, you can bet Meg Whitman's last dollar that there will be lots of talk about Vertica and Autonomy during those sessions. These are fee-based engagements, although HP did not say how much they cost; presumably if you are paying, the sales pitch is more subtle and experts actually impart some knowledge.

On the Autonomy front, HP has announced the capability to put the IDOL 10 engine, which supports over 1,000 file types and connects to over 400 different kinds of data repositories, onto each node in a Hadoop cluster. So you can MapReduce the data and let Autonomy make use of it. For instance, you can use it to feed the Optimost Clickstream Analytics module for the Autonomy software, which also uses the Vertica data store for some parts of the data stream.

HP is also rolling out its Vertica 6 data store, and the big new feature is the ability to run the open source R statistical analysis programming language in parallel on the nodes where Vertica is storing data in columnar format. More details on the new Vertica release were not available at press time, but Miller says that the idea is to provider connectors between Vertica, Hadoop, and Autonomy so all of the different platforms can share information. ®

High performance access to file storage

More from The Register

next story
This time it's 'Personal': new Office 365 sub covers just two devices
Redmond also brings Office into Google's back yard
Kingston DataTraveler MicroDuo: Turn your phone into a 72GB beast
USB-usiness in the front, micro-USB party in the back
Dropbox defends fantastically badly timed Condoleezza Rice appointment
'Nothing is going to change with Dr. Rice's appointment,' file sharer promises
Inside the Hekaton: SQL Server 2014's database engine deconstructed
Nadella's database sqares the circle of cheap memory vs speed
BOFH: Oh DO tell us what you think. *CLICK*
$%%&amp Oh dear, we've been cut *CLICK* Well hello *CLICK* You're breaking up...
Just what could be inside Dropbox's new 'Home For Life'?
Biz apps, messaging, photos, email, more storage – sorry, did you think there would be cake?
IT bods: How long does it take YOU to train up on new tech?
I'll leave my arrays to do the hard work, if you don't mind
Amazon reveals its Google-killing 'R3' server instances
A mega-memory instance that never forgets
prev story


Top three mobile application threats
Learn about three of the top mobile application security threats facing businesses today and recommendations on how to mitigate the risk.
Combat fraud and increase customer satisfaction
Based on their experience using HP ArcSight Enterprise Security Manager for IT security operations, Finansbank moved to HP ArcSight ESM for fraud management.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Five 3D headsets to be won!
We were so impressed by the Durovis Dive headset we’ve asked the company to give some away to Reg readers.
SANS - Survey on application security programs
In this whitepaper learn about the state of application security programs and practices of 488 surveyed respondents, and discover how mature and effective these programs are.