IBM's tools give Big Data a good seeing to
Company shares nothing but Hadoop and GPFS
IBM is drawing on Hadoop ideas to make its General Parallel File System capable of dealing with Big Data - extremely large data sets - for cloud-based analytic computing.
Announced at the Supercomputing 2010 conference, the General Parallel File System-Shared Nothing Cluster (GPFS-SNC) project at IBM Research Almaden involves an architecture designed to provide higher availability through clustering technologies, dynamic file system management and replication.
GPFS is the basis for IBM's High Performance Computing Systems, Information Archive, Scale-Out NAS (SONAS), and Smart Business Compute Cloud. GPFS-SNC is a distributed, shared-nothing, computing architecture in which each node is self-sufficient; tasks are divided up between these independent computers and no one node waits on any other.
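IBM has not published GPFS-SNC's internals, but the shared-nothing idea itself is easy to sketch: data is partitioned across nodes (here by hashing the key), and each node works only on its own partition, with no shared state and no node waiting on another. The node count and the per-node task below are purely illustrative, not anything from IBM.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Toy shared-nothing sketch: records are hash-partitioned across "nodes",
    // each of which processes its own partition independently.
    public class SharedNothingSketch {
        public static void main(String[] args) throws Exception {
            int nodes = 4; // illustrative node count
            List<List<String>> partitions = new ArrayList<>();
            for (int i = 0; i < nodes; i++) partitions.add(new ArrayList<>());

            // Route each record to the node that owns its hash bucket.
            for (String key : List.of("alpha", "beta", "gamma", "delta", "alpha")) {
                partitions.get(Math.floorMod(key.hashCode(), nodes)).add(key);
            }

            // Each "node" works on its own partition in parallel; no shared state.
            ExecutorService pool = Executors.newFixedThreadPool(nodes);
            List<Future<Integer>> counts = new ArrayList<>();
            for (List<String> local : partitions) {
                Callable<Integer> task = local::size; // purely local work
                counts.add(pool.submit(task));
            }
            for (Future<Integer> c : counts) System.out.println(c.get());
            pool.shutdown();
        }
    }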
Hadoop, which is used by Yahoo!, evolved from Google's MapReduce technology for computations involving petabyte-level data sets distributed across thousands of commodity hardware-based computational nodes. The Hadoop Distributed File System (HDFS) is a distributed, scalable and portable file system, written in Java, involving a cluster of data nodes.
HDFS is aware of the location, in a network switch sense, of the servers (worker nodes) in the cluster, and the system uses this to have them compute against data that is local to them, reducing data traffic across the network. Copies of data are kept on different sets of worker nodes, this replication providing redundancy and high availability, without RAID, should a worker node, rack or network switch fail.
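For readers who want to see that locality machinery directly, Hadoop's Java API will report which hosts hold the replicas of each block of a file; a job scheduler uses exactly this information to place computation next to the data. A minimal sketch, assuming a reachable cluster configuration on the classpath and a hypothetical file path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Print which worker nodes hold replicas of each block of an HDFS file.
    public class BlockLocality {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml etc
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/logs/part-00000"); // hypothetical path
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                // Each block is normally replicated on several hosts (three by default).
                System.out.println(block.getOffset() + " -> "
                        + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }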
HDFS is not POSIX-compliant and one aspect of the GPFS-SNC project is to provide POSIX-compliance. GPFS on its own is POSIX-compliant.
IBM says running data analytics applications in the cloud on extremely large data sets is gaining traction because it is affordable and the underlying infrastructure can store and compute the immense amount of data involved. A POSIX interface means traditional applications can use these cloud resources without modification.
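The practical difference POSIX compliance makes is that an application written against the standard file API can read a GPFS mount unchanged, whereas reaching into HDFS means going through Hadoop's own FileSystem classes. A rough sketch of the two paths; the mount point and file names are invented for illustration:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PosixVersusHdfs {
        public static void main(String[] args) throws Exception {
            // POSIX-compliant file system (eg a GPFS mount): ordinary file I/O just works.
            byte[] posixBytes = Files.readAllBytes(Paths.get("/gpfs/data/report.csv"));
            System.out.println("POSIX read: " + posixBytes.length + " bytes");

            // HDFS: the same data has to be reached through Hadoop's own API.
            FileSystem hdfs = FileSystem.get(new Configuration());
            try (FSDataInputStream in = hdfs.open(new Path("/data/report.csv"))) {
                byte[] buf = new byte[4096];
                int n = in.read(buf); // read the first chunk
                System.out.println("HDFS read: " + n + " bytes");
            }
            hdfs.close();
        }
    }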
The end-user apps IBM has in mind are things like business intelligence, digital media processing and surveillance video searches. GPFS-SNC technology decomposes the large computation involved into a set of smaller parallelisable computations. IBM reckons GPFS-SNC can work around the frequent failures expected in large-scale commodity server and storage deployments, while being an efficient user of compute, storage and network resources.
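IBM hasn't said how GPFS-SNC carves those computations up, but the decomposition style it is being measured against is the standard Hadoop MapReduce pattern: a map step runs in parallel next to the data, and a reduce step combines the partial results. The canonical Hadoop word-count job gives the flavour:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map: each worker tokenises its locally stored input split and emits (word, 1).
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: partial counts for the same word are summed into one total.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // pre-sum locally before the shuffle
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }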
IBM's announcement statement says GPFS-SNC "will convert terabytes of pure information into actionable insights twice as fast as previously possible... the design provides a common file system and namespace across disparate computing platforms, streamlining the process and reducing disk space."
The GPFS-SNC project is likely to be used in the EU-funded, IBM-led VISION cloud project announced at the beginning of November. ®
Do you trust IBM?
In preparation for a talk I gave on Hadoop, I asked my contacts within IBM's Information Management pillar about who/what is driving IBM's Hadoop effort. In the Hadoop world, Cloudera and IBM offer support. (Well, IBM says they offer support, but getting information is like pulling teeth.)
IBM's consultants that grok Hadoop are in their Emerging Tech group. IBM offers a free version of its Hadoop release, but it's only 32-bit and its 'up value' enhancement is a cloud configurator that is less than useless. (Hint: you don't need a cloud configurator tool.)
Emerging Tech isn't part of SWG's Information Management pillar, so any consulting being done is on IBM Information Management paper but isn't done by IM lab services staff. IBM's sales staff are not capable of talking to customers about Hadoop/HBase/Hive/Pig etc, so they cannot effectively explain the value-add and risks of choosing Hadoop.
There is no Hadoop product manager, or anyone who handles the business/marketing side of IBM's offering.
There is no IBM pointy-haired exec who handles 'Big Data' as part of their growing IM portfolio. IBM acquired SPSS, yet the SPSS people, probably the closest thing IBM has to analytics gurus, are not in charge of this.
So how can you trust IBM to deliver something when they have no actual product offering or core center of competence?
You'd be better off hiring someone like myself, who'll cost you less than half of what IBM would charge for my skill set. (Yes, I do know IBM's Information Management rates for lab services consultants.) The reason you pay higher rates to IBM lab services is that if something goes wrong, IBM will make you whole. But how can they do that if their in-house talent isn't up to the job?
Sorry, but while IBM is slowly dipping its toe in the water, it is still far from being a trusted provider of these services.
I hope that IBM's takeaway is that until they jump in and develop a 'Big Data' strategy... they shouldn't be taken as a serious player.
Sorry for the mini-rant, but I'm getting sick of all of the big companies like IBM and Microsoft jumping on the 'Cloud' bandwagon because it's the latest popular buzzword. To the CLOUDS!
The flame, for IBM and not El Reg.