Upstart MapR unveils 'next generation' Hadoop

Stuffed elephant gets major surgery

Silicon Valley startup MapR has unveiled what it calls the "next generation" of Hadoop, revealing that this revamp of the open source distributed number-crunching platform drives the Hadoop appliance recently announced by EMC.

According to MapR CEO and cofounder John Schroeder, the company has rebuilt Hadoop's storage system, job tracker, and namespace, providing more rapid access to data and improving the platform's ability to scale. MapR's Hadoop is not entirely open source, Schroeder says. It is now in use by "quite a few" customers, including EMC, whose Greenplum HD Enterprise Edition of Hadoop is essentially MapR technology.

Based on Google's back-end infrastructure and named after a yellow stuffed elephant, Hadoop offers a distributed file system (HDFS) and a distributed number-crunching platform (Hadoop MapReduce) as well as various other tools. It's typically used for offline data processing, though a sister project, HBase, offers a real-time distributed database.

The existing Hadoop storage system, Schroeder says, is too limiting. "It's kind of like writing to a CD-ROM. You can write data to a file, but you can't do random reads and writes, and you can't do multiple concurrent readers and writers. It forces Hadoop to become more of a batch mode platform," Schroeder tells The Register. Over the past two years, Schroeder and his team have revamped the storage system, working to ease access to data not only from Hadoop itself but from other platforms as well.

This involved setting up a network file system (NFS) mount. "What we've done is rearchitect the storage services so that they provide random read and random write for multiple readers and writers, and then we expose that as an NFS mount, so in addition to being able to use that data from Hadoop APIs, you can use all your standard Linux tools and Unix tools and applications," he says. "You can create real-time data streaming out of Hadoop. You can make Hadoop look like a big C: drive on your Windows desktop."
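To make the contrast with write-once storage concrete, here is a minimal Python sketch of the in-place access that ordinary file APIs allow on a mounted file system. It runs against a temporary local file. That a file on MapR's NFS mount would behave the same way is an assumption drawn from Schroeder's description, not something the article demonstrates, and the helper names are hypothetical.

```python
import os
import tempfile


def overwrite_in_place(path, offset, payload):
    """Overwrite bytes at `offset` without rewriting the rest of the file."""
    with open(path, "r+b") as f:  # read/write mode, existing contents kept
        f.seek(offset)            # random access: jump to an arbitrary position
        f.write(payload)


def read_at(path, offset, length):
    """Read `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for a file on an NFS mount (assumption: the article names no mount point).
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(b"0123456789")

overwrite_in_place(demo_path, 3, b"XYZ")  # random write in the middle of the file
result = read_at(demo_path, 0, 10)        # b"012XYZ6789"
os.remove(demo_path)
```

A write-once store would force the whole file to be rewritten to change those three bytes, which is the batch-mode limitation Schroeder describes.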

MapR has also revamped Hadoop's Job Tracker, which distributes jobs across a Hadoop cluster and then manages their execution. Typically, Job Tracker is a single process on a single machine, but MapR has introduced a Job Tracker that's distributed across several machines. "Job Tracker failures are fairly common. They have a tendency to crash clusters and leave applications in an unknown state, so we've introduced a high-availability Job Tracker," Schroeder says.

Similarly, the company has rebuilt another single point of failure: NameNode, Hadoop's global namespace. MapR offers a distributed NameNode that scales to more files. "If you pull a node out of the cluster and it's running Job Tracker or NameNode, everything will continue to run without any impact," Schroeder says.

Ben Werther – vice president of products at DataStax, a company that has put Hadoop MapReduce atop the open source Cassandra database and file system with a platform called Brisk – told us that the NameNode single point of failure is a "major, major problem" with Hadoop HDFS. Brisk avoids the problem because it does not use HDFS.

Facebook is already running a version of HDFS with no single point of failure for its new Messages platform, which makes use of HBase. And according to Amr Awadallah, the CTO of Hadoop startup Cloudera, the open source version of Hadoop will eliminate the single point of failure in "a few months".

Separately, MapR has introduced a new "data protection layer" with its Hadoop, letting you take a snapshot of your data every so often for recovery purposes. "Hadoop offers replication. You can offer multiple copies of data. But that doesn't guard you against user error. If I come in and delete a directory, it's gone," Schroeder says. "With MapR, if you do have a user or application error, you can go to a snapshot directory and recover."
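The recovery path Schroeder describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MapR's mechanism: the article does not say how snapshots are stored or where the snapshot directory lives, so the directory layout, file names, and the `restore_from_snapshot` helper below are all hypothetical, and a plain directory copy stands in for the snapshot itself.

```python
import shutil
import tempfile
from pathlib import Path


def restore_from_snapshot(snapshot_dir: Path, target_dir: Path) -> None:
    """Copy a snapshot back over a deleted or damaged directory."""
    if target_dir.exists():
        shutil.rmtree(target_dir)
    shutil.copytree(snapshot_dir, target_dir)


root = Path(tempfile.mkdtemp())
data = root / "data"
snap = root / "snapshots" / "daily"  # hypothetical layout, not from the article

data.mkdir()
(data / "events.log").write_text("row1\nrow2\n")

shutil.copytree(data, snap)          # stand-in for taking a snapshot
shutil.rmtree(data)                  # the user error: the directory is deleted

restore_from_snapshot(snap, data)    # recover from the snapshot directory
recovered = (data / "events.log").read_text()
shutil.rmtree(root)                  # clean up the demo files
```

Replication alone would not help here, because the delete would propagate to every replica. The snapshot survives because it is a separate point-in-time copy.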

Cloudera's Awadallah takes issue with the MapR platform because it's not open source. "Cloudera firmly believes in the superiority and the many short-term/long-term advantages of open source over proprietary implementations of Apache Hadoop. Open source Hadoop benefits from a lot of minds continually working to improve and test the code," he tells us.

"Most importantly, the reason why CIOs/CTOs love Apache Hadoop is eliminating the nontrivial risks of vendor lock in, i.e., nobody is holding a shotgun to their head asking them to pay more fees or else the [repository] hosting all their data gets nuked. Cloudera only makes money from our customers if we are delivering value to them, but if we cease delivering value then they can continue hosting their data in HDFS and use a different vendor for the support and management applications."

Cloudera offers its own completely open source Hadoop distro, but it also offers a for-pay "enterprise" version that puts various proprietary tools atop that platform.

Schroeder claims that MapR's platform is "at least two times faster" than existing Hadoop distros, citing standard benchmarks such as Terasort. This, he says, lets you run Hadoop on fewer servers. But the benchmarks cited by Schroeder may be optimized for certain workloads and not others. Schroeder was noncommittal about plans to open source the company's work or to offer its platform outside of OEMs such as EMC. ®
