Upstart MapR unveils 'next generation' Hadoop

Stuffed elephant gets major surgery

Silicon Valley startup MapR has unveiled what it calls the "next generation" of Hadoop, revealing that this revamp of the open source distributed number-crunching platform drives the Hadoop appliance recently announced by EMC.

According to MapR CEO and cofounder John Schroeder, the company has rebuilt Hadoop's storage system, job tracker, and name space, providing more rapid access to data and improving the platform's ability to scale. MapR's Hadoop is not entirely open source, Schroeder says. It is now in use by "quite a few" customers, including EMC, whose Greenplum HD Enterprise Edition of Hadoop is essentially MapR technology.

Based on Google's back-end infrastructure and named after a yellow stuffed elephant, Hadoop offers a distributed file system (HDFS) and a distributed number-crunching platform (Hadoop MapReduce) as well as various other tools. It's typically used for offline data processing, though a sister project, HBase, offers a real-time distributed database.

The existing Hadoop storage system, Schroeder says, is too limiting. "It's kind of like writing to a CD-ROM. You can write data to a file, but you can't do random reads and writes, and you can't do multiple concurrent readers and writers. It forces Hadoop to become more of a batch mode platform," Schroeder tells The Register. Over the past two years, Schroeder and his team have revamped the storage system, working to ease access to data not only from Hadoop itself but from other platforms as well.

This involved setting up a network file system (NFS) mount. "What we've done is rearchitect the storage services so that they provide random read and random write for multiple readers and writers, and then we expose that as an NFS mount, so in addition to being able to use that data from Hadoop APIs, you can use all your standard Linux tools and Unix tools and applications," he says. "You can create real-time data streaming out of Hadoop. You can make Hadoop look like a big C: drive on your Windows desktop."
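To make the "random read and random write" point concrete, here is a minimal sketch of what in-place file access looks like through an ordinary POSIX-style file interface, the kind a mounted file system exposes to standard tools. It is an illustration only, not MapR code: the helper names `overwrite_at` and `read_at` are hypothetical, and a temporary local directory stands in for a mounted cluster path, since the article gives no mount point.

```python
import os
import tempfile


def overwrite_at(path, offset, data):
    """Overwrite bytes in place at a given offset (a random write)."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_at(path, offset, length):
    """Read `length` bytes starting at `offset` (a random read)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Assumption: a temporary directory stands in for the mounted cluster path.
mount_dir = tempfile.mkdtemp()
log_path = os.path.join(mount_dir, "events.log")

# Write a record once, as a write-once store would allow.
with open(log_path, "wb") as f:
    f.write(b"status=PENDING;id=42\n")

# Then update part of it in place, without rewriting the whole file.
overwrite_at(log_path, 7, b"DONE   ")
result = read_at(log_path, 0, 14)
print(result)
```

The in-place update in the second step is the operation Schroeder's CD-ROM comparison says the existing storage system does not allow.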

MapR has also revamped Hadoop's Job Tracker, which distributes jobs across a Hadoop cluster and then manages their execution. Typically, Job Tracker is a single process on a single machine, but MapR has introduced a Job Tracker that's distributed across several machines. "Job Tracker failures are fairly common. They have a tendency to crash clusters and leave applications in an unknown state, so we've introduced a high-availability Job Tracker," Schroeder says.

Similarly, the company has rebuilt another single point of failure: NameNode, Hadoop's global namespace. MapR offers a distributed NameNode that scales to more files. "If you pull a node out of the cluster and it's running Job Tracker or NameNode, everything will continue to run without any impact."

Ben Werther – vice president of products at DataStax, a company that has put Hadoop MapReduce atop the open source Cassandra database and file system with a platform called Brisk – told us that the NameNode single point of failure is a "major, major problem" with Hadoop HDFS. Brisk avoids the problem because it does not use HDFS.

Facebook is already running a version of HDFS with no single point of failure for its new Messages platform, which makes use of HBase. And according to Amr Awadallah, the CTO of Hadoop startup Cloudera, the open source version of Hadoop will eliminate the single point of failure in "a few months".

Separately, MapR has introduced a new "data protection layer" with its Hadoop, letting you take a snapshot of your data every so often for recovery purposes. "Hadoop offers replication. You can offer multiple copies of data. But that doesn't guard you against user error. If I come in and delete a directory, it's gone," Schroeder says. "With MapR, if you do have a user or application error, you can go to a snapshot directory and recover."
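The difference between replication and a snapshot can be sketched in a few lines. This is a toy model on a local file system, not MapR's mechanism: the directory names (`reports`, `snapshots/snap-1`) and the helpers `take_snapshot` and `restore_from_snapshot` are hypothetical, and the snapshot here is a plain copy, whereas the article does not say how MapR's snapshots are implemented.

```python
import os
import shutil
import tempfile


def take_snapshot(data_dir, snapshot_dir):
    """Record a point-in-time copy of data_dir (toy model: a full copy)."""
    shutil.copytree(data_dir, snapshot_dir)


def restore_from_snapshot(snapshot_dir, data_dir):
    """Bring back data_dir from a snapshot after an accidental delete."""
    shutil.copytree(snapshot_dir, data_dir)


# Assumption: all paths are illustrative stand-ins inside a temp directory.
root = tempfile.mkdtemp()
data_dir = os.path.join(root, "reports")
snapshot_dir = os.path.join(root, "snapshots", "snap-1")

os.makedirs(data_dir)
with open(os.path.join(data_dir, "q1.csv"), "w") as f:
    f.write("region,revenue\nwest,100\n")

take_snapshot(data_dir, snapshot_dir)

# User error: the directory is deleted. With replication alone, the delete
# would be applied to every copy of the data.
shutil.rmtree(data_dir)
deleted = not os.path.exists(data_dir)

restore_from_snapshot(snapshot_dir, data_dir)
with open(os.path.join(data_dir, "q1.csv")) as f:
    recovered = f.read()
```

A replica mirrors the current state, deletes included, while a snapshot preserves an earlier state, which is why only the latter helps with user error.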

Cloudera's Awadallah takes issue with the MapR platform because it's not open source. "Cloudera firmly believes in the superiority and the many short-term/long-term advantages of open source over proprietary implementations of Apache Hadoop. Open source Hadoop benefits from a lot of minds continually working to improve and test the code," he tells us.

"Most importantly, the reason why CIOs/CTOs love Apache Hadoop is eliminating the nontrivial risks of vendor lock in, i.e., nobody is holding a shotgun to their head asking them to pay more fees or else the [repository] hosting all their data gets nuked. Cloudera only makes money from our customers if we are delivering value to them, but if we cease delivering value then they can continue hosting their data in HDFS and use a different vendor for the support and management applications."

Cloudera offers its own completely open source Hadoop distro, but it also offers a for-pay "enterprise" version that puts various proprietary tools atop that platform.

Schroeder claims that MapR's platform is "at least two times faster" than existing Hadoop distros, citing standard benchmarks such as Terasort. This, he says, lets you run Hadoop on fewer servers. But the benchmarks cited by Schroeder may be optimized for certain workloads and not others. Schroeder was noncommittal about plans to open source the company's work or to offer its platform outside of OEMs such as EMC. ®
