Upstart MapR unveils 'next generation' Hadoop

Stuffed elephant gets major surgery

Silicon Valley startup MapR has unveiled what it calls the "next generation" of Hadoop, revealing that this revamp of the open source distributed number-crunching platform drives the Hadoop appliance recently announced by EMC.

According to MapR CEO and cofounder John Schroeder, the company has rebuilt Hadoop's storage system, job tracker, and name space, providing more rapid access to data and improving the platform's ability to scale. MapR's Hadoop is not entirely open source, Schroeder says. It is now in use by "quite a few" customers, including EMC, whose Greenplum HD Enterprise Edition of Hadoop is essentially MapR technology.

Based on Google's back-end infrastructure and named after a yellow stuffed elephant, Hadoop offers a distributed file system (HDFS) and a distributed number-crunching platform (Hadoop MapReduce) as well as various other tools. It's typically used for offline data processing, though a sister project, HBase, offers a real-time distributed database.

The existing Hadoop storage system, Schroeder says, is too limiting. "It's kind of like writing to a CD-ROM. You can write data to a file, but you can't do random reads and writes, and you can't do multiple concurrent readers and writers. It forces Hadoop to become more of a batch mode platform," Schroeder tells The Register. Over the past two years, Schroeder and his team have revamped the storage system, working to ease access to data not only from Hadoop itself but from other platforms as well.

This involved setting up a network file system (NFS) mount. "What we've done is rearchitect the storage services so that they provide random read and random write for multiple readers and writers, and then we expose that as an NFS mount, so in addition to being able to use that data from Hadoop APIs, you can use all your standard Linux tools and Unix tools and applications," he says. "You can create real-time data streaming out of Hadoop. You can make Hadoop look like a big C: drive on your Windows desktop."
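To make the "random read and random write" point concrete, here is a minimal sketch in Python of the kind of in-place file access that ordinary tools rely on and that a write-once store does not allow. It uses only standard file calls; the file name and the helper functions are illustrative assumptions, and the demo runs against a temporary local directory rather than a real NFS mount.

```python
import os
import tempfile


def overwrite_at(path: str, offset: int, data: bytes) -> None:
    """Overwrite bytes in place at a given offset (a random write)."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_at(path: str, offset: int, length: int) -> bytes:
    """Read a slice from the middle of a file (a random read)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def demo(directory: str) -> bytes:
    # "events.log" is a hypothetical file name used only for illustration.
    path = os.path.join(directory, "events.log")
    with open(path, "wb") as f:
        f.write(b"0123456789")
    overwrite_at(path, 4, b"XY")   # file now holds b"0123XY6789"
    return read_at(path, 2, 6)     # bytes 2..7 of the updated file


if __name__ == "__main__":
    # A temporary directory stands in for the mounted cluster path.
    with tempfile.TemporaryDirectory() as d:
        print(demo(d))
```

If the directory passed to `demo` were a path on a mounted file system that supports these operations, the same code would run unchanged, which is the practical meaning of being able to use "standard Linux tools and Unix tools" against the data.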

MapR has also revamped Hadoop's Job Tracker, which distributes jobs across a Hadoop cluster and then manages their execution. Typically, Job Tracker is a single process on a single machine, but MapR has introduced a Job Tracker that's distributed across several machines. "Job Tracker failures are fairly common. They have a tendency to crash clusters and leave applications in an unknown state, so we've introduced a high-availability Job Tracker," Schroeder says.

Similarly, the company has rebuilt another single point of failure: NameNode, Hadoop's global namespace. MapR offers a distributed NameNode that scales to more files. "If you pull a node out of the cluster and it's running Job Tracker or NameNode, everything will continue to run without any impact."

Ben Werther – vice president of products at DataStax, a company that has put Hadoop MapReduce atop the open source Cassandra database and file system with a platform called Brisk – told us that the NameNode single point of failure is a "major, major problem" with Hadoop HDFS. Brisk avoids the problem because it does not use HDFS.

Facebook is already running a version of HDFS with no single point of failure for its new Messages platform, which makes use of HBase. And according to Amr Awadallah, the CTO of Hadoop startup Cloudera, the open source version of Hadoop will eliminate the single point of failure in "a few months".

Separately, MapR has introduced a new "data protection layer" with its Hadoop, letting you take a snapshot of your data every so often for recovery purposes. "Hadoop offers replication. You can offer multiple copies of data. But that doesn't guard you against user error. If I come in and delete a directory, it's gone," Schroeder says. "With MapR, if you do have a user or application error, you can go to a snapshot directory and recover."
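The recovery step Schroeder describes can be sketched roughly as follows, assuming a snapshot is exposed as an ordinary read-only directory tree that mirrors the live data. That layout, the directory names, and the `restore_from_snapshot` helper are assumptions made for illustration; the article does not describe MapR's actual snapshot layout or tooling.

```python
import os
import shutil
import tempfile


def restore_from_snapshot(snapshot_root: str, live_root: str, rel_path: str) -> str:
    """Copy a file or directory from a snapshot tree back into the live tree."""
    src = os.path.join(snapshot_root, rel_path)
    dst = os.path.join(live_root, rel_path)
    if os.path.isdir(src):
        # Recreate the whole directory, merging into anything that still exists.
        shutil.copytree(src, dst, dirs_exist_ok=True)
    else:
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src, dst)
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        live = os.path.join(base, "live")
        snap = os.path.join(base, "snapshot")  # hypothetical snapshot location
        os.makedirs(os.path.join(live, "reports"))
        with open(os.path.join(live, "reports", "q1.txt"), "w") as f:
            f.write("revenue figures")
        shutil.copytree(live, snap)                      # simulate taking a snapshot
        shutil.rmtree(os.path.join(live, "reports"))     # simulate the user error
        restore_from_snapshot(snap, live, "reports")
        print(os.listdir(os.path.join(live, "reports")))
```

Replication alone would not help in this scenario, since the deletion is applied to every replica; the snapshot is a point-in-time copy that the deletion does not touch.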

Cloudera's Awadallah takes issue with the MapR platform because it's not open source. "Cloudera firmly believes in the superiority and the many short-term/long-term advantages of open source over proprietary implementations of Apache Hadoop. Open source Hadoop benefits from a lot of minds continually working to improve and test the code," he tells us.

"Most importantly, the reason why CIOs/CTOs love Apache Hadoop is eliminating the nontrivial risks of vendor lock in, i.e., nobody is holding a shotgun to their head asking them to pay more fees or else the [repository] hosting all their data gets nuked. Cloudera only makes money from our customers if we are delivering value to them, but if we cease delivering value then they can continue hosting their data in HDFS and use a different vendor for the support and management applications."

Cloudera offers its own completely open source Hadoop distro, but it also offers a for-pay "enterprise" version that puts various proprietary tools atop that platform.

Schroeder claims that MapR's platform is "at least two times faster" than existing Hadoop distros, citing standard benchmarks such as Terasort. This, he says, lets you run Hadoop on fewer servers. But the benchmarks cited by Schroeder may be optimized for certain workloads and not others. Schroeder was noncommittal about plans to open source the company's work or to offer its platform outside of OEMs such as EMC. ®
