Feeds

Upstart MapR unveils 'next generation' Hadoop

Stuffed elephant gets major surgery

  • alert
  • submit to reddit

Combat fraud and increase customer satisfaction

Silicon Valley startup MapR has unveiled what it calls the "next generation" of Hadoop, revealing that this revamp of the open source distributed number-crunching platform drives the Hadoop appliance recently announced by EMC.

According to MapR CEO and cofounder John Schroeder, the company has rebuilt Hadoop's storage system, job tracker, and name space, providing more rapid access to data and improving the platform's ability to scale. MapR's Hadoop is not entirely open source, Schroeder says. It is now in use by "quite a few" customers, including EMC, whose Greenplum HD Enterprise Edition of Hadoop is essentially MapR technology.

Based on Google's back-end infrastructure and named after a yellow stuffed elephant, Hadoop offers a distributed file system (HDFS) and a distributed number-crunching platform (Hadoop MapReduce) as well as various other tools. It's typically used for offline data processing, though a sister project, Hbase, offers a real-time distributed database.

The existing Hadoop storage system, Schroeder says, is too limiting. "It's kind of like writing to a CD-ROM. You can write data to a file, but you can't do random reads and writes, and you can't do multiple concurrent readers and writers. It forces Hadoop to become more of a batch mode platform," Schroeder tells The Register. Over the past two years, Schroeder and his team have revamped the storage system, working to ease access to data not only from Hadoop itself but from other platforms as well.

This involved setting up a network file system (NFS) mount. "What we've done is rearchitect the storage services so that they provide random read and random write for multiple readers and writers, and then we expose that as an NFS mount, so in addition to being able to use that data from Hadoop APIs, you can use all your standard Linux tools and Unix tools and applications," he says. "You can create real-time data streaming out of Hadoop. You can make Hadoop look like a big C: drive on your Windows desktop."

MapR has also revamped Hadoop's Job Tracker, which distributes jobs across a Hadoop cluster and then manages their execution. Typically, Job Tracker is a single process on a single machine, but MapR has introduced a Job Tracker that's distributed across several machines. "Job Tracker failures are fairly common. They have a tendency to crash clusters and leave applications in an unknown state, so we've introduced a high-availability Job Tracker," Schroeder says.

Similarly, the company has rebuilt another single point of failure: NameNode, Hadoop's global namespace. MapR offers a distributed NameNode that scales to more files. "If you pull a node out of the cluster and it's running Job Tracker or NameNode, everything will continue to run without any impact."

Ben Werther – vice president of products at DataStax, a company that has put Hadoop MapReduce atop the open source Cassandra database and file system with a platform called Brisk – told us that the NameNode single point of failure is a "major, major problem" with Hadoop HDFS. Brisk avoids the problem because it does not use HDFS.

Facebook is already running a version of HDFS with no single point of failure for its new Messages platform, which makes use of HBase. And according to Amr Awadallah, the CTO of Hadoop startup Cloudera, the open source version of Hadoop will eliminate the single point of failure in "a few months".

Separately, MapR has introduced a new "data protection layer" with its Hadoop, letting you take a snapshot of your data every so often for recovery purposes. "Hadoop offer replication. You can offer multiple copies of data. But that doesn't guard you against user error. If I come in and delete a directory, it's gone," Schroeder says. "With MapR, if you do have a user or application error, you can go to a snapshot directory and recover."

Cloudera's Awadallah takes issue with the MapR platform because it's not open source. "Cloudera firmly believes in the superiority and the many short-term/long-term advantages of open source over proprietary implementations of Apache Hadoop. Open source Hadoop benefits from a lot of minds continually working to improve and test the code," he tells us.

"Most importantly, the reason why CIOs/CTOs love Apache Hadoop is eliminating the nontrivial risks of vendor lock in, i.e., nobody is holding a shotgun to their head asking them to pay more fees or else the [repository] hosting all their data gets nuked. Cloudera only makes money from our customers if we are delivering value to them, but if we cease delivering value then they can continue hosting their data in HDFS and use a different vendor for the support and management applications."

Cloudera offers its own completely open source Hadoop distro, but it also offers a for-pay "enterprise" version that puts various proprietary tools atop that platform.

Schroeder claims that MapR's platform is "at least two times faster" than existing Hadoop distros, citing standard benchmarks such as Terasort. This, he says, lets you run Hadoop on fewer servers. But this benchmarks cited by Schroeder may be optimized for certain workloads and not others. Schroeder was noncommittal about plans to open source the company's work or to offer its platform outside of OEMs such as EMC. ®

Combat fraud and increase customer satisfaction

More from The Register

next story
This time it's 'Personal': new Office 365 sub covers just two devices
Redmond also brings Office into Google's back yard
Kingston DataTraveler MicroDuo: Turn your phone into a 72GB beast
USB-usiness in the front, micro-USB party in the back
Dropbox defends fantastically badly timed Condoleezza Rice appointment
'Nothing is going to change with Dr. Rice's appointment,' file sharer promises
BOFH: Oh DO tell us what you think. *CLICK*
$%%&amp Oh dear, we've been cut *CLICK* Well hello *CLICK* You're breaking up...
AMD's 'Seattle' 64-bit ARM server chips now sampling, set to launch in late 2014
But they won't appear in SeaMicro Fabric Compute Systems anytime soon
Amazon reveals its Google-killing 'R3' server instances
A mega-memory instance that never forgets
Cisco reps flog Whiptail's Invicta arrays against EMC and Pure
Storage reseller report reveals who's selling what
prev story

Whitepapers

Securing web applications made simple and scalable
In this whitepaper learn how automated security testing can provide a simple and scalable way to protect your web applications.
3 Big data security analytics techniques
Applying these Big Data security analytics techniques can help you make your business safer by detecting attacks early, before significant damage is done.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Top three mobile application threats
Learn about three of the top mobile application security threats facing businesses today and recommendations on how to mitigate the risk.
Combat fraud and increase customer satisfaction
Based on their experience using HP ArcSight Enterprise Security Manager for IT security operations, Finansbank moved to HP ArcSight ESM for fraud management.