Yahoo! seeds Hadoop startup on open source dream

Hortonworks hears a Big Data revolution

Yahoo! is creating a new company with its core Hadoop engineering team, seeking to rapidly expand the scope of the open source distributed number-crunching platform and ultimately bring it to a much wider audience. In growing the Hadoop "ecosystem" through increased work on the core Apache-based open source project, the company hopes to eventually make its money by providing training and support for the platform.

"We believe that you should be able to get a fully-working version of Hadoop from Apache. There should not be any missing functionality," says Yahoo! vice president of engineering Eric Baldeschwieler, who will become the new company's CEO. "So, anything that's necessary to making Hadoop a complete, horizontal offering, we intend on building it in open source."



It's a commercial open source pitch of the purest kind. But it will be years before we can judge whether such an idealistic plan will actually work – and there's no guarantee the company will stick to the pitch.

The new company will be known as Hortonworks, a reference to the titular elephant from Dr. Seuss's Horton Hears a Who. Hadoop is named for a yellow stuffed elephant that once belonged to the son of project founder Doug Cutting.

Bearden hears a Hadoop

In late April, The Wall Street Journal reported that Yahoo! was "weighing" a Hadoop spinoff, and that it was discussing the possibility with Silicon Valley venture capital firm Benchmark Capital. At the time, Yahoo! would neither confirm nor deny the possibility to The Register. But earlier this week, GigaOM revealed that a Benchmark-backed venture was indeed on the way.

After hiring Doug Cutting in January 2006, Yahoo! bootstrapped the Hadoop project at Apache, and it is still the project's largest contributor. The platform has long underpinned Yahoo!'s online infrastructure, and for a while, the company offered its own Hadoop distro, based on the version of the software it ran internally. But in February, it discontinued this offering, choosing to put its weight behind the core Apache project, and somewhere along the way, Benchmark Capital approached the company about building a new startup around the project.

Benchmark was previously involved in such open source outfits as Red Hat, JBoss, SpringSource, and MySQL. Benchmark's Rob Bearden – who will serve as the chief operating officer of Hortonworks – played the same role at SpringSource before the Java framework house was sold to VMware. In the wake of the VMware acquisition, Bearden tells The Register, he and his colleagues began looking for the "biggest opportunity" in today's enterprise market, and they eventually settled on Hadoop.

"We looked at a lot of things, around social media and things like that," he says. "But it was very obvious, very quickly that being able to manage 'Big Data' is the biggest problem that CIOs have to solve, and they are looking for a new platform to do that with, as opposed to their existing relational [database] and [business intelligence] technologies. It was clear that Hadoop was the way they wanted to solve the problem."

The core Hadoop project is essentially a means of processing large amounts of data across clusters of low-cost machines. Consisting of the HDFS distributed file system and the Hadoop MapReduce platform that operates atop HDFS, it "maps" data-crunching tasks across a collection of distributed machines, splitting them into tiny sub-tasks, before "reducing" the results into one master calculation.
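For readers who want to see the shape of that map-and-reduce idea, here is a minimal single-machine sketch in plain Python. It is not Hadoop code and uses no Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-counting task are our own illustrative assumptions. Each input chunk stands in for a slice of data that a real cluster would hand to a separate machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, as a framework would do between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all the values for one key into a single total."""
    return key, sum(values)


def word_count(chunks):
    """Run the three steps in sequence over a list of text chunks (hypothetical driver)."""
    mapped = []
    for chunk in chunks:  # on a cluster, each chunk would be mapped on a different machine
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    chunks = ["horton hears a who", "hadoop hears a horton"]
    print(word_count(chunks))
```

The sketch runs everything in one process. What a platform like Hadoop adds is the distribution: storing the chunks across many machines, running the map and reduce steps near the data, and coping with machines that fail along the way.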

Benchmark considered investing in Cloudera, a Northern California startup that has already commercialized Hadoop. But Bearden and Benchmark didn't agree with the Cloudera business model. Cloudera uses what's sometimes called an "open core" model, offering its own open source Hadoop distro as well as a for-pay enterprise version of the platform that includes some additional proprietary tools.

"Our experience is that you have to have a pure-play model," Bearden says. "You have to be a packager or a distributor or you have to be an owner-creator. And to be that owner-creator, you have to have a majority of the committers under your company umbrella, and you have to embrace the open source methodology and the open source community."



There's a bit of a contradiction there. But the aim is to take hold of a majority of the open source project's core committers and expand the project as quickly as possible. Yahoo! had provided about 70 per cent of the Hadoop commits, and Benchmark felt this was the place to make things happen. It approached Yahoo! with the idea, and eventually, Yahoo! bit.

"It was a [pitch] well received," Bearden says. "A lot of the same thoughts were being explored at Yahoo!" Roughly twenty-five of Yahoo!'s Hadoop engineers will move to Hortonworks, including Baldeschwieler. Yahoo! will invest in the new company, which is expected to launch in July, and naturally, it will be a close partner. Baldeschwieler tells us that Hortonworks is getting Yahoo!'s "core expertise", but that some engineers on the fringes of Yahoo!'s Hadoop work will remain at the company.

Whose project is it, anyway?

Bearden insists that Hortonworks will not be a Hadoop consultant. It will provide Hadoop training and high-level support. But at least in the beginning, he says, the company's primary concern will be expanding the Apache Hadoop project. "As we make Hadoop more consumable as a platform, we create a vast ecosystem of companies and individuals that can build applications on it. Initially, we are going to be focused on the ease-of-consumption and productization of Hadoop for both the enterprise and the ecosystem in general."

Nonetheless, this puts Hortonworks in competition with Cloudera – an outfit founded by an all-star lineup of former Yahoo!, Google, Oracle, and Facebook employees – and EMC, which recently announced a for-pay Hadoop offering based on technology from Valley startup MapR. Currently, Cloudera provides support, services, and software for about 90 customers running the platform. EMC has yet to actually ship its Hadoop product, but thanks to MapR, it will provide key improvements to the Hadoop platform that are sure to please enterprise customers. The rub is that these improvements are closed source.

Despite Yahoo!'s claim to 70 per cent of Apache Hadoop commits, the open source project isn't necessarily centered on Yahoo!. In 2009, Doug Cutting left Yahoo! for Cloudera, where he's still on staff, and the startup also employs project cofounder Mike Cafarella. Facebook is another heavy contributor, and the platform is widely used by many other big web names.

Hadoop is based on research papers describing two of Google's proprietary back-end software platforms: GFS, its distributed file system, and MapReduce, the number-crunching piece. Cutting started the project for use with Nutch, his open source web crawler, but it grew into a much larger project when he joined Yahoo!. It now underpins Twitter and eBay as well as Facebook and Yahoo!.

Since the project was founded, it has been joined by myriad sister projects, including HBase (a real-time database based on Google's BigTable), Hive (a SQL-like query language developed at Facebook), Sqoop (a connector for relational databases such as MySQL, built by Cloudera), Hue (a graphical user interface), and ZooKeeper (a service for coordinating distributed services from a central location, based on Google's Chubby platform). ®
