Yahoo! seeds Hadoop startup on open source dream

Hortonworks hears a Big Data revolution

Yahoo! is creating a new company with its core Hadoop engineering team, seeking to rapidly expand the scope of the open source distributed number-crunching platform and ultimately bring it to a much wider audience. In growing the Hadoop "ecosystem" through increased work on the core Apache-based open source project, the company hopes to eventually make its money by providing training and support for the platform.

"We believe that you should be able to get a fully-working version of Hadoop from Apache. There should not be any missing functionality," says Yahoo! vice president of engineering Eric Baldeschwieler, who will become the new company's CEO. "So, anything that's necessary to making Hadoop a complete, horizontal offering, we intend on building it in open source."



It's a commercial open source pitch of the purest kind. But it will be years before we can judge whether such an idealistic plan will actually work – and there's no guarantee the company will stick to the pitch.

The new company will be known as Hortonworks, a reference to the titular elephant from Dr. Seuss's Horton Hears a Who. Hadoop is named for a yellow stuffed elephant that once belonged to the son of project founder Doug Cutting.

Bearden hears a Hadoop

In late April, The Wall Street Journal reported that Yahoo! was "weighing" a Hadoop spinoff, and that it was discussing the possibility with Silicon Valley venture capital firm Benchmark Capital. At the time, Yahoo! would neither confirm nor deny the possibility to The Register. But earlier this week, GigaOM revealed that a Benchmark-backed venture was indeed on the way.

After hiring Doug Cutting in January 2006, Yahoo! bootstrapped the Hadoop project at Apache, and it is still the project's largest contributor. The platform has long underpinned Yahoo!'s online infrastructure, and for a while, the company offered its own Hadoop distro, based on the version of the software it ran internally. But in February, it discontinued this offering, choosing to put its weight behind the core Apache project. Somewhere along the way, Benchmark Capital approached the company about building a new startup around the project.

Benchmark was previously involved in such open source outfits as Red Hat, JBoss, SpringSource, and MySQL. Benchmark's Rob Bearden – who will serve as the chief operating officer of Hortonworks – played the same role at SpringSource before the Java framework house was sold to VMware. In the wake of the VMware acquisition, Bearden tells The Register, he and his colleagues began looking for the "biggest opportunity" in today's enterprise market, and they eventually settled on Hadoop.

"We looked at a lot of things, around social media and things like that," he says. "But it was very obvious, very quickly that being able to manage 'Big Data' is the biggest problem that CIOs have to solve, and they are looking for a new platform to do that with, as opposed to their existing relational [database] and [business intelligence] technologies. It was clear that Hadoop was the way they wanted to solve the problem."

The core Hadoop project is essentially a means of processing large amounts of data across clusters of low-cost machines. Consisting of the HDFS distributed file system and the Hadoop MapReduce platform that operates atop HDFS, it "maps" data-crunching tasks across a collection of distributed machines, splitting them into tiny sub-tasks, before "reducing" the results into one master calculation.
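
To make the "map" and "reduce" steps concrete, here is a minimal single-machine sketch in Python that counts words. It is an illustration of the general pattern only, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, the input "chunks" stand in for blocks of a file that HDFS would spread across machines, and everything here runs in one process rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the intermediate values by key so each key is reduced once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all the values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different machine;
    # here the sub-tasks simply run one after another (an assumption made
    # to keep the sketch self-contained).
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big elephant"]))
```

Because each chunk is mapped independently and each key is reduced independently, both phases can in principle be spread over many low-cost machines, which is the property the platform exploits.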

Benchmark considered investing in Cloudera, a Northern California startup that has already commercialized Hadoop. But Bearden and Benchmark didn't agree with the Cloudera business model. Cloudera uses what's sometimes called an "open core" model, offering its own open source Hadoop distro as well as a for-pay enterprise version of the platform that includes some additional proprietary tools.

"Our experience is that you have to have a pure-play model," Bearden says. "You have to be a packager or a distributor or you have to be an owner-creator. And to be that owner-creator, you have to have a majority of the committers under your company umbrella, and you have to embrace the open source methodology and the open source community."



There's a bit of a contradiction there. But the aim is to take hold of a majority of the open source project's core committers and expand the project as quickly as possible. Yahoo! had provided about 70 per cent of the Hadoop commits, and Benchmark felt this was the place to make things happen. It approached Yahoo! with the idea, and eventually, Yahoo! bit.

"It was a [pitch] well received," Bearden says. "A lot of the same thoughts were being explored at Yahoo!" Roughly twenty-five of Yahoo!'s Hadoop engineers will move to Hortonworks, including Baldeschwieler. Yahoo! will invest in the new company, which is expected to launch in July, and naturally, it will be a close partner. Baldeschwieler tells us that Hortonworks is getting Yahoo!'s "core expertise", but that some engineers on the fringes of Yahoo!'s Hadoop work will remain at the company.

Whose project is it, anyway?

Bearden insists that Hortonworks will not be a Hadoop consultant. It will provide Hadoop training and high-level support. But at least in the beginning, he says, the company's primary concern will be expanding the Apache Hadoop project. "As we make Hadoop more consumable as a platform, we create a vast ecosystem of companies and individuals that can build applications on it. Initially, we are going to be focused on the ease-of-consumption and productization of Hadoop for both the enterprise and the ecosystem in general."

Nonetheless, this puts Hortonworks in competition with Cloudera – an outfit founded by an all-star lineup of former Yahoo!, Google, Oracle, and Facebook employees – and EMC, which recently announced a for-pay Hadoop offering based on technology from Valley startup MapR. Currently, Cloudera provides support, services, and software for about 90 customers running the platform. EMC has yet to actually ship its Hadoop product, but thanks to MapR, it will provide key improvements to the Hadoop platform that are sure to please enterprise customers. The rub is that these improvements are closed source.

Despite Yahoo!'s claim to 70 per cent of Apache Hadoop commits, the open source project isn't necessarily centered on Yahoo!. In 2009, Doug Cutting left Yahoo! for Cloudera, where he's still on staff, and the startup also employs project cofounder Mike Cafarella. Facebook is another heavy contributor, and the platform is widely used by many other big web names.

Hadoop is based on research papers describing two of Google's proprietary back-end software platforms: GFS, its distributed file system, and MapReduce, the number-crunching piece. Cutting started the project for use with Nutch, his open source web crawler, but it grew into a much larger project when he joined Yahoo!. It now underpins Twitter and eBay as well as Facebook and Yahoo!.

Since the project was founded, it has been joined by myriad sister projects, including HBase (a real-time database based on Google's BigTable), Hive (a SQL-like query language developed at Facebook), Sqoop (a MySQL connector built by Cloudera), Hue (a graphical user interface), and ZooKeeper (a means of juggling distributed services from a central location that's based on Google's Chubby platform). ®
