Yahoo! seeds Hadoop startup on open source dream

Hortonworks hears a Big Data revolution


Yahoo! is creating a new company with its core Hadoop engineering team, seeking to rapidly expand the scope of the open source distributed number-crunching platform and ultimately bring it to a much wider audience. In growing the Hadoop "ecosystem" through increased work on the core Apache-based open source project, the company hopes to eventually make its money by providing training and support for the platform.

"We believe that you should be able to get a fully-working version of Hadoop from Apache. There should not be any missing functionality," says Yahoo! vice president of engineering Eric Baldeschwieler, who will become the new company's CEO. "So, anything that's necessary to making Hadoop a complete, horizontal offering, we intend on building it in open source."



It's a commercial open source pitch of the purest kind. But it will be years before we can judge whether such an idealistic plan will actually work – and there's no guarantee the company will stick to the pitch.

The new company will be known as Hortonworks, a reference to the titular elephant from Dr. Seuss's Horton Hears a Who. Hadoop is named for a yellow stuffed elephant that once belonged to the son of project founder Doug Cutting.

Bearden hears a Hadoop

In late April, The Wall Street Journal reported that Yahoo! was "weighing" a Hadoop spinoff, and that it was discussing the possibility with Silicon Valley venture capital firm Benchmark Capital. At the time, Yahoo! would neither confirm nor deny the possibility to The Register. But earlier this week, GigaOM revealed that a Benchmark-backed venture was indeed on the way.

After hiring Doug Cutting in January 2006, Yahoo! bootstrapped the Hadoop project at Apache, and it is still the project's largest contributor. The platform has long underpinned Yahoo!'s online infrastructure, and for a while, the company offered its own Hadoop distro, based on the version of the software it ran internally. But in February, it discontinued this offering, choosing to put its weight behind the core Apache project, and somewhere along the way, Benchmark Capital approached the company about building a new startup around the project.

Benchmark was previously involved in such open source outfits as Red Hat, JBoss, SpringSource, and MySQL. Benchmark's Rob Bearden – who will serve as the chief operating officer of Hortonworks – played the same role at SpringSource before the Java framework house was sold to VMware. In the wake of the VMware acquisition, Bearden tells The Register, he and his colleagues began looking for the "biggest opportunity" in today's enterprise market, and they eventually settled on Hadoop.

"We looked at a lot of things, around social media and things like that," he says. "But it was very obvious, very quickly that being able to manage 'Big Data' is the biggest problem that CIOs have to solve, and they are looking for a new platform to do that with, as opposed to their existing relational [database] and [business intelligence] technologies. It was clear that Hadoop was the way they wanted to solve the problem."

The core Hadoop project is essentially a means of processing large amounts of data across clusters of low-cost machines. Consisting of the HDFS distributed file system and the Hadoop MapReduce platform that operates atop HDFS, it "maps" data-crunching tasks across a collection of distributed machines, splitting them into tiny sub-tasks, before "reducing" the results into one master calculation.
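For readers who want to see the shape of that map-and-reduce pattern, here is a minimal single-machine sketch in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the word-count task is simply a common illustration. On a real cluster, each chunk would live on a different machine and the grouping step would move data across the network.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(chunk: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one chunk of input into (key, value) pairs."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group every value emitted by the mappers under its key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = (pair for chunk in chunks for pair in map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for a block of data held on a separate machine.
    print(word_count(["big data big", "data elephant"]))
```

Because each call to `map_phase` depends only on its own chunk, the map step can be spread across as many machines as there are chunks, which is the property the platform exploits.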

Benchmark considered investing in Cloudera, a Northern California startup that has already commercialized Hadoop. But Bearden and Benchmark didn't agree with the Cloudera business model. Cloudera uses what's sometimes called an "open core" model, offering its own open source Hadoop distro as well as a for-pay enterprise version of the platform that includes some additional proprietary tools.

"Our experience is that you have to have a pure-play model," Bearden says. "You have to be a packager or a distributor or you have to be an owner-creator. And to be that owner-creator, you have to have a majority of the committers under your company umbrella, and you have to embrace the open source methodology and the open source community."



There's a bit of a contradiction there. But the aim is to take hold of a majority of the open source project's core committers and expand the project as quickly as possible. Yahoo! had provided about 70 per cent of the Hadoop commits, and Benchmark felt this was the place to make things happen. It approached Yahoo! with the idea, and eventually, Yahoo! bit.

"It was a [pitch] well received," Bearden says. "A lot of the same thoughts were being explored at Yahoo!" Roughly twenty-five of Yahoo!'s Hadoop engineers will move to Hortonworks, including Baldeschwieler. Yahoo! will invest in the new company, which is expected to launch in July, and naturally, it will be a close partner. Baldeschwieler tells us that Hortonworks is getting Yahoo!'s "core expertise", but that some engineers on the fringes of Yahoo!'s Hadoop work will remain at the company.

Whose project is it, anyway?

Bearden insists that Hortonworks will not be a Hadoop consultant. It will provide Hadoop training and high-level support. But at least in the beginning, he says, the company's primary concern will be expanding the Apache Hadoop project. "As we make Hadoop more consumable as a platform, we create a vast ecosystem of companies and individuals that can build applications on it. Initially, we are going to be focused on the ease-of-consumption and productization of Hadoop for both the enterprise and the ecosystem in general."

Nonetheless, this puts Hortonworks in competition with Cloudera – an outfit founded by an all-star lineup of former Yahoo!, Google, Oracle, and Facebook employees – and EMC, which recently announced a for-pay Hadoop offering based on technology from Valley startup MapR. Currently, Cloudera provides support, services, and software for about 90 customers running the platform. EMC has yet to actually ship its Hadoop product, but thanks to MapR, it will provide key improvements to the Hadoop platform that are sure to please enterprise customers. The rub is that these improvements are closed source.

Despite Yahoo!'s claim to 70 per cent of Apache Hadoop commits, the open source project isn't necessarily centered on Yahoo!. In 2009, Doug Cutting left Yahoo! for Cloudera, where he's still on staff, and the startup also employs project cofounder Mike Cafarella. Facebook is another heavy contributor, and the platform is widely used by many other big web names.

Hadoop is based on research papers describing two of Google's proprietary back-end software platforms: GFS, its distributed file system, and MapReduce, the number-crunching piece. Cutting started the project for use with Nutch, his open source web crawler, but it grew into a much larger project when he joined Yahoo!. It now underpins Twitter and eBay as well as Facebook and Yahoo!.

Since the project was founded, it has been joined by myriad sister projects, including HBase (a real-time database based on Google's Bigtable), Hive (a SQL-like query language developed at Facebook), Sqoop (a connector for relational databases such as MySQL, built by Cloudera), Hue (a graphical user interface), and ZooKeeper (a means of juggling distributed services from a central location that's based on Google's Chubby platform). ®
