Intel takes on all Hadoop disties to rule big data munching

'We do software now. Get used to it.'

Internet Security Threat Report 2014

Look out Cloudera, MapR Technologies, EMC, Hortonworks, and IBM: Intel is the new elephant in the room. Intel has been dabbling for the past two years with its own distribution of the Hadoop stack, and starting in the second quarter it will begin selling services for its own variant of the Hadoop big data muncher.

Intel is not just moving into Hadoop software and support to make some money on software and services, explained Boyd Davis, general manager of the software division of Intel's Data Center and Connected Systems Group. It also wants to accelerate the adoption of Xeon and Atom processors, as well as Intel networking and solid state storage when used in conjunction with Hadoop clusters.

In a nutshell, Intel thinks it has to do a Hadoop distribution itself to make sure it makes the server, storage, and networking money. And more importantly, Intel is thinking of Hadoop as a kind of foundational technology that is up for grabs.

Given this attitude, it is amazing that Intel doesn't just go all the way and say it needs its own Linux operating system, too. And if you think Intel is being logically inconsistent, Boyd said that with the software competency that Intel has now, it might have jumped in, had the Linux wave started today rather than in the late 1990s.

"Intel is a company that has grown in our capabilities around software a lot since the start of the Linux movement," explained Boyd at the launch event for the Intel Inside Hadoopery. "And of course, we have great partners in the Linux community that are driving the kernel forward. Who is to say if the Linux phenomenon were happening today we wouldn't try to do the same thing? I think it is a little bit more of when we have an ability to anticipate a trend as big as the Hadoop trend, if it is as big as we expect it to be. I think that just by taking an influence model we wouldn't have an opportunity to grow the market as fast."

Intel's desire, explained Boyd, was to speed up enhancements to the Hadoop tools as they run on Intel processors, chipsets, network controllers, and switch ASICs. But it is also to "add more value," which is another way of saying Intel wants more share of IT wallet.

And by the same logic, Intel could decide that the OpenStack cloud control freak is foundational technology. Or Linux. Or maybe the KVM or Xen hypervisor.  Or any number of other things, like Open Compute Project servers.

The fact is, Intel sees an opportunity in very profitable software, and it is not about to miss this one as it has so many others in the past. Like everyone else, Intel knows it needs to be a software and services company, and it is Boyd's job to make that happen in the data center. And Hadoop is the best option Intel has.

Intel has been working on Hadoop technologies since it forged a benchmarking partnership with Yahoo! and HP back in 2009, and it has gradually come to the conclusion that its expertise in designing hardware and tuning software for it gives it some advantages over other commercial Hadoop disties.

Intel has been gradually moving towards a formal Hadoop distro for years

Intel has been gradually moving towards a formal Hadoop distro for years

As it turns out, after working on benchmarking and tuning, Intel was approached by China Unicom and China Mobile to help cope with some performance issues in the Hadoop stack when running on Intel Xeon processors.

In 2011 Chipzilla, rolled up a 1.0 release of what is now being called the Intel Distribution for Apache Hadoop – presumably to be called IDH1, dropping the A the way Cloudera does it, because IDAH 1.0 looks sillier. Last fall, at the behest of these Chinese customers and to get a few new customers like publisher McGraw-Hill and genomics researcher Nextbio on board, Intel rolled up IDH2.

With the IDH3 release, Intel is grabbing the Apache Hadoop 2.0 stack and doing some significant tweaking and tuning, explained Boyd. Intel has done a lot of work with the Hadoop Distributed File System, the YARN MapReduce 2.0 distributed processing framework, the Hive SQL query tool, and the HBase key-value store, he added.

How Intel is building its Hadoop stack

How Intel is building its Hadoop stack

In the chart above, the elements in light blue are where Intel has made substantial performance enhancements, and the Hadoop stack modules in gray are just bits Intel is pulling directly from the Apache Hadoop project. Any code Intel has created to enhance any of the light blue modules will be eventually contributed back to the Apache Hadoop community; whether the community accept these changes into the Hadoop stack is another matter, and one for its members to decide. How quickly these are contributed is for Intel to decide.

And Intel promises not to fork the Hadoop code, too. "The goal is not to fork the code or drive any kind of dissension in the community," Boyd said.

That dark blue area at the top concerned Intel's Manager for Apache Hadoop, Chipzilla's very own control freak, which will not be open sourced. This tool will be used to deploy, configure, manage, and monitor the Intel Hadoop stack, and it will integrate with other Intel tools, such as its Data Center Manager power utilization control freak, its own rendition of the Lustre clustered file system, its cache acceleration software, and its Expressway service gateway.

The neat feature in this Manager for Apache Hadoop is that it is aware of AES-NI encryption and AVX and SSE4.2 instructions on Xeon E5 processors, as well as the presence of solid state disks and Intel network cards. It also has a feature called Active Tuner that Boyd says works by "using Hadoop on Hadoop" to do analytics on power and performance telemetry coming out of a Hadoop cluster to automagically tune it for a specific cluster. (Of course, it will no doubt work best on all-Intel technology.)

Internet Security Threat Report 2014

More from The Register

next story
The cloud that goes puff: Seagate Central home NAS woes
4TB of home storage is great, until you wake up to a dead device
Fat fingered geo-block kept Aussies in the dark
You think the CLOUD's insecure? It's BETTER than UK.GOV's DATA CENTRES
We don't even know where some of them ARE – Maude
Intel offers ingenious piece of 10TB 3D NAND chippery
The race for next generation flash capacity now on
Want to STUFF Facebook with blatant ADVERTISING? Fine! But you must PAY
Pony up or push off, Zuck tells social marketeers
Oi, Europe! Tell US feds to GTFO of our servers, say Microsoft and pals
By writing a really angry letter about how it's harming our cloud business, ta
SAVE ME, NASA system builder, from my DEAD WORKSTATION
Anal-retentive hardware nerd in paws-on workstation crisis
prev story


Why cloud backup?
Combining the latest advancements in disk-based backup with secure, integrated, cloud technologies offer organizations fast and assured recovery of their critical enterprise data.
A strategic approach to identity relationship management
ForgeRock commissioned Forrester to evaluate companies’ IAM practices and requirements when it comes to customer-facing scenarios versus employee-facing ones.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?
Managing SSL certificates with ease
The lack of operational efficiencies and compliance pitfalls associated with poor SSL certificate management, and how the right SSL certificate management tool can help.
Top 5 reasons to deploy VMware with Tegile
Data demand and the rise of virtualization is challenging IT teams to deliver storage performance, scalability and capacity that can keep up, while maximizing efficiency.