Hadoop: Making Linux gobble big data
Growing penguins need petabytes to feast on
Hadoop World The Hadoop big data muncher has grown into more than Yahoo! conceived when it open-sourced its search engine indexing tool and its underlying file system back in 2009. And it has become exactly what open-source projects aspire to be: a centre of gravity around which a maelstrom of innovation coalesces.
At the Hadoop World conference in New York hosted by Cloudera, the company's top brass and techies rallied some big-name advocates to talk about how they are using Hadoop in production and what the future holds for the data processing platform. (And yes, the Cloudera people are whippersnappers and they call it data processing, so what is old is new again.) Cloudera is one of several organizations that provides tech support for the Apache Hadoop stack as well as proprietary extensions to Hadoop – to make it better suited for companies to deploy alongside their traditional, database-driven back-end systems
Mike Olson, CEO at Cloudera, opened up his keynote by reminding everyone that "big data is more than size; it is complexity" and that "Hadoop is at the core of a data analytics platform, but you need more."
Among the 1,400 attendees at the two-day conference this week, 580 companies were represented, and Olson gathered up some interesting statistics that illuminate the issues actually facing Hadoop as it matures and becomes just another tool in the data centre. For one thing, big data is not all that big out there once you leave the Facebook, Google, Yahoo, eBay, and others out of the mix. Across those companies in attendance (and Facebook and eBay were there) the average cluster size is a mere 120 server nodes, up from 66 nodes a year ago at the Hadoop World 2010 event.
Of those companies in attendance, 44 per cent had Hadoop clusters with between 10 and 100 nodes, and 52 per cent had clusters between 100 and 1,000 nodes. The name node in a Hadoop cluster – roughly akin to a head node in a regular HPC cluster – starts choking at around 4,000 nodes, which is the scalability limit of Hadoop at the moment. Those 580 companies had a total of 202PB under management by Hadoop, a factor of 3.4 higher than last year. The node count didn't quite double, but the capacity more than tripled. The largest customer attending the event had 20PB under management of Hadoop, and 76 companies had between 100TB and 1PB and 74 companies had more than 1PB.
The remaining 4 per cent of the Hadoop customers in attendance are topping out their clusters in terms of node count. Earlier this year, executives at Yahoo! confirmed to El Reg that the company had 16 clusters running Hadoop with a total of 43,000 servers underpinning it, and the expectation was to have close to 60,000 nodes running Hadoop by the end of this year. Earlier this year, Yahoo! had over 200PB of data under Hadoop management all by itself. So it looks like Yahoo! data was not included in this poll data presented by Olson.
The point of rattling off those numbers, aside from being impressed with the girth and heft of all that iron, is to point out the obvious fact that big data is not really very big at all for most companies. At least not yet. Most companies don't yet know how to capture and use all of the unstructured data that could be mashed up with the operational data stored in their back-end systems. And they won't until the Hadoop stack becomes as polished and complete as a Linux distribution. And the indications are, at least from the roadmaps set out by the folks at Cloudera who are heavily involved in the development of Apache Hadoop, that we are pretty close to having an analog to Linux.
Maybe they should have called it Cuttix?
During the keynote sessions, Charles Zedlewski, vice president of products, and Eli Collins, senior engineer, walked down memory lane for Hadoop from its internal development at Yahoo! in 2006 and 2007 to open-source project in 2009 and then peered a little down a future lane to what might be coming down the pike on the back of an elephant.
Zedlewski got to start the history lesson, and talked about how the initial MapReduce algorithm and underlying Hadoop Distributed File System (HDFS) was "a great start, but also had fairly narrow use cases," mainly creating search engines and performing "click sessionization" (throwing custom data at you as you move around a website). At the time, all of the code that was contributed to the Hadoop project was related to the core HDFS and MapReduce.
The Hadoop stack next got a database (of sorts) and a set of high level programming languages that wrapped around the kernel. These include: HBase, the column-oriented distributed data store that rides on top of HDFS (and which arguably is not ready for primetime quite yet); Zookeeper, a configuration server for clusters that helps to make applications less brittle; and Mahout, a set of machine learning algorithms for taking what MapReduce tells apps about you others like you and making recommendations to you, among other things like automatic document classification. These projects, which were outside of the core Hadoop "kernel", represented 73 per cent of the code in the stack that year.
And even with a major upgrade to Hadoop with its open-sourcing in 2009, the addition of the Pig high-level data analytics language and the Hive data warehouse and SQL-like ad hoc query language, and the Avro file format, the core Hadoop kernel still only comprised 58 per cent of the code in 2009. By last year, the Hadoop kernel represented about 37 per cent of the total, with a slew of other independent projects wrapping around Hadoop to twist and extend it to new purposes.
Here's where the Hadoop distro stands today:
Now, said Zedlewski, there are a half-dozen alternative file systems for Hadoop and multiple compression algorithms, and new projects have sprung up to bet able to move data back and forth from Hadoop and relational database management systems, to stream log and event files right into Hadoop (Flume), and to also be able to absorb files of different formats right into Hadoop (Sqoop). And in addition to the original MapReduce computation, other methods of chewing on data are being added, including MPI, graph, and stream algorithms.
More kinds of data chewing
"As the data in Hadoop becomes more valuable, you will see other forms of computation moving to that data, not just MapReduce," said Collins. Which is funny, considering that the whole mantra of MapReduce was to move computation to data, not the other way around as data processing systems have been doing since the beginning of the computer era in more than six decades ago.
As the Hadoop stack has grown in complexity, the core use cases for the software have expanded, too. Now you can do batch reporting and more sophisticated data processing, and you can also use Hadoop to gather up log files and do real-time systems management. (This is, in fact, where many companies are cutting their teeth on Hadoop before they start using their customer data.) Companies are also using it for content serving and doing real-time aggregates and counters, and oddly enough, Hadoop is becoming a kind of storage controller. "As people use Hadoop for a long time, more of the data gets cold and it starts looking like storage," said Zedlewski.
Looking ahead, Collins said that getting consistency in the Hadoop stack, regardless of who puts together the distro and sells support for it, was going to be a major effort, all brought under the auspices of the BigTop effort. Many components in the Hadoop stack show above have different interfaces and release support levels, which makes it a bit of a nightmare to actually put together a distribution.
You still have to make compromises and choices, and that is not just bad for business customers who don't want to do Hadoop stack integration, but it is also bad for business for Hadoop disties because it increases support costs. There's also a lot of redundancy in the stack components, which only time will shake out. Moreover, HBase has cross-data centre replication, but the underlying HDFS does not. That needs to change not only for the biggest Hadoop users, but for any company that wants a hot site backup of their Hadoop operations. HBase is also expected to get development frameworks to make it more friendly to developers. And because businesses are crazy about security, they want Hadoop to get a more granular security model with access control lists.
The elephant is not exactly wearing a pinstripe suit and wingtips, but it is putting on a pair of khakis and a decent shirt. Unlike many of the Hadoop geeks presenting at the conference, in fact.
The other interesting trend Collins discussed is the underlying hardware. It will soon be common to have a Hadoop host with 40, 64, or 80 cores, and companies are looking at what happens with Hadoop clusters when they move to 10GE or 40GE networks. "One host is now more powerful than what a whole rack of servers was when Google got started," said Collins.
It is also common to have server nodes with 48TB or 60TB of capacity using fat SATA disks. "We even have people running entire Hadoop clusters with just flash," said Collins. Hadoop users are looking at how to make clusters multi-tenant and how server virtualization might fit in to accomplish this and to ease with the underlying management of servers. Companies are interested in low-power X86 processors to boost the node density of their clusters, they want scalable and fault-tolerant Hadoop name nodes, and they are even contemplating how to get MapReduce algorithms to work on GPU coprocessors.
This latter effort is being spearheaded by the oil and gas industry, which already has GPUs in their clusters, said Zedlewski, adding that "this is still a pretty bleeding edge use case". ®