Former Yahoo! Hadoop honcho uncloaks from stealth

Making elephants dance in real-time

Structure Data 2012 Yet another big data startup has uncloaked, this one tracing its roots to the Hadoop MapReduce project started and open sourced by Yahoo!

At the Structure Data 2012 conference in New York on Wednesday, Todd Papaioannou – formerly vice president of cloud architecture at the internet media company, and the head honcho during the years when Yahoo! open sourced its Hadoop MapReduce engine and Hadoop Distributed File System as an Apache project – gave a brief, and not particularly detailed, introduction to the company he has founded, called Continuuity. (Yeah, that's with a double U, just to annoy the crap out of us.)

Papaioannou said that the trick in this business is to capture the "digital exhaust" – a term coined by Google – that we emit as we go about our lives on the internet, and to make use of that vastly expanding amount of data. I would call it digital yeast myself, but then I'm a homebrewer, so you'd expect that sort of thing from me. Perhaps digital methane might be even more accurate. Gartner says that the amount of data will grow by 800 per cent over the next four years, and that 80 per cent of it will be unstructured.

Being from Yahoo!, you would expect Papaioannou to say that consumer intelligence – something that his new company will be focused on – is "the first archetypal application pattern that is emerging in the big data space." But it is about more than just targeting people on the web to serve them ads, content, and deals, or to do sentiment analysis.

To prove his point about how dramatic an effect this consumer intelligence can have, Papaioannou trotted out some statistics from the batch Hadoop operations at Yahoo!

Back in the day, Yahoo! News gave everyone the same homepage. But after gathering up data about the Yahoo! users visiting the news site – not only to serve them more appropriate ads, but also to serve them precisely targeted content, with over 3 million different homepages generated for the news site – Yahoo! was able to increase the clickthrough rate for news by more than 300 per cent.

"Obviously human editors could not have done that," explained Papaioannou, which is why Yahoo! build a content serving engine that figured out what pages to serve by mining the data about the stories that users actually read each day and feeding them more of the right stuff to keep them reading.

The problem with all of this is that Hadoop is batch oriented. "Hadoop has been a fantastic platform for doing that," says Papaioannou. "But actually, the web is moving towards much more of a real-time experience; people are expecting much more of a real-time experience."

It doesn't matter if you can sift through mounds of data to find "the signal" buried in it that tells you what to do with an end user, says Papaioannou: "You need to be able to act on that signal in real-time." And Yahoo! was not able to get Hadoop to run in real-time, despite the MapReduce Online and S4 efforts.

There has been an evolution from relational databases, which didn't scale very well, to sharded databases (like distributed MySQL), an attempt to move from enterprise scale to hyperscale web application scale, explains Papaioannou, who threw up this quick graphic:

[Graphic] The evolution of big data, according to Todd P and Continuuity

To get around the scalability barriers of the traditional relational database, you shard the database and distribute it across a bunch of database nodes, which all feed into the compute node working on the data. With systems like Hadoop, you have a single master node that controls the MapReduce job crunching the data, and you actually move the necessary compute jobs out to run on top of the data store nodes.
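
To make the two architectures concrete, here is a minimal sketch – the node names, hash scheme, and toy data are all invented, and none of this is Hadoop's or Continuuity's actual code. The first function shows classic sharding, where a key is hashed to pick the node that stores it; the second shows the Hadoop-style inversion, where the compute function is shipped out to each shard and only partial results travel back:

    # Illustrative only: invented node names and toy data, not real Hadoop code.
    import hashlib

    NODES = ["db-node-0", "db-node-1", "db-node-2"]

    def shard_for(key):
        # Classic sharding: hash the key to pick the node holding the row; the
        # data is then pulled back to a central compute node to be worked on
        digest = hashlib.md5(key.encode()).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    def map_reduce(shards, map_fn, reduce_fn):
        # Hadoop-style inversion: ship map_fn out to run where the data lives,
        # then combine the partial results on the master node
        partials = [map_fn(shard) for shard in shards]
        return reduce_fn(partials)

    print(shard_for("user:42"))                         # which node holds this key
    print(map_reduce([[1, 2], [3, 4], [5]], sum, sum))  # prints 15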

But in the future, what Papaioannou envisions is that companies will not wait to run their log files and clickstreams through a batch-oriented Hadoop cluster, but rather run the data through the compute nodes and process it all in real time, and presumably in a massively parallel fashion. He was not precise about how this might be accomplished.
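
To see the shift in miniature, compare a nightly batch job over a log file with a loop that updates its answer as each click lands. This toy Python sketch – the event format is invented, and nothing here reflects Continuuity's undisclosed design – acts on each event the moment it arrives:

    # Toy illustration of streaming versus batch; Continuuity's actual
    # technology was not disclosed, so nothing here reflects its design.
    from collections import Counter

    def process_stream(events):
        # Update per-URL click counts as each event arrives, so the "signal"
        # is available immediately instead of after a nightly batch job
        counts = Counter()
        for event in events:  # in production this would be a live feed
            counts[event["url"]] += 1
            yield event["url"], counts[event["url"]]

    clickstream = [{"url": "/news"}, {"url": "/sport"}, {"url": "/news"}]
    for url, count in process_stream(clickstream):
        print(url, count)  # /news 1, then /sport 1, then /news 2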

"This is a pretty fundamental change compared to the application architecture of Hadoop," says Papaioannou. "You walk around here and you see a lot of people talking about real-time. It's not clear to me, as an industry, that we have nailed that problem. It is clear to me that we need to solve that problem, and that the next big wave of applications is going to be real-time and to get to real-time, you have to take the human out of the loop." Just like Yahoo! did with its homepage.

Presumably, Continuuity will have something to do with that. ®
