Former Yahoo! Hadoop honcho uncloaks from stealth

Making elephants dance in real-time

Structure Data 2012 Yet another big data startup has uncloaked, and this one traces its roots to the Hadoop MapReduce project started by, and open sourced by, Yahoo!

At the Structure Data 2012 conference in New York on Wednesday, Todd Papaioannou, formerly vice president of cloud architecture at the internet media company and the head honcho through the years when Yahoo! took its Hadoop MapReduce engine and Hadoop Distributed File System open source as an Apache project, gave a brief – and not particularly detailed – introduction to the company he has founded, called Continuuity. (Yeah, that's with a double U, just to annoy the crap out of us.)

Papaioannou said that the trick in this business is to capture the "digital exhaust" – a term coined by Google – that we emit as we go about our lives on the internet, and to make use of that vastly expanding amount of data. I would call it digital yeast myself, but as I am a homebrewer, you'd expect that sort of thing from me. Perhaps digital methane might be even more accurate. Gartner says that the amount of data will grow by 800 per cent over the next four years, and that 80 per cent of that data will be unstructured.

Being from Yahoo!, you would expect Papaioannou to say that consumer intelligence – something that his new company will be focused on – is "the first archetypal application pattern that is emerging in the big data space." But it is about more than just targeting people on the web to serve them ads, content, and deals, or to do sentiment analysis.

To prove his point about how dramatic an effect this consumer intelligence can have, Papaioannou trotted out some statistics from the batch Hadoop operations at Yahoo!

Back in the day, Yahoo! News gave everyone the same homepage. But after gathering up data about Yahoo! users who go to the news site, and not only serving them up more appropriate ads, but also serving them up precisely targeted content – over 3 million different homepages were generated for the news site – Yahoo! was able to increase the clickthrough rate for news by more than 300 per cent.

"Obviously human editors could not have done that," explained Papaioannou, which is why Yahoo! build a content serving engine that figured out what pages to serve by mining the data about the stories that users actually read each day and feeding them more of the right stuff to keep them reading.

The problem with all of this is that Hadoop is batch oriented. "Hadoop has been a fantastic platform for doing that," says Papaioannou. "But actually, the web is moving towards much more of a real-time experience, people are expecting much more of a real-time experience."

It doesn't matter if you can sift through mounds of data to find "the signal" buried in it that tells you what to do with an end user, says Papaioannou. You need to be able to act on that signal in real time. And Yahoo! was not able to get Hadoop to run in real-time, despite the MapReduce Online and S4 efforts.

There has been an evolution from relational databases, which didn't scale very well, to sharded databases (like distributed MySQL) to try to move from enterprise to hyperscale Web application scales, explains Papaioannou, who threw up this quick graphic:

The evolution of big data, according to Todd P and Continuuity

To get around the scalability barriers of the traditional relational database, you shatter the database and distribute it across a bunch of database nodes, which all feed into the compute nodes working on the data. With systems like Hadoop, you have a single master node that controls the MapReduce job that is crunching data, and you actually move the necessary compute jobs out to run on top of the data store nodes.
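For a flavour of what that sharding step looks like in practice, here is a minimal sketch – the hash-routing scheme, the shard count, and the function names are illustrative assumptions on our part, not anything Papaioannou or Continuuity has shown:

```python
# Minimal sketch of hash-based sharding: the client picks the shard,
# so each database node holds only a slice of the keyspace.
import hashlib

SHARD_COUNT = 4  # four hypothetical MySQL instances

def shard_for(user_id: str) -> int:
    """Route a user to a shard by hashing the key."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % SHARD_COUNT

# The compute tier then queries only the node that owns the key --
# the data comes to the compute. Hadoop inverts this: the MapReduce
# job is shipped out to run on the nodes that store the data.
print(shard_for("user-42"))  # e.g. 3
```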

But in the future, what Papaioannou envisions is that companies will not wait to run their log files and clickstreams through a batch-oriented Hadoop cluster, but rather run the data through the compute nodes and process it all in real time, and presumably in a massively parallel fashion. He was not precise about how this might be accomplished.
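As a rough illustration of the difference – and only an illustration, since Papaioannou gave no specifics – a real-time pipeline updates its answers as each clickstream event arrives, rather than waiting for a nightly MapReduce pass over the log files:

```python
# Toy sketch of incremental (streaming) processing versus batch:
# click counts are updated per event, so the answer is always current.
from collections import Counter

clicks_per_story = Counter()

def on_click_event(event: dict) -> None:
    """Handle one clickstream event the moment it arrives."""
    clicks_per_story[event["story_id"]] += 1
    # A real system would feed this straight into the content-serving
    # engine; a batch system would only see it in the next Hadoop run.

# Simulated stream of events
for ev in [{"story_id": "s1"}, {"story_id": "s2"}, {"story_id": "s1"}]:
    on_click_event(ev)

print(clicks_per_story.most_common(1))  # [('s1', 2)]
```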

"This is a pretty fundamental change compared to the application architecture of Hadoop," says Papaioannou. "You walk around here and you see a lot of people talking about real-time. It's not clear to me, as an industry, that we have nailed that problem. It is clear to me that we need to solve that problem, and that the next big wave of applications is going to be real-time and to get to real-time, you have to take the human out of the loop." Just like Yahoo! did with its homepage.

Presumably, Continuuity will have something to do with that. ®
