Feeds

'Amazon can't do what we do': Twitter-miner's BYO data centre heresy

DataSift strains with Hadoop

Beginner's guide to SSL certificates

Hadoop strain

Halstead told The Reg: “Hadoop especially puts incredible strain on networking switches, and the cost and level of management required is beyond even the best data centres. We have trained Cisco operations people ourselves and as we grow the complexity of network infrastructure will require that we move completely away from 'managed' and to having full-time staff looking after our own hardware.

“The costs of just co-lo and managed at larger scales do not make sense. You can build a data centre for £1m and fit it out yourself and our monthly costs over the next year will mean that making that kind of investment to allow us the flexibility to build a network infrastructure that we need,” he said.

“Spending a few million on a data centre is not insignificant, but we have raised a lot of capital in the past and will in the future.”

The start-up model of today dictates that if you want compute and storage power, you put it on somebody else’s cloud – the days of the money men paying for your servers went out with the dot-com bust in 2001. Halstead, though, is confident the Twitter and Wikipedia business has the kinds of legs that will ensure the data centre costs aren’t just covered, but buried. DataSift has raised $15m from venture capitalists so far, and Halstead’s plan is to be a $1bn business in four-to-five years' time, he told us.

That $1bn will come from continued demand for data mining of the web, he says.

Halstead has been swimming in the tweet end of the data pool for years: he founded TweetMeme, which aggregates popular links to determine the most popular. It grew to have an audience of millions and invented the famous Retweet button – Retweet serves more than 30 billion clicks per month on Twitter.

DataSift began in RSS aggregation. It then moved into analysis, monitoring blogs, news feeds, YouTube, Amazon and comments on services like IMDB. Now, this year, it has added aggregation and analysis of Twitter and Wikipedia.

In February, DataSift announced a first of its kind: a service that allowed paying companies to search people’s tweets going back two years. DataSift swallows 250 million tweets a day, peaking at 120,000 tweets per second, and performs more than 250 million sentiment analyses, with sub-100 millisecond latency. Twitter partners with just one other company on tweet analysis.

Watching Twitter watching you

DataSift's clients are searching Twitter for variety of reasons, but mostly trend analysis and future predictions. They even include those in financial markets looking to find out what people are saying about potential stocks – 30 per cent of DataSift’s clients work in financial services. Using Twitter, advertisers in the media can see what people are really doing while watching TV and during ads breaks by monitoring hash tags, instead of relying on the time-honoured and flawed set-top-box approach.

Wikistats is a similar service, only it doesn't only look for trends, but also analyses edits and deletions as well as discerning the author’s IP. Halstead’s vision is to provide a stats platform that start-ups can build tools for.

DataSift has one major advantage: access to the Twitter firehose. It took the firm just three weeks to shift the entire, two-year corpus of Twitter to its servers. Not everybody has the time, the network capacity or even the money to pay for such a shift. Plus, with every tweet, the volume of information grows. Also, it helps that DataSift is one of just two companies granted access to the firehose by Twitter.

“We are doing something meaningful with big data,” Halstead said. “We have the technology, Hadoop and the data. A lot of companies come to us who want to buy data sets off us.”

Having its own data centre certainly won't hurt either. ®

Choosing a cloud hosting partner with confidence

More from The Register

next story
Azure TITSUP caused by INFINITE LOOP
Fat fingered geo-block kept Aussies in the dark
NASA launches new climate model at SC14
75 days of supercomputing later ...
Yahoo! blames! MONSTER! email! OUTAGE! on! CUT! CABLE! bungle!
Weekend woe for BT as telco struggles to restore service
You think the CLOUD's insecure? It's BETTER than UK.GOV's DATA CENTRES
We don't even know where some of them ARE – Maude
DEATH by COMMENTS: WordPress XSS vuln is BIGGEST for YEARS
Trio of XSS turns attackers into admins
Cloud unicorns are extinct so DiData cloud mess was YOUR fault
Applications need to be built to handle TITSUP incidents
BOFH: WHERE did this 'fax-enabled' printer UPGRADE come from?
Don't worry about that cable, it's part of the config
Astro-boffins start opening universe simulation data
Got a supercomputer? Want to simulate a universe? Here you go
prev story

Whitepapers

Why cloud backup?
Combining the latest advancements in disk-based backup with secure, integrated, cloud technologies offer organizations fast and assured recovery of their critical enterprise data.
A strategic approach to identity relationship management
ForgeRock commissioned Forrester to evaluate companies’ IAM practices and requirements when it comes to customer-facing scenarios versus employee-facing ones.
5 critical considerations for enterprise cloud backup
Key considerations when evaluating cloud backup solutions to ensure adequate protection security and availability of enterprise data.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.
Getting ahead of the compliance curve
Learn about new services that make it easy to discover and manage certificates across the enterprise and how to get ahead of the compliance curve.