'Amazon can't do what we do': Twitter-miner's BYO data centre heresy
DataSift strains with Hadoop
Halstead told The Reg: “Hadoop especially puts incredible strain on networking switches, and the cost and level of management required is beyond even the best data centres. We have trained Cisco operations people ourselves and as we grow the complexity of network infrastructure will require that we move completely away from 'managed' and to having full-time staff looking after our own hardware.
“The costs of just co-lo and managed at larger scales do not make sense. You can build a data centre for £1m and fit it out yourself, and our monthly costs over the next year will mean that making that kind of investment [makes sense] to allow us the flexibility to build a network infrastructure that we need,” he said.
“Spending a few million on a data centre is not insignificant, but we have raised a lot of capital in the past and will in the future.”
The start-up model of today dictates that if you want compute and storage power, you put it on somebody else’s cloud – the days of the money men paying for your servers went out with the dot-com bust in 2001. Halstead, though, is confident the Twitter and Wikipedia business has the kinds of legs that will ensure the data centre costs aren’t just covered, but buried. DataSift has raised $15m from venture capitalists so far, and Halstead’s plan is to be a $1bn business in four-to-five years' time, he told us.
That $1bn will come from continued demand for data mining of the web, he says.
Halstead has been swimming in the tweet end of the data pool for years: he founded TweetMeme, which aggregates links to determine the most popular. It grew to have an audience of millions and invented the famous Retweet button, which serves more than 30 billion clicks per month on Twitter.
DataSift began in RSS aggregation. It then moved into analysis, monitoring blogs, news feeds, YouTube, Amazon and comments on services like IMDB. Now, this year, it has added aggregation and analysis of Twitter and Wikipedia.
In February, DataSift announced a first of its kind: a service that allowed paying companies to search people’s tweets going back two years. DataSift swallows 250 million tweets a day, peaking at 120,000 tweets per second, and performs more than 250 million sentiment analyses, with sub-100 millisecond latency. Twitter partners with just one other company on tweet analysis.
Watching Twitter watching you
DataSift's clients are searching Twitter for a variety of reasons, but mostly for trend analysis and future predictions. They include those in financial markets looking to find out what people are saying about potential stocks: 30 per cent of DataSift’s clients work in financial services. Using Twitter, advertisers in the media can monitor hashtags to see what people are really doing while watching TV and during ad breaks, instead of relying on the time-honoured and flawed set-top-box approach.
Wikistats is a similar service, but it doesn't just look for trends: it also analyses edits and deletions, and discerns the author’s IP address. Halstead’s vision is to provide a stats platform that start-ups can build tools for.
DataSift has one major advantage: it is one of just two companies that Twitter has granted access to its firehose. It took the firm just three weeks to shift the entire two-year corpus of Twitter to its servers. Not everybody has the time, the network capacity or even the money to pay for such a shift. Plus, with every tweet, the volume of information grows.
“We are doing something meaningful with big data,” Halstead said. “We have the technology, Hadoop and the data. A lot of companies come to us who want to buy data sets off us.”
Having its own data centre certainly won't hurt either. ®