'Amazon can't do what we do': Twitter-miner's BYO data centre heresy
DataSift strains with Hadoop
Sometimes floating on somebody else’s cloud isn’t enough. Sometimes you just have to float alone – no matter how young you are. DataSift, the five-year-old big data company mining billions of tweets and Wikipedia edits, reckons it’s just one year away from building its own data centre.
DataSift sucks down 2TB of data from Twitter each day, and has two-and-a-half years' worth of Twitter data – 90 billion tweets – sitting on Hadoop servers. DataSift has also launched Wikistats, tracking trends on Jimmy "Jimbo" Wales’ crowd-surfing site. Wikistats records edits, peaking at up to 100 a second.
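For a rough sense of scale, the figures above can be turned into sustained rates. This is only a back-of-envelope sketch: it assumes a decimal terabyte (10^12 bytes) and 365-day years, neither of which the article specifies, and the function names are made up for illustration.

```python
# Back-of-envelope rates from the figures quoted in the article.
# Assumptions (not from the article): 1 TB = 10**12 bytes, 365-day years.
SECONDS_PER_DAY = 24 * 60 * 60


def average_mbit_per_second(bytes_per_day: float) -> float:
    """Sustained line rate (Mbit/s) needed to move bytes_per_day in 24 hours."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


def average_per_day(total_items: float, years: float) -> float:
    """Average daily item count for an archive spanning the given years."""
    return total_items / (years * 365)


ingest_mbit = average_mbit_per_second(2e12)           # 2TB a day from Twitter
archive_tweets_per_day = average_per_day(90e9, 2.5)   # 90 billion tweets, 2.5 years

print(f"Average ingest rate: {ingest_mbit:.1f} Mbit/s")
print(f"Average archived tweets per day: {archive_tweets_per_day / 1e6:.1f} million")
```

Under those assumptions the daily 2TB works out to roughly 185 Mbit/s around the clock, and the archive averages just under 100 million tweets a day. These are averages only and say nothing about peaks.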
Nick Halstead, DataSift's founder and chief technology officer, reckons the cost and complexity of his current co-located and mixed set-up means a data centre is on the cards – and soon. He ruled out a move to a public cloud on grounds of performance and cost.
“You can’t run what we run on Amazon from a cost and performance perspective,” he told The Reg during an interview.
DataSift wouldn’t be the first company working at what’s called “web scale” to build its own data centre, but it is possibly the youngest, the smallest (30 employees) and probably the only tech venture in today’s environment doing so with the potential assistance of venture capital.
Facebook was founded in 2004 and has just spent hundreds of millions building its own centres in Oregon, North Carolina and Sweden, although it still uses third parties in California and Virginia. Twitter, founded in 2006, last year picked Utah for its first data centre. eBay, hailing from the dot-com era, is building a $287m data centre, also in Utah.
But why would they do this, when those pushing public clouds – such as Salesforce – are so emphatic that in this era of cheap and (ahem, Amazon) reliable data centres, building your own no longer makes financial or organisational sense?
Owning your own can mean lower costs in the long run with access to cheaper power, custom designed cooling and servers, and abundant capacity for expansion.
In DataSift’s case, it also means consolidation and sanity, with a potentially simpler network infrastructure that comes at a lower cost.
DataSift has its own 10 Hewlett-Packard racks plus 240 Dell racks run by Pulsant at two data centres in Reading, near Microsoft. The servers have 936 CPU cores and data filtering nodes can process up to 10,000 unique streams to keep up with what’s being said and deliver results.
Halstead has additional racks in reserve, ready to deploy, but reckons he already spends “a lot” of money on hardware. The real problem, Halstead says, isn’t the cost of rack space but what he calls “very complex” networking. DataSift uses the open-source Java Hadoop framework to process and serve terabytes of tweets and Wiki updates across its distributed, clustered servers. Hadoop means speed, but it’s never been a pushover to install and administer, as founder Doug Cutting told us.
Halstead told The Reg: “Hadoop especially puts incredible strain on networking switches, and the cost and level of management required is beyond even the best data centres. We have trained Cisco operations people ourselves and as we grow the complexity of network infrastructure will require that we move completely away from 'managed' and to having full-time staff looking after our own hardware.
“The costs of just co-lo and managed at larger scales do not make sense. You can build a data centre for £1m and fit it out yourself and our monthly costs over the next year will mean that making that kind of investment to allow us the flexibility to build a network infrastructure that we need,” he said.
“Spending a few million on a data centre is not insignificant, but we have raised a lot of capital in the past and will in the future.”
The start-up model of today dictates that if you want compute and storage power, you put it on somebody else’s cloud – the days of the money men paying for your servers went out with the dot-com bust in 2001. Halstead, though, is confident the Twitter and Wikipedia business has the kinds of legs that will ensure the data centre costs aren’t just covered, but buried. DataSift has raised $15m from venture capitalists so far, and Halstead’s plan is to be a $1bn business in four-to-five years' time, he told us.
That $1bn will come from continued demand for data mining of the web, he says.
Halstead has been swimming in the tweet end of the data pool for years: he founded TweetMeme, which aggregates links to determine the most popular. It grew to have an audience of millions and invented the famous Retweet button – Retweet serves more than 30 billion clicks per month on Twitter.
DataSift began in RSS aggregation. It then moved into analysis, monitoring blogs, news feeds, YouTube, Amazon and comments on services like IMDB. Now, this year, it has added aggregation and analysis of Twitter and Wikipedia.
In February, DataSift announced a first of its kind: a service that allowed paying companies to search people’s tweets going back two years. DataSift swallows 250 million tweets a day, peaking at 120,000 tweets per second, and performs more than 250 million sentiment analyses, with sub-100 millisecond latency. Twitter partners with just one other company on tweet analysis.
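The gap between the daily total and the quoted peak is worth spelling out, because the platform has to be sized for bursts rather than the average. The sketch below uses only the two figures above; the function name is illustrative and the even spread across 24 hours is a simplifying assumption.

```python
# Compare the average tweet rate implied by the daily total with the quoted peak.
# Simplifying assumption: the daily total is spread evenly over 24 hours.
SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(items_per_day: float) -> float:
    """Average per-second rate implied by a daily total."""
    return items_per_day / SECONDS_PER_DAY


tweets_per_day = 250_000_000        # daily intake quoted in the article
peak_tweets_per_second = 120_000    # peak rate quoted in the article

avg_tweets_per_second = average_per_second(tweets_per_day)
peak_to_average = peak_tweets_per_second / avg_tweets_per_second

print(f"Average rate: {avg_tweets_per_second:.0f} tweets/s")
print(f"Peak is about {peak_to_average:.1f}x the average")
```

On those numbers the average is a little under 2,900 tweets a second, so the quoted peak is roughly 41 times the steady-state load.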
Watching Twitter watching you
DataSift's clients are searching Twitter for a variety of reasons, but mostly trend analysis and future predictions. They even include those in financial markets looking to find out what people are saying about potential stocks – 30 per cent of DataSift’s clients work in financial services. Using Twitter, advertisers in the media can see what people are really doing while watching TV and during ad breaks by monitoring hashtags, instead of relying on the time-honoured and flawed set-top-box approach.
Wikistats is a similar service, but it goes beyond spotting trends: it also analyses edits and deletions, and discerns the author’s IP. Halstead’s vision is to provide a stats platform that start-ups can build tools for.
DataSift has one major advantage: access to the Twitter firehose, which Twitter has granted to just two companies. It took the firm just three weeks to shift the entire two-year corpus of Twitter to its servers. Not everybody has the time, the network capacity or even the money to pay for such a shift. Plus, with every tweet, the volume of information grows.
“We are doing something meaningful with big data,” Halstead said. “We have the technology, Hadoop and the data. A lot of companies come to us who want to buy data sets off us.”
Having its own data centre certainly won't hurt either. ®