Feeds

'Amazon can't do what we do': Twitter-miner's BYO data centre heresy

DataSift strains with Hadoop

Top 5 reasons to deploy VMware with Tegile

Hadoop strain

Halstead told The Reg: “Hadoop especially puts incredible strain on networking switches, and the cost and level of management required is beyond even the best data centres. We have trained Cisco operations people ourselves and as we grow the complexity of network infrastructure will require that we move completely away from 'managed' and to having full-time staff looking after our own hardware.

“The costs of just co-lo and managed at larger scales do not make sense. You can build a data centre for £1m and fit it out yourself and our monthly costs over the next year will mean that making that kind of investment to allow us the flexibility to build a network infrastructure that we need,” he said.

“Spending a few million on a data centre is not insignificant, but we have raised a lot of capital in the past and will in the future.”

The start-up model of today dictates that if you want compute and storage power, you put it on somebody else’s cloud – the days of the money men paying for your servers went out with the dot-com bust in 2001. Halstead, though, is confident the Twitter and Wikipedia business has the kinds of legs that will ensure the data centre costs aren’t just covered, but buried. DataSift has raised $15m from venture capitalists so far, and Halstead’s plan is to be a $1bn business in four-to-five years' time, he told us.

That $1bn will come from continued demand for data mining of the web, he says.

Halstead has been swimming in the tweet end of the data pool for years: he founded TweetMeme, which aggregates popular links to determine the most popular. It grew to have an audience of millions and invented the famous Retweet button – Retweet serves more than 30 billion clicks per month on Twitter.

DataSift began in RSS aggregation. It then moved into analysis, monitoring blogs, news feeds, YouTube, Amazon and comments on services like IMDB. Now, this year, it has added aggregation and analysis of Twitter and Wikipedia.

In February, DataSift announced a first of its kind: a service that allowed paying companies to search people’s tweets going back two years. DataSift swallows 250 million tweets a day, peaking at 120,000 tweets per second, and performs more than 250 million sentiment analyses, with sub-100 millisecond latency. Twitter partners with just one other company on tweet analysis.

Watching Twitter watching you

DataSift's clients are searching Twitter for variety of reasons, but mostly trend analysis and future predictions. They even include those in financial markets looking to find out what people are saying about potential stocks – 30 per cent of DataSift’s clients work in financial services. Using Twitter, advertisers in the media can see what people are really doing while watching TV and during ads breaks by monitoring hash tags, instead of relying on the time-honoured and flawed set-top-box approach.

Wikistats is a similar service, only it doesn't only look for trends, but also analyses edits and deletions as well as discerning the author’s IP. Halstead’s vision is to provide a stats platform that start-ups can build tools for.

DataSift has one major advantage: access to the Twitter firehose. It took the firm just three weeks to shift the entire, two-year corpus of Twitter to its servers. Not everybody has the time, the network capacity or even the money to pay for such a shift. Plus, with every tweet, the volume of information grows. Also, it helps that DataSift is one of just two companies granted access to the firehose by Twitter.

“We are doing something meaningful with big data,” Halstead said. “We have the technology, Hadoop and the data. A lot of companies come to us who want to buy data sets off us.”

Having its own data centre certainly won't hurt either. ®

Beginner's guide to SSL certificates

More from The Register

next story
It's Big, it's Blue... it's simply FABLESS! IBM's chip-free future
Or why the reversal of globalisation ain't gonna 'appen
'Hmm, why CAN'T I run a water pipe through that rack of media servers?'
Leaving Las Vegas for Armenia kludging and Dubai dune bashing
Microsoft and Dell’s cloud in a box: Instant Azure for the data centre
A less painful way to run Microsoft’s private cloud
Facebook slurps 'paste sites' for STOLEN passwords, sprinkles on hash and salt
Zuck's ad empire DOESN'T see details in plain text. Phew!
CAGE MATCH: Microsoft, Dell open co-located bit barns in Oz
Whole new species of XaaS spawning in the antipodes
AWS pulls desktop-as-a-service from the PC
Support for PCoIP protocol means zero clients can run cloudy desktops
prev story

Whitepapers

Cloud and hybrid-cloud data protection for VMware
Learn how quick and easy it is to configure backups and perform restores for VMware environments.
A strategic approach to identity relationship management
ForgeRock commissioned Forrester to evaluate companies’ IAM practices and requirements when it comes to customer-facing scenarios versus employee-facing ones.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?
Three 1TB solid state scorchers up for grabs
Big SSDs can be expensive but think big and think free because you could be the lucky winner of one of three 1TB Samsung SSD 840 EVO drives that we’re giving away worth over £300 apiece.
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.