
'Amazon can't do what we do': Twitter-miner's BYO data centre heresy

DataSift strains with Hadoop

Hadoop strain

Halstead told The Reg: “Hadoop especially puts incredible strain on networking switches, and the cost and level of management required is beyond even the best data centres. We have trained Cisco operations people ourselves and as we grow the complexity of network infrastructure will require that we move completely away from 'managed' and to having full-time staff looking after our own hardware.

“The costs of just co-lo and managed at larger scales do not make sense. You can build a data centre for £1m and fit it out yourself, and our monthly costs over the next year will mean making that kind of investment, to allow us the flexibility to build the network infrastructure that we need,” he said.

“Spending a few million on a data centre is not insignificant, but we have raised a lot of capital in the past and will in the future.”

The start-up model of today dictates that if you want compute and storage power, you put it on somebody else’s cloud – the days of the money men paying for your servers went out with the dot-com bust in 2001. Halstead, though, is confident the Twitter and Wikipedia business has the kinds of legs that will ensure the data centre costs aren’t just covered, but buried. DataSift has raised $15m from venture capitalists so far, and Halstead’s plan is to be a $1bn business in four-to-five years' time, he told us.

That $1bn will come from continued demand for data mining of the web, he says.

Halstead has been swimming in the tweet end of the data pool for years: he founded TweetMeme, which aggregates links to determine the most popular. It grew to have an audience of millions and invented the famous Retweet button – Retweet serves more than 30 billion clicks per month on Twitter.

DataSift began in RSS aggregation. It then moved into analysis, monitoring blogs, news feeds, YouTube, Amazon and comments on services like IMDB. Now, this year, it has added aggregation and analysis of Twitter and Wikipedia.

In February, DataSift announced a first of its kind: a service that allowed paying companies to search people’s tweets going back two years. DataSift swallows 250 million tweets a day, peaking at 120,000 tweets per second, and performs more than 250 million sentiment analyses, with sub-100 millisecond latency. Twitter partners with just one other company on tweet analysis.
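For a rough sense of scale, the two throughput figures quoted above can be set against each other. The sketch below is back-of-the-envelope arithmetic only, using the article's numbers; the variable names are illustrative and nothing here reflects how DataSift actually measures its load.

```python
# Back-of-the-envelope comparison of the quoted daily volume and peak rate.
# Both input figures come from the article; everything else is derived.
SECONDS_PER_DAY = 24 * 60 * 60

daily_tweets = 250_000_000   # tweets ingested per day (quoted figure)
peak_per_second = 120_000    # peak tweets per second (quoted figure)

# Average rate if the daily volume were spread evenly across the day.
average_per_second = daily_tweets / SECONDS_PER_DAY

# How many times higher the quoted peak is than that even-spread average.
peak_to_average = peak_per_second / average_per_second

print(f"average: {average_per_second:,.0f} tweets/s")
print(f"peak is {peak_to_average:.1f}x the average")
```

On those figures the average works out at roughly 2,900 tweets per second, so the quoted peak is about 41 times the even-spread rate, which gives some idea of the bursts the network has to be built for.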

Watching Twitter watching you

DataSift's clients are searching Twitter for a variety of reasons, but mostly trend analysis and future predictions. They even include those in financial markets looking to find out what people are saying about potential stocks – 30 per cent of DataSift’s clients work in financial services. Using Twitter, advertisers in the media can see what people are really doing while watching TV and during ad breaks by monitoring hashtags, instead of relying on the time-honoured and flawed set-top-box approach.

Wikistats is a similar service, but it doesn't just look for trends: it also analyses edits and deletions and discerns the author’s IP. Halstead’s vision is to provide a stats platform that start-ups can build tools for.

DataSift has one major advantage: access to the Twitter firehose. It took the firm just three weeks to shift the entire, two-year corpus of Twitter to its servers. Not everybody has the time, the network capacity or even the money to pay for such a shift. Plus, with every tweet, the volume of information grows. Also, it helps that DataSift is one of just two companies granted access to the firehose by Twitter.

“We are doing something meaningful with big data,” Halstead said. “We have the technology, Hadoop and the data. A lot of companies come to us who want to buy data sets off us.”

Having its own data centre certainly won't hurt either. ®
