We don't want your crap databases, says Twitter: We've made OUR OWN

Secret weapon whipped out of Manhattan Project – and we've taken a closer look

Application security programs and practises

Exclusive Twitter is growing up and, like an adult, is beginning to desire consistent guarantees about its data rather than instant availability.

At least that's the emphasis placed by the company on its new "Manhattan" data management software, a bedrock storage system that was revealed in a blog post on Wednesday.

What's not mentioned is that Twitter is developing something it calls secondary indexes, The Register can reveal. This is a potentially powerful feature within Manhattan that will give its employees greater flexibility over how they search through the social network's vast stores of data, arming the publicly listed company with another weapon for its commercial endeavors.

First, here's an overview of what Twitter told the world this week.

"Over the last few years, we found ourselves in need of a storage system that could serve millions of queries per second, with extremely low latency in a real-time environment. Availability and speed of the system became the utmost important factor. Not only did it need to be fast; it needed to be scalable across several regions around the world," Twitter wrote on its website.

Manhattan is software built by the social network's engineers to cope with the roughly 6,000 tweets that flood into its system every second. Though 6,000 tweets is not that much data, the messages hold a lot of complexity, as Manhattan also needs to handle the mesh of replies and retweets per tweet – a tricky problem when a celeb with millions of followers makes a remark and is immediately inundated with responses.

With this technology, which may eventually become open source, Twitter has been able to move away from a prior system that used Cassandra for eventual consistency and additional tools for strong consistency to a single, hulking system that does both. Manhattan has been in production for over a year, we understand.

Developers can select the consistency of data when reading from or writing to Manhattan, allowing them to create new services with varying tradeoffs between availability (how quickly something can be accessed) and consistency (how sure you are of the results of a query).

Because of this, Twitter's programmers can access a "Strong Consistency service" which uses a consensus algorithm paired with a replicated log to make sure that in-order events reach replicates.

So far, Twitter offers LOCAL_CAS (strong consistency within a single data center) and GLOBAL_CAS (strong consistency across multiple facilities). These will have "different tradeoffs when it comes to latency and data modeling for the application," Twitter noted in a blog post on Manhattan.

We have to go deeper

Data from the system is stored on three different systems: seadb is a read-only file format, sstable is a log-structured merge tree for heavy-write workloads and btree is a heavy-read and light-write system. The Manhattan "Core" system then decides whether to place information on spinning disks, memory, or solid-state flash (SSDs).

Manhattan can match incoming data to its most appropriate format. The output of Hadoop workloads, for example, can be fed into Manhattan from an Hadoop File System, and the software will transform that information into seadb files "so they can then be imported into the cluster for fast serving from SSDs or memory," Twitter explains.

The database supports multi-tenancy and contains its own rate-limiting service to stop Twitter's developers flooding the system with requests. These systems are wrapped in a front-end that, Twitter says, gives its engineers access to a "self-service" storage system.

"Engineers can provision what their application needs (storage size, queries per second, etc) and start using storage in seconds without having to wait for hardware to be installed or for schemas to be set up," Twitter wrote, describing the system.

Twitter has plans to publish a white paper outlining the technology in the future, and may even publish the database as open source. The latter may take a while though, as one former Twitter engineer said in a post to an online message board: "It's got a lot of moving parts and internal integrations, however, so it's got to be a ton of work to make open."

The future? Secondary indexes

As for further development, the company is working to implement secondary indexes, we've discovered.

Secondary indexes let developers add an additional set of range keys to the index for a database, dramatically speeding up the way developers can navigate through and ask questions of large amounts of data.

Amazon Web Services's major DynamoDB product, for instance, implements this technology in a local and global (multi-data center) format.

By adding secondary indexes into Manhattan, Twitter will give its developers the ability to write more sophisticated queries against their large data sets, the indexes of which will be stored on memory.

This means, for instance, that Twitter's commercial arm could build a fast ad system that could present adverts in near-real-time to people according to a multitude of different factors at a lower cost and higher level of flexibility than before.

Systems like secondary indexes will be crucial to the rollout of more advanced, granular, advertising display options, and are therefore a critical tool in Twitter's commercial arsenal.

There are challenges, though: secondary indexes will initially be built to offer consistency just locally within a data center, because generating global secondary indexes is computationally impractical. ®

Bridging the IT gap between rising business demands and ageing tools

More from The Register

next story
Apple fanbois SCREAM as update BRICKS their Macbook Airs
Ragegasm spills over as firmware upgrade kills machines
Auntie remains MYSTIFIED by that weekend BBC iPlayer and website outage
Still doing 'forensics' on the caching layer – Beeb digi wonk
Attack of the clones: Oracle's latest Red Hat Linux lookalike arrives
Oracle's Linux boss says Larry's Linux isn't just for Oracle apps anymore
THUD! WD plonks down SIX TERABYTE 'consumer NAS' fatboy
Now that's a LOT of porn or pirated movies. Or, you know, other consumer stuff
EU's top data cops to meet Google, Microsoft et al over 'right to be forgotten'
Plan to hammer out 'coherent' guidelines. Good luck chaps!
US judge: YES, cops or feds so can slurp an ENTIRE Gmail account
Crooks don't have folders labelled 'drug records', opines NY beak
Manic malware Mayhem spreads through Linux, FreeBSD web servers
And how Google could cripple infection rate in a second
prev story


Top three mobile application threats
Prevent sensitive data leakage over insecure channels or stolen mobile devices.
Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
Top 8 considerations to enable and simplify mobility
In this whitepaper learn how to successfully add mobile capabilities simply and cost effectively.
Application security programs and practises
Follow a few strategies and your organization can gain the full benefits of open source and the cloud without compromising the security of your applications.
The Essential Guide to IT Transformation
ServiceNow discusses three IT transformations that can help CIO's automate IT services to transform IT and the enterprise.