We don't want your crap databases, says Twitter: We've made OUR OWN

Secret weapon whipped out of Manhattan Project – and we've taken a closer look


Exclusive Twitter is growing up and, like an adult, is beginning to desire consistent guarantees about its data rather than instant availability.

At least that's the emphasis placed by the company on its new "Manhattan" data management software, a bedrock storage system that was revealed in a blog post on Wednesday.

What's not mentioned is that Twitter is developing something it calls secondary indexes, The Register can reveal. This is a potentially powerful feature within Manhattan that will give its employees greater flexibility over how they search through the social network's vast stores of data, arming the publicly listed company with another weapon for its commercial endeavors.

First, here's an overview of what Twitter told the world this week.

"Over the last few years, we found ourselves in need of a storage system that could serve millions of queries per second, with extremely low latency in a real-time environment. Availability and speed of the system became the utmost important factor. Not only did it need to be fast; it needed to be scalable across several regions around the world," Twitter wrote on its website.

Manhattan is software built by the social network's engineers to cope with the roughly 6,000 tweets that flood into its system every second. Though 6,000 tweets is not that much data, the messages hold a lot of complexity, as Manhattan also needs to handle the mesh of replies and retweets per tweet – a tricky problem when a celeb with millions of followers makes a remark and is immediately inundated with responses.

With this technology, which may eventually become open source, Twitter has been able to move away from a prior system that used Cassandra for eventual consistency and additional tools for strong consistency to a single, hulking system that does both. Manhattan has been in production for over a year, we understand.

Developers can select the consistency level each time they read from or write to Manhattan, allowing them to create new services with varying tradeoffs between availability (whether, and how quickly, a request gets an answer) and consistency (how sure you can be that the answer reflects the latest write).
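To make that tradeoff concrete, here is a minimal sketch of per-request consistency in Python. Twitter has not published Manhattan's client API, so the class names and the `w` and `r` parameters are our own invention, and the quorum scheme shown is the textbook approach rather than anything Twitter has confirmed it uses.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the data, mapping key -> (version, value)."""
    data: dict = field(default_factory=dict)

    def write(self, key, version, value):
        # Keep only the newest version of each key.
        if version > self.data.get(key, (0, None))[0]:
            self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))


class TunableStore:
    """Hypothetical store with n replicas; NOT Manhattan's real API."""

    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]
        self.version = 0

    def put(self, key, value, w):
        # Only the first `w` replicas acknowledge the write; the rest lag.
        # A real system would repair the laggards in the background.
        self.version += 1
        for replica in self.replicas[:w]:
            replica.write(key, self.version, value)

    def get(self, key, r):
        # Worst case for the reader: ask the `r` replicas least likely
        # to have seen the write, then return the newest answer.
        answers = [replica.read(key) for replica in self.replicas[-r:]]
        return max(answers, key=lambda answer: answer[0])[1]


store = TunableStore(n=3)
store.put("bio", "old", w=3)
store.put("bio", "new", w=2)
fast_read = store.get("bio", r=1)  # one replica: quick, may be stale
safe_read = store.get("bio", r=2)  # r + w > n: overlaps the last write
```

With three replicas, a read of two overlaps a write of two, so `safe_read` always sees the latest value, while `fast_read` touches a single replica and can come back stale.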

For workloads that need it, Twitter's programmers can use a "Strong Consistency service", which pairs a consensus algorithm with a replicated log to make sure events reach every replica in the same order.
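The replicated-log half of that design can be sketched briefly. This is an illustration under stated assumptions: the consensus step that agrees on each entry's position is left out entirely, and `ReplicaLog` and its methods are hypothetical names, not Twitter's.

```python
class ReplicaLog:
    """Hypothetical replica that applies log entries strictly in index order."""

    def __init__(self):
        self.applied = []    # events applied so far, in log order
        self.pending = {}    # index -> event, received early and held back
        self.next_index = 0  # the next log position this replica may apply

    def receive(self, index, event):
        # Assumes a consensus round (not shown) already fixed `index`.
        if index < self.next_index:
            return  # duplicate delivery of an entry we already applied
        self.pending[index] = event
        # Apply every entry that is now contiguous with what we have.
        while self.next_index in self.pending:
            self.applied.append(self.pending.pop(self.next_index))
            self.next_index += 1


replica = ReplicaLog()
replica.receive(1, "retweet")  # arrives early, so it is held back
replica.receive(0, "tweet")    # fills the gap; both entries now apply
replica.receive(0, "tweet")    # duplicate, ignored
```

However the network shuffles deliveries, every replica ends up applying the same events in the same order, which is the property the log exists to provide.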

So far, Twitter offers LOCAL_CAS (strong consistency within a single data center) and GLOBAL_CAS (strong consistency across multiple facilities). These will have "different tradeoffs when it comes to latency and data modeling for the application," Twitter noted in a blog post on Manhattan.
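The names suggest compare-and-set semantics, though Twitter's post does not spell out the call signature. The sketch below assumes that reading: `CasStore` and `compare_and_set` are hypothetical, and a single in-process lock stands in for the consensus machinery that would do the job within one data center (LOCAL_CAS) or across several (GLOBAL_CAS).

```python
import threading


class CasStore:
    """Hypothetical key-value store offering compare-and-set."""

    def __init__(self):
        self._data = {}
        # Stand-in for consensus: one lock serialises all updates.
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def compare_and_set(self, key, expected, new):
        """Write `new` only if the current value equals `expected`."""
        with self._lock:
            if self._data.get(key) != expected:
                return False  # someone else got there first
            self._data[key] = new
            return True


store = CasStore()
first = store.compare_and_set("handle:reg", None, "alice")
second = store.compare_and_set("handle:reg", None, "bob")
```

Only one of two racing writers can win, which is the guarantee an application needs when, say, two users try to claim the same handle. The wider the group that must agree, the longer each call takes, hence the latency tradeoff Twitter mentions.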

We have to go deeper

Data in the system is stored by one of three storage engines: seadb is a read-only file format, sstable is a log-structured merge tree format for write-heavy workloads, and btree is a format for read-heavy, write-light workloads. The Manhattan "Core" system then decides whether to place information on spinning disks, in memory, or on solid-state flash (SSDs).
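For readers unfamiliar with log-structured merge trees, the following toy version shows why they suit write-heavy work. It is a general illustration of the technique, not Twitter's sstable engine: `TinyLSM` is a made-up name, the "disk" is a Python list, and compaction of old runs is omitted.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge tree: a memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}  # recent writes, held in memory
        self.runs = []      # flushed sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only ever touch memory, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted, never-modified run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search runs newest-first so the latest write wins.
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # memtable full: flushed as the first run
db.put("a", 3)  # newer value for "a" sits in the memtable
```

Writes never rewrite old data in place; the cost is pushed onto reads, which may have to check several runs. That is the opposite bargain to the one a B-tree strikes.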

Manhattan can match incoming data to its most appropriate format. The output of Hadoop workloads, for example, can be fed into Manhattan from a Hadoop file system, and the software will transform that information into seadb files "so they can then be imported into the cluster for fast serving from SSDs or memory," Twitter explains.

The database supports multi-tenancy and contains its own rate-limiting service to stop Twitter's developers flooding the system with requests. These systems are wrapped in a front-end that, Twitter says, gives its engineers access to a "self-service" storage system.
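Twitter does not say which algorithm its rate limiter uses. A token bucket is one common choice, and the sketch below shows it purely as an illustration; `TokenBucket`, its parameters, and the caller-supplied clock are all assumptions of ours.

```python
class TokenBucket:
    """Token-bucket limiter: `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = now         # time of the last refill

    def allow(self, now, cost=1.0):
        # Refill in proportion to the time elapsed since the last call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over quota: reject or queue the request


# One bucket per tenant: 1 request per second, bursts of up to 2.
bucket = TokenBucket(rate=1.0, capacity=2.0)
burst = [bucket.allow(now=0.0) for _ in range(3)]
later = bucket.allow(now=1.0)
```

Giving each tenant its own bucket means one team's runaway job exhausts only its own quota, which is the point of rate limiting in a multi-tenant store.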

"Engineers can provision what their application needs (storage size, queries per second, etc) and start using storage in seconds without having to wait for hardware to be installed or for schemas to be set up," Twitter wrote, describing the system.

Twitter has plans to publish a white paper outlining the technology in the future, and may even publish the database as open source. The latter may take a while though, as one former Twitter engineer said in a post to an online message board: "It's got a lot of moving parts and internal integrations, however, so it's got to be a ton of work to make open."

The future? Secondary indexes

As for further development, the company is working to implement secondary indexes, we've discovered.

Secondary indexes let developers define an extra set of range keys on a database alongside its primary key, so that queries on those keys no longer have to scan the whole data set. That dramatically speeds up the way developers can navigate through and ask questions of large amounts of data.
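A small sketch shows the idea. It is a generic in-memory illustration, not Manhattan's design, which Twitter has not described; `IndexedTable` and its methods are hypothetical names.

```python
from collections import defaultdict


class IndexedTable:
    """Rows keyed by primary key, plus a secondary index on one field."""

    def __init__(self, indexed_field):
        self.rows = {}
        self.indexed_field = indexed_field
        self.index = defaultdict(set)  # field value -> set of primary keys

    def put(self, pk, row):
        # Keep the index in step with the rows on every write.
        old = self.rows.get(pk)
        if old is not None:
            self.index[old[self.indexed_field]].discard(pk)
        self.rows[pk] = row
        self.index[row[self.indexed_field]].add(pk)

    def find_by_index(self, value):
        # One lookup, regardless of how many rows the table holds.
        return sorted(self.index.get(value, ()))

    def find_by_scan(self, value):
        # Without the index, every row has to be inspected.
        return sorted(
            pk for pk, row in self.rows.items()
            if row[self.indexed_field] == value
        )


tweets = IndexedTable(indexed_field="lang")
tweets.put(1, {"lang": "en", "text": "hello"})
tweets.put(2, {"lang": "fr", "text": "bonjour"})
tweets.put(3, {"lang": "en", "text": "morning"})
english = tweets.find_by_index("en")
```

The price is paid on writes: every update must also update the index, and keeping that index consistent gets harder once the data is spread across machines or data centers.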

Amazon Web Services' DynamoDB, for instance, offers the feature in two forms: local secondary indexes, which are scoped to a single partition key, and global secondary indexes, which span every partition of a table.

By adding secondary indexes to Manhattan, Twitter will give its developers the ability to write more sophisticated queries against their large data sets, with the indexes themselves held in memory.

This means, for instance, that Twitter's commercial arm could build a fast ad system that could present adverts in near-real-time to people according to a multitude of different factors at a lower cost and higher level of flexibility than before.

Systems like secondary indexes will be crucial to the rollout of more advanced, granular, advertising display options, and are therefore a critical tool in Twitter's commercial arsenal.

There are challenges, though: secondary indexes will initially be built to offer consistency just locally within a data center, because generating global secondary indexes is computationally impractical. ®
