We don't want your crap databases, says Twitter: We've made OUR OWN

Secret weapon whipped out of Manhattan Project – and we've taken a closer look


Exclusive Twitter is growing up and, like an adult, is beginning to desire consistency guarantees for its data rather than instant availability.

At least, that's the emphasis the company placed on its new "Manhattan" data management software, a bedrock storage system revealed in a blog post on Wednesday.

What's not mentioned is that Twitter is developing something it calls secondary indexes, The Register can reveal. This is a potentially powerful feature within Manhattan that will give its employees greater flexibility over how they search through the social network's vast stores of data, arming the publicly listed company with another weapon for its commercial endeavors.

First, here's an overview of what Twitter told the world this week.

"Over the last few years, we found ourselves in need of a storage system that could serve millions of queries per second, with extremely low latency in a real-time environment. Availability and speed of the system became the utmost important factor. Not only did it need to be fast; it needed to be scalable across several regions around the world," Twitter wrote on its website.

Manhattan is software built by the social network's engineers to cope with the roughly 6,000 tweets that flood into its system every second. Though 6,000 tweets is not that much data, the messages hold a lot of complexity, as Manhattan also needs to handle the mesh of replies and retweets per tweet – a tricky problem when a celeb with millions of followers makes a remark and is immediately inundated with responses.

With this technology, which may eventually become open source, Twitter has been able to move away from a prior system that used Cassandra for eventual consistency and additional tools for strong consistency to a single, hulking system that does both. Manhattan has been in production for over a year, we understand.

Developers can select the consistency of data when reading from or writing to Manhattan, allowing them to create new services with varying tradeoffs between availability (how quickly something can be accessed) and consistency (how sure you are of the results of a query).
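Twitter hasn't published Manhattan's client API, but the availability-versus-consistency tradeoff described above can be sketched with a toy multi-replica store in which the caller picks a consistency level per operation. All names here (`Consistency`, `Store`, and so on) are illustrative assumptions, not Manhattan's real interface:

```python
from enum import Enum

class Consistency(Enum):
    """Per-operation consistency levels (names are illustrative)."""
    EVENTUAL = "eventual"  # fastest: any replica may answer, possibly stale
    STRONG = "strong"      # slower: answer reflects all acknowledged writes

class Store:
    """Toy in-process stand-in for a store replicated across machines."""
    def __init__(self):
        self.replicas = [{}, {}, {}]  # three replica copies of the data
        self.pending = []             # eventual writes not yet replicated

    def write(self, key, value, consistency=Consistency.EVENTUAL):
        self.replicas[0][key] = value          # local replica takes it first
        if consistency is Consistency.STRONG:
            for r in self.replicas[1:]:        # replicate before acknowledging
                r[key] = value
        else:
            self.pending.append((key, value))  # replicate some time later

    def read(self, key, consistency=Consistency.EVENTUAL):
        if consistency is Consistency.STRONG:
            self._flush()                      # force replicas to agree first
        return self.replicas[-1].get(key)      # answer from a remote replica

    def _flush(self):
        for key, value in self.pending:
            for r in self.replicas[1:]:
                r[key] = value
        self.pending.clear()
```

An eventual read immediately after an eventual write can return stale data; a strong read pays the cost of synchronising the replicas first. That, in miniature, is the tradeoff Twitter says its developers tune per service.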

Because of this, Twitter's programmers can access a "Strong Consistency service," which uses a consensus algorithm paired with a replicated log to make sure in-order events reach replicas.

So far, Twitter offers LOCAL_CAS (strong consistency within a single data center) and GLOBAL_CAS (strong consistency across multiple facilities). These will have "different tradeoffs when it comes to latency and data modeling for the application," Twitter noted in a blog post on Manhattan.
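CAS here stands for compare-and-set: an update succeeds only if the stored value still matches what the writer last saw, which is what makes concurrent writers safe. A minimal sketch of that primitive, using a lock in place of Manhattan's consensus-and-replicated-log machinery (the `CasCell` class is our invention, not Twitter's):

```python
import threading

class CasCell:
    """Minimal compare-and-set cell: the semantics behind operations
    like LOCAL_CAS/GLOBAL_CAS, minus the distributed-consensus part."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        """Atomically set the value to `new` only if it still equals
        `expected`. Returns False if another writer got there first."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

    def get(self):
        with self._lock:
            return self._value
```

The difference between LOCAL_CAS and GLOBAL_CAS is then where that atomicity is enforced: within one data center's replicas, or across all of them, with the extra round trips (and latency) a global agreement implies.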

We have to go deeper

Data in the system is stored in one of three formats: seadb, a read-only file format; sstable, a log-structured merge tree for write-heavy workloads; and btree, for read-heavy, light-write workloads. The Manhattan "Core" system then decides whether to place information on spinning disks, in memory, or on solid-state flash (SSDs).
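To see why a log-structured merge tree suits write-heavy workloads: writes land in an in-memory buffer and are periodically flushed as immutable sorted runs, so the write path never rewrites existing data in place. A toy sketch of the idea (our own illustration, not Twitter's sstable implementation):

```python
import bisect

class TinyLSM:
    """Toy log-structured merge tree: buffer writes in memory, then
    flush sorted runs ("sstables") to a list standing in for disk."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []  # newest run last; each is sorted (key, value) pairs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:           # freshest data first
            return self.memtable[key]
        for run in reversed(self.sstables):  # then newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Reads have to consult the memtable and then each run from newest to oldest, which is why the same design would be a poor fit for the read-heavy workloads Twitter routes to btree instead.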

Manhattan can match incoming data to its most appropriate format. The output of Hadoop workloads, for example, can be fed into Manhattan from the Hadoop file system (HDFS), and the software will transform that information into seadb files "so they can then be imported into the cluster for fast serving from SSDs or memory," Twitter explains.

The database supports multi-tenancy and contains its own rate-limiting service to stop Twitter's developers from flooding the system with requests. These systems are wrapped in a front-end that, Twitter says, gives its engineers access to a "self-service" storage system.
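Twitter hasn't described how its rate limiter works, but a common design for capping a tenant's queries per second is a token bucket: each tenant earns tokens at a fixed rate up to a burst capacity, and each request spends one. A minimal sketch under that assumption:

```python
import time

class TokenBucket:
    """Simple token-bucket limiter of the kind a multi-tenant store
    might use to cap each tenant's request rate (illustrative only)."""
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Top up the bucket for the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over quota: caller should back off or be throttled
```

In a multi-tenant store, each provisioned application would get its own bucket sized to the queries-per-second figure it requested, which is what makes the "self-service" provisioning Twitter describes safe for its neighbours.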

"Engineers can provision what their application needs (storage size, queries per second, etc) and start using storage in seconds without having to wait for hardware to be installed or for schemas to be set up," Twitter wrote, describing the system.

Twitter has plans to publish a white paper outlining the technology in the future, and may even publish the database as open source. The latter may take a while though, as one former Twitter engineer said in a post to an online message board: "It's got a lot of moving parts and internal integrations, however, so it's got to be a ton of work to make open."

The future? Secondary indexes

As for further development, the company is working to implement secondary indexes, we've discovered.

Secondary indexes let developers add an additional set of range keys to a database's index, dramatically speeding up how they can navigate through and ask questions of large amounts of data.
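The idea is simple: alongside the primary key, the store maintains a second mapping from some attribute's values back to the primary keys that hold them, so a query on that attribute avoids a full table scan. A toy sketch with one secondary index (the `Table` class and field names are our illustration, not Manhattan's design):

```python
from collections import defaultdict

class Table:
    """Primary-key table plus one secondary index (illustrative).
    The index maps an attribute value to the set of primary keys
    whose rows carry it."""
    def __init__(self, indexed_field):
        self.rows = {}                 # primary key -> row dict
        self.indexed_field = indexed_field
        self.index = defaultdict(set)  # attribute value -> primary keys

    def put(self, pk, row):
        old = self.rows.get(pk)
        if old is not None:            # keep the index in step with updates
            self.index[old[self.indexed_field]].discard(pk)
        self.rows[pk] = row
        self.index[row[self.indexed_field]].add(pk)

    def query_by_index(self, value):
        # One hash lookup instead of scanning every row.
        return [self.rows[pk] for pk in self.index.get(value, ())]
```

The cost is that every write must now update the index as well as the row, which hints at why building and maintaining such indexes across multiple data centers is the hard part.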

Amazon Web Services' major DynamoDB product, for instance, implements this technology in both local and global (multi-data-center) forms.

By adding secondary indexes to Manhattan, Twitter will give its developers the ability to write more sophisticated queries against their large data sets, the indexes of which will be stored in memory.

This means, for instance, that Twitter's commercial arm could build a fast ad system that presents adverts in near-real-time, targeted on a multitude of factors, at lower cost and with greater flexibility than before.

Systems like secondary indexes will be crucial to the rollout of more advanced, granular, advertising display options, and are therefore a critical tool in Twitter's commercial arsenal.

There are challenges, though: secondary indexes will initially be built to offer consistency just locally within a data center, because generating global secondary indexes is computationally impractical. ®
