We don't want your crap databases, says Twitter: We've made OUR OWN

Secret weapon whipped out of Manhattan Project – and we've taken a closer look

Exclusive Twitter is growing up and, like an adult, is beginning to want consistency guarantees for its data rather than just instant availability.

At least that's the emphasis placed by the company on its new "Manhattan" data management software, a bedrock storage system that was revealed in a blog post on Wednesday.

What's not mentioned is that Twitter is developing something it calls secondary indexes, The Register can reveal. This is a potentially powerful feature within Manhattan that will give its employees greater flexibility over how they search through the social network's vast stores of data, arming the publicly listed company with another weapon for its commercial endeavors.

First, here's an overview of what Twitter told the world this week.

"Over the last few years, we found ourselves in need of a storage system that could serve millions of queries per second, with extremely low latency in a real-time environment. Availability and speed of the system became the utmost important factor. Not only did it need to be fast; it needed to be scalable across several regions around the world," Twitter wrote on its website.

Manhattan is software built by the social network's engineers to cope with the roughly 6,000 tweets that flood into its system every second. Though 6,000 tweets is not that much data, the messages hold a lot of complexity, as Manhattan also needs to handle the mesh of replies and retweets per tweet – a tricky problem when a celeb with millions of followers makes a remark and is immediately inundated with responses.

With this technology, which may eventually become open source, Twitter has been able to move away from a prior system that used Cassandra for eventual consistency and additional tools for strong consistency to a single, hulking system that does both. Manhattan has been in production for over a year, we understand.

Developers can select the consistency of data when reading from or writing to Manhattan, allowing them to create new services with varying tradeoffs between availability (whether, and how quickly, something can be accessed) and consistency (how sure you are of the results of a query).

To that end, Twitter's programmers can call on a "Strong Consistency service", which uses a consensus algorithm paired with a replicated log to make sure that events reach replicas in order.

So far, Twitter offers LOCAL_CAS (strong consistency within a single data center) and GLOBAL_CAS (strong consistency across multiple facilities). These will have "different tradeoffs when it comes to latency and data modeling for the application," Twitter noted in a blog post on Manhattan.
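To make the tradeoff concrete, here is a toy sketch of a compare-and-set write whose scope is chosen per request. Only the names LOCAL_CAS and GLOBAL_CAS come from Twitter's post; the ToyStore class, its methods, the data center names and the replication behaviour are all our own assumptions for illustration, not Manhattan's API.

```python
from enum import Enum


class Consistency(Enum):
    LOCAL_CAS = "local"    # name from Twitter's post; semantics here are assumed
    GLOBAL_CAS = "global"  # name from Twitter's post; semantics here are assumed


class ToyStore:
    """Hypothetical multi-data-center store; each data center is a plain dict."""

    def __init__(self, datacenters, local):
        self.replicas = {dc: {} for dc in datacenters}
        self.local = local
        self.pending = []  # local writes not yet copied to remote data centers

    def compare_and_set(self, key, expected, new, level):
        # Decide which replicas must agree before the write is accepted.
        if level is Consistency.LOCAL_CAS:
            scope = [self.local]
        else:
            scope = list(self.replicas)
        if any(self.replicas[dc].get(key) != expected for dc in scope):
            return False  # a replica in scope holds a different value
        for dc in scope:
            self.replicas[dc][key] = new
        if level is Consistency.LOCAL_CAS:
            # Remote data centers catch up later.
            self.pending.append((key, new))
        return True

    def replicate(self):
        """Copy queued local writes to every data center."""
        for key, value in self.pending:
            for dc in self.replicas:
                self.replicas[dc][key] = value
        self.pending.clear()


store = ToyStore(["dc-east", "dc-west"], local="dc-east")
ok_local = store.compare_and_set("user:1:name", None, "alice", Consistency.LOCAL_CAS)
stale_remote = store.replicas["dc-west"].get("user:1:name")  # not yet replicated
store.replicate()
ok_global = store.compare_and_set("user:1:name", "alice", "bob", Consistency.GLOBAL_CAS)
```

In this sketch the local write returns after touching one replica but leaves the remote copy briefly stale, while the global write has to check and update every data center before it succeeds. That is the kind of latency-versus-certainty choice Twitter describes.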

We have to go deeper

Data in the system is stored in one of three storage engines: seadb is a read-only file format, sstable is a log-structured merge tree format for write-heavy workloads, and btree is a format for read-heavy, write-light workloads. The Manhattan "Core" system then decides whether to place information on spinning disks, in memory, or on solid-state flash (SSDs).

Manhattan can match incoming data to its most appropriate format. The output of Hadoop workloads, for example, can be fed into Manhattan from a Hadoop file system, and the software will transform that information into seadb files "so they can then be imported into the cluster for fast serving from SSDs or memory," Twitter explains.

The database supports multi-tenancy and contains its own rate-limiting service to stop Twitter's developers flooding the system with requests. These systems are wrapped in a front-end that, Twitter says, gives its engineers access to a "self-service" storage system.

"Engineers can provision what their application needs (storage size, queries per second, etc) and start using storage in seconds without having to wait for hardware to be installed or for schemas to be set up," Twitter wrote, describing the system.
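Twitter has not said how its rate-limiting service works. One common way to enforce a provisioned queries-per-second figure per tenant is a token bucket, sketched below. The TokenBucket and TenantLimiter classes, the "ads" tenant and the injectable clock are all hypothetical, chosen by us to keep the example small and testable.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, sized from the queries per second it provisioned."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.buckets = {}

    def provision(self, tenant, qps):
        self.buckets[tenant] = TokenBucket(rate=qps, capacity=qps, clock=self.clock)

    def allow(self, tenant):
        return self.buckets[tenant].allow()


# A fake clock makes the behaviour repeatable for the example.
now = [0.0]
limiter = TenantLimiter(clock=lambda: now[0])
limiter.provision("ads", qps=2)
first_three = [limiter.allow("ads") for _ in range(3)]
now[0] = 0.5  # half a second later, one token has been refilled
after_wait = limiter.allow("ads")
```

Here a tenant that provisioned two queries per second gets two requests through, has the third rejected, and is allowed one more after half a second. Each tenant draws on its own bucket, so one noisy application cannot starve the others.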

Twitter has plans to publish a white paper outlining the technology in the future, and may even publish the database as open source. The latter may take a while though, as one former Twitter engineer said in a post to an online message board: "It's got a lot of moving parts and internal integrations, however, so it's got to be a ton of work to make open."

The future? Secondary indexes

As for further development, the company is working to implement secondary indexes, we've discovered.

Secondary indexes let developers add a further set of range keys to a database's index, so that data can be looked up by attributes other than its primary key. This can dramatically speed up the way developers navigate through and ask questions of large amounts of data.

Amazon Web Services' major DynamoDB product, for instance, implements this technology in a local form (an index scoped to a single partition key) and a global form (an index that spans every partition of a table).
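For readers who have not met the idea, a minimal sketch of a secondary index follows. It shows a table keyed by a primary key plus a sorted side-structure on one other field, so range queries on that field avoid a full scan. The IndexedTable class and the "retweets" field are invented for illustration and say nothing about how Manhattan or DynamoDB build theirs.

```python
import bisect


class IndexedTable:
    """Rows keyed by primary key, plus a sorted secondary index on one field."""

    def __init__(self, indexed_field):
        self.rows = {}       # primary key -> row
        self.indexed_field = indexed_field
        self.index = []      # sorted list of (field value, primary key)

    def put(self, key, row):
        old = self.rows.get(key)
        if old is not None:
            # Drop the stale index entry before inserting the new one.
            i = bisect.bisect_left(self.index, (old[self.indexed_field], key))
            del self.index[i]
        self.rows[key] = row
        bisect.insort(self.index, (row[self.indexed_field], key))

    def range_query(self, low, high):
        """Return rows whose indexed field lies in [low, high]."""
        # (low,) sorts before every (low, key), so this finds the first match.
        start = bisect.bisect_left(self.index, (low,))
        matches = []
        for value, key in self.index[start:]:
            if value > high:
                break
            matches.append(self.rows[key])
        return matches


tweets = IndexedTable("retweets")
tweets.put(1, {"text": "a", "retweets": 5})
tweets.put(2, {"text": "b", "retweets": 500})
tweets.put(3, {"text": "c", "retweets": 150})
tweets.put(1, {"text": "a", "retweets": 900})  # update moves the index entry
popular = [row["text"] for row in tweets.range_query(100, 1000)]
```

The cost shows up on the write path: every put now has to keep the index in step with the rows.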

By adding secondary indexes to Manhattan, Twitter will give its developers the ability to write more sophisticated queries against their large data sets, the indexes of which will be stored in memory.

This means, for instance, that Twitter's commercial arm could build a fast ad system that could present adverts in near-real-time to people according to a multitude of different factors at a lower cost and higher level of flexibility than before.

Systems like secondary indexes will be crucial to the rollout of more advanced, granular advertising display options, and are therefore a critical tool in Twitter's commercial arsenal.

There are challenges, though: secondary indexes will initially be built to offer consistency just locally within a data center, because generating global secondary indexes is computationally impractical. ®
