Apache Cassandra at 10: Making a community believe in NoSQL
A decade of technical promise and open-source fall-outs
Ten years ago this month, when Lehman Brothers was still just about in business and the term NoSQL wasn't even widely known, let alone an irritant, Facebook engineers open-sourced a distributed database system named Cassandra.
Back then, the idea that huge numbers of companies would need a scalable database was almost laughable – and the grip that traditional relational database systems held at the time is reflected in the mythical moniker given to what would become one of the first of many databases designed to run on a cluster of machines.
Named after the Greek figure who was cursed to utter the truth but was never believed, Cassandra might seem an odd choice for a system whose raison d'être is believability – but it delivered a nice dig at the stalwarts of the RDBMS world… and their trust in a false Oracle.
Today, Cassandra – now under the umbrella of the Apache Software Foundation (ASF) – is regularly ranked in DB-Engines' top 10 and is used by big-name firms like Uber, Twitter and Netflix.
After being driven by Cassandra-based biz Datastax for the majority of its lifetime, the project recently reached a turning point after a falling out between the firm and the ASF.
Now, the project is readjusting to life without a single vendor driving it forward, facing new competition and adapting to a rapidly changing tech landscape.
Who needs a scalable database?
Casting back to 2008, Facebook engineers Avinash Lakshman and Prashant Malik were searching for a way to solve the inbox search problem: how to store reverse indices of all the Facebook messages sent and received by users.
"The amount of data to be stored, the rate of growth of the data and the requirement to serve it within strict SLAs made it very apparent that a new storage solution was absolutely essential," Lakshman wrote at the time.
"The solution needed to scale incrementally and in a cost effective fashion. Traditional data storage solutions just wouldn’t fit the bill."
The goal was to develop a scalable, high performance, high availability database system – and the first deployment of Cassandra within Facebook was for the inbox search system storing terabytes of indexes across a cluster of more than 600 cores and 120 TB of disk space.
At the same time, Jonathan Ellis – who would go on to co-found Datastax – was evaluating database technologies for his then employer Rackspace, to tackle issues with scalable storage. After rejecting HBase, CouchDB and MongoDB, he hit upon Cassandra, working on it for about 18 months before forming Datastax.
"It occurred to me, as application development was moving to this cloud application world, this was a problem everyone would run into as they needed to scale to their needs," he told The Reg. "It wasn’t just going to be an exception that the eBays, Facebooks were going to face – it was going to start affecting mainstream development."
However, he said that not everyone agreed. "When we started raising money for Datastax, the most common pushback we got from venture capitalists was, 'There’s five companies in the world that are going to need a scalable database, and Google already has one, Amazon has one, so who's your market going to be?' I think the passage of time has vindicated [our] vision."
A deeply invested community
Having worked on Cassandra before it was brought into the ASF, Ellis was in a good position to request he be made a committer when it was; a year later he became the first project chair, a role he held until 2016.
Back then, there wasn't much of a Cassandra community – something Ellis puts down to Facebook's reasons for releasing the technology. "They weren't looking to be a database vendor. It's a valid way to open-source but there wasn't much of a community."
One of Datastax’s early hires was Patrick McFadin, who quickly settled into the role of community builder, and over the next few years numbers grew and success beckoned; Cassandra was effectively re-written from the 1.0 version and the project can point to a number of technical highs.
"Early on we built a fantastic community of people who were interested in the technology and were using it to solve challenging problems," said Aaron Morton, CEO of Cassandra consultancy The Last Pickle, who got involved in the project at about version 0.3. "The community have always been deeply invested in the technology."
At the outset, the group spent a lot of energy "explaining the compromises and advantages of distributed databases, to get people used to the idea that they don't always need atomic transactions or have to store data in Third Normal Form," Morton said.
So committed was this group of initial adopters, Morton said, that they needed some convincing to get behind the changes brought in with the creation of the Cassandra Query Language (CQL) in version 1.2.
CQL is widely pointed to as the highest point in the technology’s decade. Andrew Cobley, a senior lecturer at the University of Dundee – who discovered Cassandra while trying to decide which NoSQL database to teach to students – describes it as a "game changer".
"It was a really welcome move that made it so much easier to do the programming," he said. "You still had to design your databases and your tables to be efficient, but you didn't then have to struggle with this completely arcane way of trying to query it. If you understood SQL, you cut it down - you had to understand the rules of Cassandra, but once you'd done that, interfacing just felt like with a SQL database."
Another highlight, Cobley said, was the introduction of virtual nodes (vnodes), to simplify management of clusters, while Datastax's Ellis pointed to the implementation of lightweight transactions using a Paxos consensus model, which he reckoned was the first production-ready open-source implementation of Paxos.
Monopolising the community
However, with the smooth comes the rough. Cobley noted that there have been some “minor things that haven’t worked quite so well as they should have” – but Ellis argued that most of the technical problems were “fairly tractable” by getting the right people in the room.
Where things start to get sticky are the non-technical issues, and for Apache Cassandra the stickiest has to be the 2016 rift between Datastax and the foundation.
At the heart of the spat was something not uncommon for the open-source world: the question of how much control a single vendor should have over the direction a project goes in, and when the foundation should get involved.
This is a fine line to tread, and one that Datastax appears to have over-stepped more than once, whether by intention or error.
As one person close to the project, who asked not to be named, said: “There were one too many accusations of strong-arming the project for the ASF board to not take some sort of action.”
For his part, Ellis said the ASF board of directors felt his firm was "monopolising the community, and that – even if Datastax wasn't doing anything nefarious – there was a potential for that in having the founder of Datastax as the PMC [Cassandra Project Management Committee] chair".
Ellis is fairly frank about the political challenges involved. "I wasn’t completely blind to the tension there… [and] was just crossing my fingers that I could stay on the right side of the line. And with mixed success, I guess."
In any event, he said that after seven or so years "it was time to get some new blood and more diversity in the project," adding: "No hard feelings."
In theory, the departure of Datastax could leave the door open for another vendor to step in and lead the way, but observers told The Register that it doesn't seem likely.
"By design and necessity, Cassandra is complex enough that you can't just 'bring someone onboard' to offer support in any meaningful way," our source said. "It takes a new developer, even a talented one, over a year to come up to speed on the major components of the system. Someone wanting to enter this market would therefore have to buy their way in by peeling existing talent out of the community."
Getting along without the marketeers
The outcome is that Cassandra is probably now the only popular big data project that doesn't have a vendor involved, which poses organisational, financial and technical problems for the community.
Ellis estimated Datastax had contributed about 85 per cent of the Apache Cassandra code, but it's unsurprising to hear that insiders say this has dropped off, as the firm refocuses on its private fork and enterprise version.
For their part, Datastax execs emphasised that they remain committed to the project, with Ellis pointing out that they still sell the database. "We definitely want Cassandra to succeed, not to burn bridges. There's a virtuous synergy between Datastax and the Cassandra community."
But it's indisputable that the PMC has lost infrastructure and resource: there isn’t a single person paid to do testing and verification; everyone has a different day job; and it’s hard to make sure that disparate groups in different companies don’t duplicate effort.
Meanwhile, the Cassandra Summit, previously organised by Datastax, hasn’t been run for the past two years. Efforts are underway to hold some form of event, but it's likely this will be a lower budget affair in future.
"No doubt Datastax taking a lower profile was challenging," said Morton. "Ultimately though it resulted in a more diverse community as others stepped in to fill the gaps.
"Nate McCall, my co-founder from The Last Pickle, was elected the PMC chair and with a lot of help from the PMC worked to expand the list of committers and encouraged companies that rely on Cassandra to contribute more. In addition we are still getting important contributions from large companies such as Netflix, Uber, and Instagram."
One can imagine the NDAs and corporate charters involved in wrangling with a line-up like that, and building up the trust of the various parties takes time.
But with big-name brands so invested in the technology, it's unlikely Cassandra will become irrelevant any time soon.
There are other advantages to being out from under the umbrella of a single vendor, with our insider saying that features are more likely to be community-driven, rather than rammed into the project by "asshole marketeers".
Indeed, the next release is expected to be composed solely of user-driven features that have been developed by large-scale operations.
"They’ll be trialled by fire," the source said. "When it's released, it's going to have been running in production for a couple of weeks."
In 2008, it would have been hard to imagine the world in which Cassandra now sits.
"The landscape is totally different," said Cobley. "When we started, we started getting old PCs, trying to get it installed by installing every single bit and piece, changing all the configuration files; now when I get students to run Cassandra in the cloud, you just type a single Docker command. And that's it up and running."
Ellis agreed, saying that "if you asked me in 2008, where’s Cassandra going to be in 10 years, I don't know I would have gotten very close to where we are today", pointing in particular to the impact of the cloud.
Cassandra, he said, is in a good position to take advantage of what he sees as a maturation in the move to the cloud.
"It's always been recognised as best in class at running a cluster and replicating a cluster across multiple data centres. If I have my data in a data centre now and want to migrate to cloud over the next three years, Cassandra lets you do that in a much more straightforward way," he said.
But Cassandra's popularity also brings competition and, just in time for its tenth anniversary year, a new drop-in replacement that claims to be faster has entered the market: ScyllaDB.
Proponents of Cassandra argue that the features-led approach of the current PMC will help to fight off such competitors, while others point to Cassandra's maturity, its established engineers and its supportive community.
Morton added that it was a market validation. "Cassandra is the standard to judge other platforms by. Looking at the long view, Apache Cassandra helped to make the idea of distributed, fault tolerant, databases common place in the industry and products such as ScyllaDB and others are expanding that category."
At the same time, the project is facing a more demanding user base, which is bound to shape its future.
"Apache Cassandra was a bleeding edge project in the early days," said Morton. "Ten years in, most of that promise has come true; Cassandra is an established technology and the idea of running distributed databases is now common.
"In the past the community was happy when it worked; we've moved on and they now expect more than 'it works'." ®