Codd almighty! How IBM cracked System R
The tech team that made relational databases a reality
Few teams have maintained such a fierce community spirit as the IBM pioneers of System R. In Silicon Valley in the early 1970s, this team proved that a relational database could actually work.
It's a fascinating story, best known today because IBM failed to capitalise on its research. But it's also a timeless one, since the team had to turn chalkboard theories into real working code, in the face of widespread scepticism that a computer could actually outperform a human. And that's an engineering challenge that lives on today.
The story goes like this.
Databases had evolved in response to developments in storage - but not always very quickly. The first databases were implemented on punch cards, using sequential file technology, before the term "database" even existed.
Yet when interactive storage came along, the same sequential model followed to the new medium. The newfangled magnetic disks and drums gave programmers fast interactive storage, which, in turn, should have permitted more sophisticated interactive queries. But what users got was really "the same but quicker".
By 1970, the leading commercial database was a hierarchical one: IBM's IMS (Information Management System), which had been developed to track inventory for the Apollo moon shot. Yet ground-breaking systems like the mainframe timesharing OS Multics were introducing concepts such as hierarchical file systems and dynamic linking - and showing just how sophisticated computers would soon become. Could databases keep up?
In the hierarchical IMS world, pieces of data were linked in a tree-like structure. IMS is still a success today. The relationships between items of data were easy to draw - but laborious to maintain. These hierarchical databases had other limitations. In a hierarchical model, the location of the data had to be known, the physical links had to be maintained, and making changes to the database was daunting. Queries were crude. So despite the advent of "direct access" drives, manipulating the data appeared anything but direct.
As ever in computer science, it was widely appreciated that creating a new level of abstraction might cost something in terms of efficiency, but yield potentially huge payoffs in ease of use and in the applications to which data could be put. There were two contenders in the research community. But which one would prevail?
"There were two camps, and they were warring with each other. Each camp couldn’t understand what the other was talking about. They had completely different assumptions about what was important," recalled Jim Gray, then a Berkeley PhD who had joined IBM research.
One approach was the network model, introduced to the world in 1969 by Charles Bachman, who had developed one of the first DBMSes for General Electric in 1960. Bachman, who will be 90 this year, contributed to many areas of database work including multiuser and programming and, in his sixties, created the first CASE tools. Bachman described the programmer as a navigator (ref: ACM Turing Award Lecture 1973). In Bachman's navigational or "network" model, records had two-way pointers.
"We have spent the last 50 years with Ptolemaic information systems. These systems, and most of the thinking about systems, were based on a 'computer centered' concept," said Bachman, accepting the ACM's Turing Award in 1973. His speech gives a great insight into the time, and in a 2011 interview with The Register he reflects on his career.
Another approach was even more ambitious, and it was advanced by an expat British mathematician and wartime RAF pilot, Ted Codd. Codd's approach was rather different, favouring a declarative model. This meant the programmer "declared" relationships and the computer would be expected to implement them in bits and bytes. Nothing below that should concern the user.
Codd had developed two languages – a relational algebra and a relational calculus – to express extremely complex queries.
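To give a flavour of what a relational algebra looks like, here is a minimal sketch in Python. It is not Codd's notation: the Employee relation, the sample rows and the helper names union, restrict and project are all hypothetical, chosen only to show relations treated as sets of rows and combined with set-style operators.

```python
from collections import namedtuple

# Hypothetical relation schema: each row is addressed by attribute name,
# not by its position on disk.
Employee = namedtuple("Employee", ["name", "dept"])

# Two hypothetical relations, each an unordered set of rows.
san_jose = {Employee("Ann", "research"), Employee("Raj", "storage")}
yorktown = {Employee("Lee", "research"), Employee("Ann", "research")}

def union(r, s):
    """Rows that appear in either relation; duplicates collapse."""
    return r | s

def restrict(r, predicate):
    """Keep only the rows for which the predicate holds."""
    return {row for row in r if predicate(row)}

def project(r, attribute):
    """Keep a single named attribute of every row."""
    return {getattr(row, attribute) for row in r}

everyone = union(san_jose, yorktown)
researchers = restrict(everyone, lambda row: row.dept == "research")
research_names = project(researchers, "name")
```

Each operator takes relations in and hands a relation back, so queries can be built by nesting them, which is roughly what made the algebra expressive enough for complex questions.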
"Codd had some kind of strange mathematical notation, but nobody took it very seriously," mused Don Chamberlin, in our 2003 Reg obituary of Codd. Chamberlin would become a research colleague of Codd's, and co-inventor of SQL.
It seemed unthinkable that IBM, which then pretty much was the commercial data processing industry, and which had invented so much direct access storage, would be deaf to the idea. And it wasn't.
A 1962 postcard of IBM's San Jose research lab, home of the disk drive and the relational database amongst other inventions. The researchers moved to a 700-acre site in the hills at Almaden in 1986. Much of Cottle Road was destroyed in a suspicious fire in 2008.
Originally data was just stuff that belonged to an application, and even though the commercial database was an IBM success, a little of that attitude lingered at mighty IBM, too. In 1973 Big Blue decided to do something about this, and consolidated its database research in San Jose, California. IBM gave its research staff beautiful buildings in serene locations - and 5600 Cottle Road, San Jose was in keeping with tradition. Codd had joined IBM's research labs in 1970, and the move brought him into contact with some clever engineers.
"Codd gave a speech [to us] where he said, 'Sure, you can express a question by writing a navigational plan. But if you think about it, the navigational plan isn’t really the essence of what you are trying to accomplish. You are trying to find the answer to some question, and you are expressing the question as a plan for navigating. Wouldn’t it be nice if you could express the question at a higher level and let the system figure out how to do the navigation?’" recalled Chamberlin in a 2001 interview.
"That was Codd’s basic message. He said, ‘If you raise the level of the language that you use to ask questions to a higher and less procedural level, then your question becomes independent of the plan.’"
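Codd's distinction between a navigational plan and a higher-level question can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than anything from System R: the supplier and part data, and the function names navigational and declarative, are hypothetical, and SQLite stands in for "the system" that works out the navigation.

```python
import sqlite3

# Hypothetical sample data: parts and the suppliers that stock them.
SUPPLIERS = {1: "Acme", 2: "Bolt Co"}
PARTS = [
    {"name": "gear", "supplier_id": 1, "qty": 40},
    {"name": "shaft", "supplier_id": 2, "qty": 5},
    {"name": "valve", "supplier_id": 1, "qty": 0},
]

def navigational(min_qty):
    """Spell out the plan: visit each part, then follow its link to a supplier."""
    found = []
    for part in PARTS:
        if part["qty"] >= min_qty:
            supplier = SUPPLIERS[part["supplier_id"]]
            found.append((part["name"], supplier))
    return sorted(found)

def declarative(min_qty):
    """State only the question; SQLite decides how to scan and join."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE supplier (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE part (name TEXT, supplier_id INTEGER, qty INTEGER)")
    db.executemany("INSERT INTO supplier VALUES (?, ?)", SUPPLIERS.items())
    db.executemany("INSERT INTO part VALUES (:name, :supplier_id, :qty)", PARTS)
    rows = db.execute(
        "SELECT part.name, supplier.name FROM part "
        "JOIN supplier ON supplier.id = part.supplier_id "
        "WHERE part.qty >= ? ORDER BY part.name",
        (min_qty,),
    ).fetchall()
    db.close()
    return rows
```

Both functions return the same answer, but only the first would need rewriting if the data were stored or linked differently; the query in the second says nothing about how the rows are reached.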
Codd only makes one citation in his 1970 paper, referencing work by David Childs on set theory published in 1968. One of the unsung heroes of the story, Childs had been at US defence research lab ARPA in 1965, and at the time ARPA wanted to think about treating data mathematically.
"Childs' 1968 papers and Codd's 1970 paper discussed structure (independent sets, no fixed structure, access by name instead of by pointers) and operations (union, restriction, etc). Childs' papers included benchmark times for doing set operations on an IBM 7090. Codd's 1970 paper introduced normal forms, and his subsequent papers introduced the integrity rules," notes author Ken North.
Access to Childs' 1968 paper was restricted, however. Childs himself would found a successful database company in 1970 with a former president of Chrysler; Hitachi bought the company in 1984.
Codd got to work with a core team that soon included Gray.
Codd's ideas were considered outré by many
"Ted’s work was mainly of academic interest I would say," Don Chamberlin reflected later. "It was considered to be a little bit out of the mainstream, somewhat mathematical and theoretical in its flavour."
The industry, meanwhile, was working on a standard database language through the CODASYL consortium and its DBTG (Database Task Group). The result became known as the CODASYL model.
Chamberlin added: "CODASYL was based on a network data model. It was a little bit more general than the hierarchic data model because it didn’t have the constraint that data had to be organized into a hierarchy—the records could be organised in whatever way you like."
Codd's relational model was fine in theory, but it threw a massive research problem at the engineers: cracking a working implementation. So IBM created a project with a dozen PhDs: System R (R for Relational). This would prove a relational database was possible, and "not only possible but efficient".