Codd almighty! How IBM cracked System R

Original URL: https://www.theregister.com/2013/11/20/ibm_system_r_making_relational_really_real/

The tech team that made relational databases a reality

Posted in Databases, 20th November 2013 10:02 GMT

Few teams have maintained such a fierce community spirit as the IBM pioneers of System R. In Silicon Valley in the early 1970s, this team proved that a relational database could actually work.

It's a fascinating story, best known today because IBM failed to capitalise on its research. But it's also a timeless one, since the team had to put chalkboard theories into real working code, in the face of widespread skepticism that a computer could actually outperform a human. And that's an engineering challenge that lives on today.

The story goes like this.

Databases had evolved in response to developments in storage - but not always very quickly. The first databases were implemented on punch cards, with a sequential file technology, before the term existed.

Yet when interactive storage came along, the same sequential model followed it to the new medium. The newfangled magnetic disks and drums gave programmers fast interactive storage, which in turn should have permitted more sophisticated interactive queries. But what users got was really "the same but quicker".

By 1970, the leading commercial database of its time was a hierarchical one: IBM's IMS (Information Management System) had been developed to track inventory for the Apollo moon shot. Yet ground-breaking systems like the mainframe timesharing OS Multics were introducing concepts such as hierarchical file systems and dynamic linking - and showing the way for how sophisticated computers would soon become. Could databases keep up?

In the hierarchical IMS world, pieces of data were linked in a tree-like structure, and IMS is still a success today. The relationships between items of data were easy to draw - but laborious to maintain. These hierarchical databases had other limitations. In a hierarchical model, the location of the data had to be known, the physical links had to be maintained, and making changes to the database was daunting. Queries were crude. So despite the advent of "direct action drives", manipulating the data appeared anything but direct.

As ever in computer science, it was widely appreciated that creating a new level of abstraction might cost something in terms of efficiency, but yield potentially huge payoffs in ease of use and in the applications to which data could be put. There were two contenders in the research community. But which one would prevail?

"There were two camps, and they were warring with each other. Each camp couldn’t understand what the other was talking about. They had completely different assumptions about what was important," recalled Jim Gray, then a Berkeley PhD who had joined IBM research.

One approach was the network model, introduced to the world in 1969 by Charles Bachman, who had developed one of the first DBMSes for General Electric in 1960. Bachman, who will be 90 this year, contributed to many areas of database work including multiuser and programming, and in his sixties created the first CASE tools. Bachman described the programmer as a navigator (ref: ACM Turing Award Lecture 1973). In Bachman's navigational or "network" model, records had two-way pointers.

"We have spent the last 50 years with Ptolemaic information systems. These systems, and most of the thinking about systems, were based on a 'computer centered' concept," said Bachman, accepting the ACM's Turing Award in 1973 - you can read his speech here, it gives a great insight into the time. And check out a 2011 interview with The Register, in which he reflects on his career.

Another approach was even more ambitious, and it was advanced by an expat British mathematician and wartime RAF pilot, Ted Codd. Codd's approach was rather different, favouring a declarative model. This meant the programmer "declared" relationships and the computer would be expected to implement them in bits and bytes. Nothing below that should concern the user.
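
The contrast between the two styles can be sketched in modern terms. What follows is an illustration only, not code from either camp: it uses Python's built-in sqlite3 module as a stand-in for a declarative language, and the sample data, table names and function names are all invented for the purpose. The first function walks the links itself, navigator-style; the second states the question and leaves the route to the engine.

```python
import sqlite3

# Hypothetical sample data: departments, each holding its employee records.
departments = {
    "D1": {"name": "Research", "employees": [
        {"name": "Ada", "salary": 120},
        {"name": "Bob", "salary": 90},
    ]},
    "D2": {"name": "Sales", "employees": [
        {"name": "Cy", "salary": 110},
    ]},
}


def navigational_high_earners(tree, threshold):
    """Navigational style: the program follows the links itself."""
    found = []
    for dept in tree.values():            # visit each parent record
        for emp in dept["employees"]:     # follow the links to its children
            if emp["salary"] > threshold:
                found.append((emp["name"], dept["name"]))
    return sorted(found)


def declarative_high_earners(tree, threshold):
    """Declarative style: state the question, let the engine find a route."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dept (id TEXT PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER, dept_id TEXT)")
    for dept_id, dept in tree.items():
        conn.execute("INSERT INTO dept VALUES (?, ?)", (dept_id, dept["name"]))
        for emp in dept["employees"]:
            conn.execute("INSERT INTO emp VALUES (?, ?, ?)",
                         (emp["name"], emp["salary"], dept_id))
    rows = conn.execute(
        "SELECT emp.name, dept.name FROM emp "
        "JOIN dept ON dept.id = emp.dept_id "
        "WHERE emp.salary > ? ORDER BY emp.name",
        (threshold,),
    ).fetchall()
    conn.close()
    return rows
```

Both functions return the same pairs of employee and department names. Only the first has to change if the data is reorganised.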

Codd had developed two languages – a relational algebra and a relational calculus – to express extremely complex queries.

"Codd had some kind of strange mathematical notation, but nobody took it very seriously," mused Don Chamberlin, in our 2003 Reg obituary of Codd. Chamberlin would become a research colleague of Codd's, and co-inventor of SQL.

It seemed unthinkable that IBM, which then pretty much was the commercial data processing industry, and which had invented so much direct access storage, would be deaf to the idea. And it wasn't.

A 1962 postcard of IBM's San Jose research lab, home of the disk drive and the relational database amongst other inventions. The researchers moved to a 700-acre site in the hills at Almaden in 1986. Much of Cottle Road was destroyed in a suspicious fire in 2008.

Originally data was just stuff that belonged to an application, and even though the commercial database was an IBM success, a little of that attitude lingered at mighty IBM, too. In 1973 Big Blue decided to do something about this, and consolidated its database research in San Jose, California. IBM gave its research staff beautiful buildings in serene locations - and 5600 Cottle Road, San Jose was in keeping with tradition. Codd had joined IBM's research labs in 1970, and the move brought him into contact with some clever engineers.

"Codd gave a speech [to us] where he said, 'Sure, you can express a question by writing a navigational plan. But if you think about it, the navigational plan isn’t really the essence of what you are trying to accomplish. You are trying to find the answer to some question, and you are expressing the question as a plan for navigating. Wouldn’t it be nice if you could express the question at a higher level and let the system figure out how to do the navigation?’" recalled Chamberlin in a 2001 interview.

"That was Codd’s basic message. He said, ‘If you raise the level of the language that you use to ask questions to a higher and less procedural level, then your question becomes independent of the plan.’"

Codd only makes one citation in his 1970 paper, referencing work by David Childs on set theory published in 1968. One of the unsung heroes of the story, Childs had been at US defence research lab ARPA in 1965, and at the time ARPA wanted to think about treating data mathematically.

"Childs' 1968 papers and Codd's 1970 paper discussed structure (independent sets, no fixed structure, access by name instead of by pointers) and operations (union, restriction, etc). Childs' papers included benchmark times for doing set operations on an IBM 7090. Codd's 1970 paper introduced normal forms, and his subsequent papers introduced the integrity rules," notes author Ken North.

Access to Childs' 1968 paper was restricted, however. Childs himself would found a successful database company in 1970 with a former president of Chrysler; Hitachi bought the company in 1984.

Codd got to work with a core team that soon included Gray.

Codd's ideas were considered outré by many

"Ted’s work was mainly of academic interest I would say,” Don Chamberlin reflected later. "It was considered to be a little bit out of the mainstream, somewhat mathematical and theoretical in its flavour."

The industry was developing standards through the CODASYL consortium, whose Database Task Group (DBTG) set out to create a standard database language. The result was referred to as the CODASYL, or DBTG, model.

Chamberlin added: "CODASYL was based on a network data model. It was a little bit more general than the hierarchic data model because it didn’t have the constraint that data had to be organized into a hierarchy—the records could be organised in whatever way you like."

Codd's relational model was fine in theory, but it threw a massive research problem at the engineers: cracking a working implementation. So IBM created a project with a dozen PhDs: System R (R for Relational). This would prove a relational database was possible, and "not only possible but efficient".

System R made Codd's ideas intelligible

There was a striking disparity between Codd's ideals and his practice.

"Ted couched [his ideas] in mathematical symbolism and terminology. In his original query languages he used mathematical notation, like universal quantifiers and existential quantifiers, and he used Greek letters a lot. Things like that just give the appearance of something being very esoteric and difficult to deal with. Whereas, actually, what he was trying to do was to make queries easier to write, not harder."

It was System R that took Codd's ideas and made them intelligible, by creating a simple query language. The researchers' query language was initially called SQUARE, or Specifying Queries As Relational Expressions. Even SQUARE had some mathematical notation. Its successor, Structured English Query Language, was based exclusively on English words.

"The development of a language based on English keywords, which you could type on your keyboard and which you could read and understand intuitively, was a breakthrough that made it much easier for people to understand the underlying simplicity of Ted’s idea. It didn’t really make the ideas any more simple; it just made them look simple."

System R member Jim Gray in 1977

"I thought that it was a crazy idea, but it is good for researchers to work on crazy ideas and so I thought maybe something would come of it. We worked on it for about six months and concluded that we didn’t see how by changing the level of abstraction downwards that we were going to make things better. It looked like things would be a lot worse."

System R spent 1976 and 1977 testing the single-user system. By 1978 and 1979 IBM was into the third phase of the project. Testers reported that while performance was slow, the woes of the hierarchical model had been banished: it was easier to design, load and then change a database.

In their 1981 paper "A History and Evaluation of System R" (PDF), Chamberlin and his team wrote: "The performance degradation must be traded off against the advantages of normalisation in which large database tables are broken up into smaller parts to avoid redundancy, and then joined together by the view mechanism and other applications."
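
That trade-off can be made concrete with a small sketch. This is not System R code: it uses Python's built-in sqlite3 module, and the supplier and shipment tables, their contents and the shipment_report view are invented for illustration. Each supplier's city is stored once rather than repeated on every shipment row, and a view joins the pieces back together for anyone who wants the wide table. That join is the extra work the paper refers to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised layout: supplier details live in exactly one place.
    CREATE TABLE supplier (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE shipment (supplier_id INTEGER, part TEXT, qty INTEGER);

    INSERT INTO supplier VALUES (1, 'Acme', 'San Jose');
    INSERT INTO supplier VALUES (2, 'Bolt', 'Berkeley');
    INSERT INTO shipment VALUES (1, 'disk', 10);
    INSERT INTO shipment VALUES (1, 'drum', 4);
    INSERT INTO shipment VALUES (2, 'disk', 7);

    -- The view rejoins the small tables into one wide result on demand.
    CREATE VIEW shipment_report AS
        SELECT s.name, s.city, sh.part, sh.qty
        FROM shipment AS sh
        JOIN supplier AS s ON s.id = sh.supplier_id;
""")

# One update in one row; every shipment seen through the view reflects it.
conn.execute("UPDATE supplier SET city = 'Almaden' WHERE id = 1")

report = conn.execute(
    "SELECT name, city, part, qty FROM shipment_report ORDER BY name, part"
).fetchall()
for row in report:
    print(row)
```

Had the city been copied onto every shipment row, the same change would have meant finding and rewriting each copy.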

The SQL language was also a hit. The System R team had made an extraordinary amount of progress: SQL was compiled into machine code. Locking and concurrency issues were tackled, too.

The Berkeley Three

The relational model began to intrigue the computer science community. For those with a maths background, Codd's ideas weren't forbidding. Gray would recall:

"What happened is that the academic community found DBTG and IMS pretty complicated. It wasn’t elegant and there wasn’t a theory associated with it. You couldn’t prove theorems about it, or at least they didn’t figure out how to. And along came the relational model with query optimization, and transactions, and security. The data model was simple enough that you could state it and then start reasoning about it."

Two academics on the other side of the Bay, at Berkeley, had also read Codd's papers, and were trying to do the same thing as the System R team: put those ideas into practice. The early description of System R had been published. The core team at Berkeley was Mike Stonebraker, a Berkeley researcher (and, like Codd and Childs, a Michigan graduate), Gene Wong, his professor, a small number of graduates, and just one full-time programmer.

They called their project INGRES, for INteractive GRaphics and REtrieval System. The team developed its own query language, QUEL. Through necessity, they targeted more modest systems than IBM's mainframes, running an experimental OS called Unix on a DEC PDP-11/40. The source code would be made available to anyone who paid a modest fee.

"Jim Gray has a PhD from Berkeley and was around during 1971 and part of 1972, so Eugene and I got to know him and then he went to IBM Research and joined the System R team," Stonebraker recalled. "So it was mostly his doing that we would go to IBM Research in San Jose or they would come up here. So we probably met every six months, and so we knew what they were doing, they knew what we were doing."

In Stonebraker's modest words, Ingres offered an "unsupported operating system and an unsupported database and no COBOL". Nevertheless it found one significant commercial user - the New York Telephone Company - and by 1978 and 1979 the academics got serious about commercialising it.

At the time, most independent software companies were private entities. For the venture capitalists, the risk was high: IBM might destroy their newly floated business. This was the background against which Stonebraker started a company, Relational Technology Inc, in 1980.

The System R work had been published in the IBM Systems Journal, where mathematician Ed Oates had seen it, and discussed it with an ex-IBM programmer and database developer at Ampex called Larry Ellison.

Inspired by Codd's work and the work of the IBM System R team, Ellison and Oates later co-founded a startup with Bob Miner which they called Software Development Laboratories. By 1979, they had changed its name to Relational Software, Inc (RSI) – which became Oracle Systems Corporation in 1982, going public in 1986 and becoming known simply as Oracle Corporation in 1995.

Why not IBM?

IBM's commercialisation of its own work was slower than that of the startups, but not as slow as myth would have it. By 1981, when the ACM published "A History and Evaluation of System R", IBM had still not brought a product to market. A relational product was in testing in 1982, and eventually announced as DB2 in 1983.

But others had created the market, with Ingres gaining a reputation for technical leadership, and Oracle for aggressive marketing.

"And at the time IBM had way more throw weight than they have now, and so in one 'swell foop' they had enough throw weight that that meant SQL was the answer, and anybody with any other query language had to convert to SQL. Larry Ellison very skilfully marketed the heck out of: 'We're selling SQL now’," said Stonebraker.

According to Stonebraker, things could have gone a lot differently: "Oracle and Bill Gates are experts at selling futures to stall the market. Oracle did that very successfully. In my opinion, if IBM hadn’t announced DB2, Oracle and Ingres would've switched places within a couple of years."

For Oracle v3 in 1983 the product was rewritten in C, and was able to run on many different platforms - a huge marketing fillip. Oracle also convinced the market that it was most closely aligned with the coming client/server computing trend - and didn't care what the client might be.

Stonebraker thought IBM had not resolved fundamental issues with the SQL language. Nor did he care for the way that IBM put a relational database on top of hierarchical databases. As he would explain, if IBM had made System R a high level UI in IMS, it would have been cleaner and smarter. But IMS’s fast, clean and simple hierarchical team had another idea: a logical database. According to Stonebraker, this was a "kludge that made it impossible to put a clean semantic query language on top of IMS".

IBM: Too bad... or just too slow?

Bruce Lindsay, IBM

Did IBM hobble its relational efforts, or was it just slow? Veterans still hotly debate this point today. Entire markets, such as the disk drive industry, were created by ex-IBM staffers because the company was so slow. Bruce Lindsay, one of the younger members of System R who stayed with IBM to become a Fellow Emeritus, certainly thought it was no accident.

"IBM protects weak products; protects its own products," he told the System R reunion in 1995. "Admit it: IBM will not attack its own products, even when they're weak and there's better technology and they have it. Ask Mike about RISC. Ask everybody in here about relational."

There's no doubt management lacked the entrepreneurial culture that allowed Ingres and Oracle to grow so rapidly.

It had been a long, hard journey from System R to DB2, but the engineers had vindicated the decision. "Ted had the key insights and developed the math theory on which RDBMS were based," System R pioneer Don Chamberlin told me in 2003, on Codd's death. "System R were the carpenters who came along and implemented the ideas. We created an industrial strength platform on them."

His summary makes it all sound simple: "It required a software layer to be implemented that could take a high level language that could map it down to an efficient execution plan."

But that doesn't explain the enormity of the many problems System R had to overcome to achieve this.

Chamberlin said: "At first it wasn't clear you could build an optimizer that would be as efficient as a human programmer. We had the same arguments that you have when people implemented languages like Fortran - I can do a better job with my registers."
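
The idea of leaving the plan to an optimizer can be seen in miniature in a modern SQL engine. The sketch below uses SQLite through Python's built-in sqlite3 module purely as a stand-in (it is not System R's optimizer), and the parts table, the idx_parts_city index and the plan_for helper are made up for the example. The query text never changes; once an index exists, the engine chooses a different plan by itself.

```python
import sqlite3

QUERY = "SELECT name FROM parts WHERE city = ?"


def plan_for(conn, sql, params):
    """Return SQLite's description of how it intends to run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parts (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO parts VALUES (?, ?)",
    [("disk", "San Jose"), ("drum", "Berkeley")],
)

# With no index available, the only option is to scan the whole table.
plan_before = plan_for(conn, QUERY, ("San Jose",))

# Add an index. The query is untouched, but the engine now has a choice.
conn.execute("CREATE INDEX idx_parts_city ON parts (city)")
plan_after = plan_for(conn, QUERY, ("San Jose",))

answer = conn.execute(QUERY, ("San Jose",)).fetchall()

print("before:", plan_before)
print("after: ", plan_after)
print("answer:", answer)
```

The exact wording of the plan text varies between SQLite versions, so treat the printed plans as indicative. The point is that the programmer never rewrote the question.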

A glimpse of the enormity of the team's challenges comes from a tribute written in honour of Jim Gray, who disappeared at sea in 2007 and has never been found. "What really distinguishes Jim's work is the number of times in which he has defined what the hard problems were: recovery, transaction commit, concurrency control, why systems fail, sorting, benchmarking," wrote David Lomet, a colleague at DEC and Microsoft.

The System R team had to grapple with them all. ®

Fun further reading

The 1995 System R reunion site is a wealth of material.

This talk by Don Childs (video) gives another overview of the field.

IEEE members and guests can access a special issue of the journal IEEE Annals of the History of Computing from last year, devoted to relational history, including contributions covering IBM, Oracle and Ingres.

And for free, here's Verity Stob's "First Amongst SQLs".

"A History and Evaluation of System R" (PDF) - Chamberlin et al, Communications of the ACM (Computing Practices section), October 1981.