Original URL: https://www.theregister.com/2007/07/04/evolutionary_database_design/

Evolutionary database design

'Technically trivial'

By Matt Stephens

Posted in Software, 4th July 2007 07:02 GMT

Evolutionary Database Design (EDD, aka database refactoring) is a controversial topic in the software development world. It’s a sore point among some DBAs because it promotes making changes to schemas that contain live customer data, sometimes with more than one application using the same database.

Given this controversy, and the resistance to iterative techniques among many database specialists, it seemed fair to pitch some of the commonest criticisms to Scott Ambler, co-author, with Pramod Sadalage, of the book Refactoring Databases: Evolutionary Database Design.

[Photo: Scott Ambler at Machu Picchu]

Matt Stephens, for Reg Developer: Hi, Scott. First off, could you tell us why you feel there's a need for an evolutionary approach to database design?

Scott: First and foremost, there’s nothing special about databases. Just as evolutionary approaches to design work incredibly well for other aspects of software development, they also work well for databases. As Pramod and I show in Refactoring Databases, it's actually a technically trivial thing to evolve an existing relational database schema, even when hundreds of disparate systems access it.

Unfortunately, the traditional data community assumed that evolving database schemas is a hard thing to do and as a result never thought through how to do it. Worse yet, this assumption led to other questionable assumptions, such as the need to model the data aspects of a system in detail early in the lifecycle. The end result is that many data professionals have missed out on the fundamental ideas and techniques that application developers pretty much take for granted these days. They've got a lot of catching up to do.

Reg Developer: In a sense, the database schema is a reflection of the problem domain: so once designed, it generally shouldn't need to change very much – as long as it's been designed thoughtfully, of course, and grew out of collaboration between the customer, end-users, business analysts and so on. Given this, is there really a need for EDD in a software development project?

Scott: I’ll let the existing state of production databases speak to that. How often do you find databases with tables that have columns that are no longer being used, if they ever were? Or columns that are being used for several purposes because it was too hard to add new columns when they were required? Or tables with data quality problems? The Data Warehousing Institute (TDWI) estimated a few years ago that data quality problems result in a $611 billion (yes, billion) loss every year for US-based organizations. This tells me that in practice we're not very good at getting the design right up front, and/or that the requirements do in fact change over time. It's incredibly clear that organizations need a safe way to fix existing database problems, and the approach presented in Refactoring Databases is the only viable one on the table. Traditional approaches to database design clearly aren't working out for us.

Reg Developer: Do you see database refactoring being applied to a fine-grained development approach such as test-driven development (TDD), so you'd start with a test, create a table, write another test, add a column, a constraint, and so on?

Scott: Absolutely, and in fact I have an article in the May/June 2007 issue of IEEE Software which addresses this specific topic.
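By way of illustration, a test-first rhythm for schema growth might look something like the sketch below (the table, column, and constraint names are hypothetical, not taken from Ambler's article):

    -- Step 1: a test that selects from customer fails because the table
    -- doesn't exist yet; create the simplest table that makes it pass.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY
    );

    -- Step 2: the next test reads a customer's name; add the column.
    ALTER TABLE customer ADD COLUMN name VARCHAR(100);

    -- Step 3: a test inserts a blank name and expects it to be rejected;
    -- introduce a constraint to make that test pass.
    ALTER TABLE customer
        ADD CONSTRAINT customer_name_not_blank CHECK (name <> '');

Each schema change is driven by a failing test, exactly as in test-driven application code.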

Reg Developer: I spoke to a few DBAs and developers prior to this interview, and the main concern that cropped up over and over again was that you're putting a system into production, then making changes that require a schema update and/or data migration; and this may happen as often as every couple of weeks, depending on the release cycle.

Scott: This is why these people need to read Refactoring Databases. First, you develop and test your changes in a developers' sandbox, and only when you believe the changes are working do you promote them into higher-level sandboxes, including production. Developing directly against a live database is clearly an incredibly risky proposition, so don’t do that.

Second, you do in fact need a viable strategy for evolving an existing production database, as the TDWI figures show. The strategy is to have a transition window during which both the original schema and the fixed schema exist in parallel in production and the database automatically keeps everything up to date. This window may be months or even years long, and the purpose is to give the development teams responsible for the accessing systems time to update their systems and deploy them into production.

Once the systems are updated, the old schema is removed. This is basically the deprecation strategy that Sun Microsystems uses to evolve the Java Development Kit (JDK), and there are hundreds of thousands of systems that use that. In the book, we provide solutions, including full source code, for over 60 refactorings which take this approach.
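To make the mechanics concrete, here is a rough sketch of the transition window for a simple Rename Column refactoring, written in Oracle-style trigger syntax (the customer table and its fname/full_name columns are invented for illustration, not taken from the book):

    -- Add the new column alongside the old one and migrate the data.
    ALTER TABLE customer ADD (full_name VARCHAR2(100));
    UPDATE customer SET full_name = fname;

    -- Scaffolding: during the transition window, keep both columns in
    -- sync regardless of which one an accessing system writes to.
    CREATE OR REPLACE TRIGGER sync_customer_full_name
    BEFORE INSERT OR UPDATE ON customer
    FOR EACH ROW
    BEGIN
        IF :NEW.full_name IS NULL THEN
            :NEW.full_name := :NEW.fname;
        ELSE
            :NEW.fname := :NEW.full_name;
        END IF;
    END;

Both schemas stay live until the agreed deprecation date; only then are the old column and its scaffolding removed.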

Reg Developer: A couple of issues spring to mind with this approach: First, there's the problem of duplicate data. At some point, the data from the old system will need to be migrated – merged in – with the data in the new system, which in the meantime has been accumulating new data. Some of the data in the old system may be stale.

Scott: In our book we show how to deal with this, so it isn’t a problem. Each refactoring includes the code to implement the refactoring, including data migration code, as well as the code to remove the old schema and the scaffolding code that keeps the data in sync.

Reg Developer: And the second issue: if you're deploying an updated schema with each new release, and each release takes place every couple of weeks or whatever, isn't that an awful lot of disparate schemas that pile up over time? Keeping track of which application – or which version of each application if it's a thick-client install – uses which version of the schema could end up being something of a maintenance nightmare.

Scott: In practice this doesn’t seem to be much of a problem. If you’re releasing new versions quickly then the changes are pretty small anyway. If one refactoring clobbers another one (for example, you rename column X to Y, then Y to Z), you can easily rework your transition schemas to reflect this. It isn’t perfect, but if you’re smart, you can reduce the complexity. In the book, we also talk about taking an Agile Model Driven Development (AMDD) approach where you do a bit of up-front modelling. This identifies the overall landscape pretty early in the project and reduces the risk of major refactorings.

Existing database development tools could also do with significant improvement. Over the next few years, as EDD gains in popularity, we’ll start seeing better tools, both commercial and open source.

As I said earlier, technically database refactoring is trivial; the real challenges are all cultural. The data community decided years ago that EDD is difficult to do, and now struggles to understand how it can be done. It’s going to be a huge effort to get existing data professionals to recognize and then overcome many of the false assumptions which their “thought leadership” has made over the years.
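As a hypothetical illustration of Scott's clobbered-refactoring example (rename column X to Y, then Y to Z): if no accessing system ever shipped against the intermediate column, it can be retired early and the oldest and newest names synchronised directly. In Oracle-style syntax, with invented names:

    -- Drop the first rename's scaffolding and the never-adopted column.
    DROP TRIGGER sync_account_y;

    -- Introduce z, migrate the data, and sync it straight with x.
    ALTER TABLE account ADD (z NUMBER(9,2));
    UPDATE account SET z = x;
    ALTER TABLE account DROP COLUMN y;

    CREATE OR REPLACE TRIGGER sync_account_z
    BEFORE INSERT OR UPDATE ON account
    FOR EACH ROW
    BEGIN
        IF :NEW.z IS NULL THEN
            :NEW.z := :NEW.x;
        ELSE
            :NEW.x := :NEW.z;
        END IF;
    END;

If some system did ship against the intermediate column, it needs its own transition window before it can be dropped.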

Reg Developer: Some of the refactorings described in your book may have questionable applicability, when applied to a live system. For example, “Introduce Common Format” involves applying a consistent format to all the data values in an existing table column, its motivation being to simplify external program code. Given the issues with refactoring a live schema, isn't this what an OO data access layer is best suited to doing?

Scott: To implement this refactoring you would need to choose a format already supported by the column. For example, if there are seven different formats, choose one of them, because all of the accessing systems must already be able to handle it. The next step is to cleanse the existing values to put them all into the common format. You would also need to put code in place, likely a database trigger, to ensure that any updates to the column are made in the common format. You’d leave this trigger in place during the transition period, while the accessing systems are updated to write to the column only in the common format.

Encapsulating access to a database definitely helps because it reduces coupling to your DB schema; this is also something we talk about in the book. However, what happens if some systems don’t go through your common access layer? What if you have some COBOL systems, some C++, some Java, and some APL systems which access the same database? Gut feel tells me that they’re not going to use the same OO data access layer.
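For a sense of what Introduce Common Format might look like in practice, suppose the chosen common format for a phone_number column is digits-only. The cleanse-then-enforce pair Scott describes could then be sketched as follows (hypothetical names, Oracle-style syntax; REGEXP_REPLACE is available from Oracle 10g onwards, and other databases offer equivalents):

    -- Step 1: cleanse the existing values into the common format.
    UPDATE customer
    SET phone_number = REGEXP_REPLACE(phone_number, '[^0-9]', '');

    -- Step 2: during the transition period, normalise any write that
    -- still arrives in one of the legacy formats.
    CREATE OR REPLACE TRIGGER enforce_phone_format
    BEFORE INSERT OR UPDATE ON customer
    FOR EACH ROW
    BEGIN
        :NEW.phone_number := REGEXP_REPLACE(:NEW.phone_number, '[^0-9]', '');
    END;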

Reg Developer: Many of the refactorings in your book involve applying triggers to ensure data integrity. So database refactoring differs from code refactoring, in the sense that code changes are “absolute” and replace what went before, whereas DB refactorings accumulate in layers of criss-crossing triggers. Do you see this as a problem?

Scott: This is why there is a defined transition window. At the end of the window, you need to remove the old schema and any supporting triggers. This way the patches are only there for a short period of time. If you don’t remove them then you’re going to run into problems. To be fair, how complex is it to schedule the running of a script which does the appropriate cleanup? To make things easy for our readers we included examples of such scripts for each refactoring.
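Returning to the earlier Rename Column sketch, the end-of-window cleanup script really is small (hypothetical names again):

    -- Run once the transition window has closed and every accessing
    -- system writes to full_name rather than fname.
    DROP TRIGGER sync_customer_full_name;
    ALTER TABLE customer DROP COLUMN fname;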

Reg Developer: Of course, there's also a danger that all those triggers will end up being a drag on the system (i.e. slow the database down).

Scott: Yes, you’re going to take a performance hit. But you’re already taking a data quality hit, and as the TDWI figure shows, that adds up. So you need to make a trade-off between performance and financial cost. Costs include the development team productivity lost because data professionals can’t react swiftly to changes during development, the cost of business mistakes due to poor quality data, and the additional development costs associated with the extra programming code needed to work with less-than-ideal database schemas.

And what about the performance hit taken by the applications with all of the additional workaround code because of database problems? Why is it that this is rarely mentioned?

Reg Developer: Can EDD work with a database schema shared by many projects? It's effectively decentralizing the ER design, which sounds as if you're inviting Mr Chaos to tea. (I guess this is where an effective DBA comes in, to act as a central gateway for all changes).

Scott: The book is written under the assumption that there are hundreds of other disparate systems, all of which are outside the scope of your control, accessing your database. These systems run on different platforms and have different release cycles. This is why the transition window is critical to your success.

Reg Developer: I've been a programmer in projects where the database design is handled by a separate team. Often the schema design was out of control, changing radically from one day to the next, breaking my code. I can just imagine the frustration of a programming team having to cope with a separate DB team practising EDD, and being told to “get with the times” when they complain that the shifting database keeps breaking their code. How does EDD address this issue?

Scott: It’s incredibly inefficient to have the data team working separately from the development team, and this is something that I’ve written about extensively in Agile Database Techniques and in other books. Any organization that chooses to work this way gets what they deserve.

Reg Developer: I agree up to a point. Yes, the DBA should be helping the developers from day -1 and should be part of the team; but looking at the data independently of the programmers is a great way of finding defects (such as gaps in developer understanding) well before they get built into the code.

Another issue raised by the DBAs I spoke to is that EDD shakes up the core competencies: that is, the physical database design is suddenly being done by programmers, without the necessary step of a DBA optimizing (or correcting, some might say) the design before it goes live. You effectively lose the shield of having separate logical and physical ER designs.

Scott: It’s pretty much recognized within the agile community that it’s inefficient to build teams made up of specialists. Instead, you want people who are generalizing specialists, with one or more specialities such as database development or Java programming, plus a general knowledge of software development and the domain. Generalizing specialists are far more effective than narrow specialists, so you increase your overall productivity. It’s an assumption that you need a DBA to optimize the design; you just need someone on the team with the requisite skills to do the work, and it doesn’t have to be a DBA per se. There are several performance refactorings, and they’re pretty straightforward in practice, so perhaps the real issue here is the cultural one of moving away from over-specialization towards the more skilled paradigm of generalizing specialists.

Also, is having separate logical and physical ER designs really a shield or is it simply busy work?

Reg Developer: If you want to make changes to an enterprise schema in use by multiple applications, you're likely to run into resistance in the organization. Different projects will operate according to their own testing and release schedules. Can an evolutionary design approach really work in this sort of environment?

Scott: Like I said, the cultural issues are the difficult ones. Such an organization would have to choose to succeed, but unfortunately, it’s very easy to choose to fail. Hence the $611 billion data quality problem that we currently have on our hands.

Reg Developer: To finish off, where do you see evolutionary database design heading in the near future?

Scott: Right now we’re at the beginning of the adoption curve. The techniques are in place but we need to educate and mentor people in them. This will definitely take time as we need to overcome some very serious cultural challenges within the existing data community.

Scott Ambler is a noted author and speaker on object-oriented software development, software process and the like. He currently works with IBM as Practice Leader Agile Development within the IBM Methods group. He is Canadian and still lives in Canada although he spends a lot of time consulting in the United States and Europe.

Matt Stephens is a Java developer and project leader based in Central London. He’s the co-author of Extreme Programming Refactored which objectively throws XP into a pit of rabid hamsters, Agile Development with the ICONIX Process and most recently, Use Case Driven Object Modelling with UML: Theory and Practice.