Evolutionary database design
Evolutionary Database Design (EDD, aka database refactoring) is a controversial topic in the software development world. It’s a sore point among some DBAs because it promotes the making of changes to schemas that contain live customer data, sometimes with more than one application using the same database.
Given this controversy, and the resistance to iterative techniques among many database specialists, it seemed fair to pitch some of the commonest criticisms to Scott Ambler, co-author with Pramod Sadalage, of the book Refactoring Databases: Evolutionary Database Design.
Scott Ambler at Machu Picchu.
Matt Stephens, for Reg Developer: Hi, Scott. First off, could you tell us why you feel there's a need for an evolutionary approach to database design?
Scott: First and foremost, there’s nothing special about databases. Just like evolutionary approaches to design work incredibly well for other aspects of software development they also work well for databases. As Pramod and I show in Refactoring Databases, it's actually a technically trivial thing to evolve an existing relational database schema, even when hundreds of disparate systems access it.
Unfortunately, the traditional data community assumed that evolving database schemas is a hard thing to do and as a result never thought through how to do it. Worse yet, this assumption led to other questionable assumptions, such as the need to model in detail the data aspects of a system early in the lifecycle. The end result is that many data professionals have pretty much missed out on the fundamental ideas and techniques which application developers pretty much take for granted these days. They've got a lot of catching up to do.
Reg Developer: in a sense, the database schema is a reflection of the problem domain: so once designed, it generally shouldn't need to change very much – as long as it's been designed thoughtfully, of course, and grew out of collaboration between the customer, end-users, business analysts etc. Given this, is there really a need for EDD in a software development project?
Scott: I’ll let the existing state of production databases speak to that. How often do you find databases with tables that have columns that are no longer being used, if they ever were? Or columns that are being used for several purposes because it was too hard to add new columns when they were required? Or tables with data quality problems? The Data Warehouse Institute (TDWI) estimated a few years ago that data quality problems result in a $611 billion (yes, billion) loss every year for US-based organizations. This tells me that in practice that we're not very good at getting the design right up front and/or the requirements do in fact change over time. It's incredibly clear that organizations need a safe way to fix existing database problems, and the approach presented in Refactoring Databases is the only viable one on the table. Traditional approaches to database design clearly aren't working out for us.
Reg Developer: Do you see database refactoring being applied to a fine-grained development approach such as TDD, so you'd start with a test, create a table, write another test, add a column, a constraint, etc?
Scott: Absolutely, and in fact I have an article in the May/June 2007 issue of IEEE Software which addresses this specific topic.
Reg Developer: I spoke to a few DBAs and developers prior to this interview; and the main concern that cropped up over and over again was the fact that you're putting a system into production, then making changes that require a schema update and/or data migration; and this may happen as often as every couple of weeks, depending on the release cycle.
Scott: This is why these people need to read Refactoring Databases. First, you develop and test your changes in a developers' sandbox and only when you believe that the changes are working do you promote into higher level sandboxes, including production. Developing directly against a live database is clearly an incredibly risky proposition, so don’t do that.
Second, you do in fact need a viable strategy for evolving an existing production database, as the TDWI figures show. The strategy is to have a transition window during which both the original schema and the fixed schema exist in parallel in production and the database automatically keeps everything up to date. This window may be months or even years long, and the purpose is to give the development teams responsible for the accessing systems time to update their systems and deploy them into production.
Once the systems are updated, the old schema is removed. This is basically the deprecation strategy which Sun Microsystems uses to evolve the Java Development Kit (JDK), and there’s hundreds of thousands of systems which use that. In the book, we provide solutions, including full source code, for over 60 refactorings which take this approach.
Reg Developer: A couple of issues spring to mind with this approach: First, there's the problem of duplicate data. At some point, the data from the old system will need to be migrated – merged in – with the data in the new system, which in the meantime has been accumulating new data. Some of the data in the old system may be stale.
Scott: In our book we show how to deal with this, so it isn’t a problem. Each refactoring includes the code to implement refactoring, including data migration code, as well as how to remove the old schema and scaffolding code to keep the data in sync.
Reg Developer: And the second issue: if you're deploying an updated schema with each new release, and each release takes place every couple of weeks or whatever, isn't that an awful lot of disparate schemas that pile up over time? Keeping track of which application – or which version of each application if it's a thick-client install – uses which version of the schema could end up being something of a maintenance nightmare.
Scott: In practice this doesn’t seem to be much of a problem. If you’re releasing new versions quickly then the changes are pretty small anyway. If one refactoring clobbers another one, for example you rename column X to Y then to Z, you can easily rework your transition schemas to reflect this. It isn’t perfect but if you’re smart, you can reduce the complexity. In the book, we also talk about taking an Agile Model Driven Development (AMDD) approach where you do a bit of up-front modelling. This identifies the overall landscape pretty early in the project and reduces the risk of major refactorings.
Existing database development tools could also do with significant improvement. Over the next few years as EDD gains in popularity, we’ll start seeing better tools out there both commercially and in open source.
As I said earlier, technically database refactoring is trivial and the real challenges are all cultural. The data community decided years ago that EDD is difficult to do and now struggle to understand how it can be done. It’s going to be a huge effort to get existing data professionals to recognize and then overcome many of the false assumptions which their “thought leadership” have made over the years. See this article.
Reg Developer: Some of the refactorings described in your book may have questionable applicability, when applied to a live system. For example, “Introduce Common Format” involves applying a consistent format to all the data values in an existing table column, its motivation being to simplify external program code. Given the issues with refactoring a live schema, isn't this what an OO data access layer is best suited to doing?
Scott: To implement this refactoring you would need to choose a format already supported by the column. For example, if there are 7 different formats, choose one of those formats because all of the accessing systems must already be able to handle it. The next step is to cleanse the existing values to put them all into the common format. You would also need to put code in place, likely a database trigger, which would ensure that any updates to the column are done in the common format. You’d leave this trigger in place during the transition period; while the accessing systems are updated to write only to the column in the common format.
Encapsulating access to a database definitely helps because it reduces coupling to your DB schema, this is also something we talk about in the book. However, what happens if some systems don’t go through your common access layer? What if you have some COBOL systems, some C++, some Java, and some APL systems which access the same database? Gut feel tells me that they’re not going to use the same OO data access layer.
Sponsored: Hyper-scale data management