Evolutionary database design

'Technically trivial'

Build a business case: developing custom apps

Evolutionary Database Design (EDD, aka database refactoring) is a controversial topic in the software development world. It’s a sore point among some DBAs because it promotes the making of changes to schemas that contain live customer data, sometimes with more than one application using the same database.

Given this controversy, and the resistance to iterative techniques among many database specialists, it seemed fair to pitch some of the commonest criticisms to Scott Ambler, co-author with Pramod Sadalage, of the book Refactoring Databases: Evolutionary Database Design.

Headshot of Scott Ambler.

Scott Ambler at Machu Picchu.

Matt Stephens, for Reg Developer: Hi, Scott. First off, could you tell us why you feel there's a need for an evolutionary approach to database design?

Scott: First and foremost, there’s nothing special about databases. Just like evolutionary approaches to design work incredibly well for other aspects of software development they also work well for databases. As Pramod and I show in Refactoring Databases, it's actually a technically trivial thing to evolve an existing relational database schema, even when hundreds of disparate systems access it.

Unfortunately, the traditional data community assumed that evolving database schemas is a hard thing to do and as a result never thought through how to do it. Worse yet, this assumption led to other questionable assumptions, such as the need to model in detail the data aspects of a system early in the lifecycle. The end result is that many data professionals have pretty much missed out on the fundamental ideas and techniques which application developers pretty much take for granted these days. They've got a lot of catching up to do.

Reg Developer: in a sense, the database schema is a reflection of the problem domain: so once designed, it generally shouldn't need to change very much – as long as it's been designed thoughtfully, of course, and grew out of collaboration between the customer, end-users, business analysts etc. Given this, is there really a need for EDD in a software development project?

Scott: I’ll let the existing state of production databases speak to that. How often do you find databases with tables that have columns that are no longer being used, if they ever were? Or columns that are being used for several purposes because it was too hard to add new columns when they were required? Or tables with data quality problems? The Data Warehouse Institute (TDWI) estimated a few years ago that data quality problems result in a $611 billion (yes, billion) loss every year for US-based organizations. This tells me that in practice that we're not very good at getting the design right up front and/or the requirements do in fact change over time. It's incredibly clear that organizations need a safe way to fix existing database problems, and the approach presented in Refactoring Databases is the only viable one on the table. Traditional approaches to database design clearly aren't working out for us.

Reg Developer: Do you see database refactoring being applied to a fine-grained development approach such as TDD, so you'd start with a test, create a table, write another test, add a column, a constraint, etc?

Scott: Absolutely, and in fact I have an article in the May/June 2007 issue of IEEE Software which addresses this specific topic.

Reg Developer: I spoke to a few DBAs and developers prior to this interview; and the main concern that cropped up over and over again was the fact that you're putting a system into production, then making changes that require a schema update and/or data migration; and this may happen as often as every couple of weeks, depending on the release cycle.

Scott: This is why these people need to read Refactoring Databases. First, you develop and test your changes in a developers' sandbox and only when you believe that the changes are working do you promote into higher level sandboxes, including production. Developing directly against a live database is clearly an incredibly risky proposition, so don’t do that.

Second, you do in fact need a viable strategy for evolving an existing production database, as the TDWI figures show. The strategy is to have a transition window during which both the original schema and the fixed schema exist in parallel in production and the database automatically keeps everything up to date. This window may be months or even years long, and the purpose is to give the development teams responsible for the accessing systems time to update their systems and deploy them into production.

Once the systems are updated, the old schema is removed. This is basically the deprecation strategy which Sun Microsystems uses to evolve the Java Development Kit (JDK), and there’s hundreds of thousands of systems which use that. In the book, we provide solutions, including full source code, for over 60 refactorings which take this approach.

Reg Developer: A couple of issues spring to mind with this approach: First, there's the problem of duplicate data. At some point, the data from the old system will need to be migrated – merged in – with the data in the new system, which in the meantime has been accumulating new data. Some of the data in the old system may be stale.

Scott: In our book we show how to deal with this, so it isn’t a problem. Each refactoring includes the code to implement refactoring, including data migration code, as well as how to remove the old schema and scaffolding code to keep the data in sync.

Reg Developer: And the second issue: if you're deploying an updated schema with each new release, and each release takes place every couple of weeks or whatever, isn't that an awful lot of disparate schemas that pile up over time? Keeping track of which application – or which version of each application if it's a thick-client install – uses which version of the schema could end up being something of a maintenance nightmare.

Scott: In practice this doesn’t seem to be much of a problem. If you’re releasing new versions quickly then the changes are pretty small anyway. If one refactoring clobbers another one, for example you rename column X to Y then to Z, you can easily rework your transition schemas to reflect this. It isn’t perfect but if you’re smart, you can reduce the complexity. In the book, we also talk about taking an Agile Model Driven Development (AMDD) approach where you do a bit of up-front modelling. This identifies the overall landscape pretty early in the project and reduces the risk of major refactorings.

Existing database development tools could also do with significant improvement. Over the next few years as EDD gains in popularity, we’ll start seeing better tools out there both commercially and in open source.

As I said earlier, technically database refactoring is trivial and the real challenges are all cultural. The data community decided years ago that EDD is difficult to do and now struggle to understand how it can be done. It’s going to be a huge effort to get existing data professionals to recognize and then overcome many of the false assumptions which their “thought leadership” have made over the years. See this article.

Reg Developer: Some of the refactorings described in your book may have questionable applicability, when applied to a live system. For example, “Introduce Common Format” involves applying a consistent format to all the data values in an existing table column, its motivation being to simplify external program code. Given the issues with refactoring a live schema, isn't this what an OO data access layer is best suited to doing?

Scott: To implement this refactoring you would need to choose a format already supported by the column. For example, if there are 7 different formats, choose one of those formats because all of the accessing systems must already be able to handle it. The next step is to cleanse the existing values to put them all into the common format. You would also need to put code in place, likely a database trigger, which would ensure that any updates to the column are done in the common format. You’d leave this trigger in place during the transition period; while the accessing systems are updated to write only to the column in the common format.

Encapsulating access to a database definitely helps because it reduces coupling to your DB schema, this is also something we talk about in the book. However, what happens if some systems don’t go through your common access layer? What if you have some COBOL systems, some C++, some Java, and some APL systems which access the same database? Gut feel tells me that they’re not going to use the same OO data access layer.

Boost IT visibility and business value

More from The Register

next story
The Return of BSOD: Does ANYONE trust Microsoft patches?
Sysadmins, you're either fighting fires or seen as incompetents now
Microsoft refuses to nip 'Windows 9' unzip lip slip
Look at the shiny Windows 8.1, why can't you people talk about 8.1, sobs an exec somewhere
Munich considers dumping Linux for ... GULP ... Windows!
Give a penguinista a hug, the Outlook's not good for open source's poster child
Intel's Raspberry Pi rival Galileo can now run Windows
Behold the Internet of Things. Wintel Things
Linux Foundation says many Linux admins and engineers are certifiable
Floats exam program to help IT employers lock up talent
Microsoft cries UNINSTALL in the wake of Blue Screens of Death™
Cache crash causes contained choloric calamity
Eat up Martha! Microsoft slings handwriting recog into OneNote on Android
Freehand input on non-Windows kit for the first time
prev story


Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
7 Elements of Radically Simple OS Migration
Avoid the typical headaches of OS migration during your next project by learning about 7 elements of radically simple OS migration.
BYOD's dark side: Data protection
An endpoint data protection solution that adds value to the user and the organization so it can protect itself from data loss as well as leverage corporate data.
Consolidation: The Foundation for IT Business Transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?