Evolutionary database design

'Technically trivial'

SANS - Survey on application security programs

Evolutionary Database Design (EDD, aka database refactoring) is a controversial topic in the software development world. It’s a sore point among some DBAs because it promotes the making of changes to schemas that contain live customer data, sometimes with more than one application using the same database.

Given this controversy, and the resistance to iterative techniques among many database specialists, it seemed fair to pitch some of the commonest criticisms to Scott Ambler, co-author with Pramod Sadalage, of the book Refactoring Databases: Evolutionary Database Design.

Headshot of Scott Ambler.

Scott Ambler at Machu Picchu.

Matt Stephens, for Reg Developer: Hi, Scott. First off, could you tell us why you feel there's a need for an evolutionary approach to database design?

Scott: First and foremost, there’s nothing special about databases. Just like evolutionary approaches to design work incredibly well for other aspects of software development they also work well for databases. As Pramod and I show in Refactoring Databases, it's actually a technically trivial thing to evolve an existing relational database schema, even when hundreds of disparate systems access it.

Unfortunately, the traditional data community assumed that evolving database schemas is a hard thing to do and as a result never thought through how to do it. Worse yet, this assumption led to other questionable assumptions, such as the need to model in detail the data aspects of a system early in the lifecycle. The end result is that many data professionals have pretty much missed out on the fundamental ideas and techniques which application developers pretty much take for granted these days. They've got a lot of catching up to do.

Reg Developer: in a sense, the database schema is a reflection of the problem domain: so once designed, it generally shouldn't need to change very much – as long as it's been designed thoughtfully, of course, and grew out of collaboration between the customer, end-users, business analysts etc. Given this, is there really a need for EDD in a software development project?

Scott: I’ll let the existing state of production databases speak to that. How often do you find databases with tables that have columns that are no longer being used, if they ever were? Or columns that are being used for several purposes because it was too hard to add new columns when they were required? Or tables with data quality problems? The Data Warehouse Institute (TDWI) estimated a few years ago that data quality problems result in a $611 billion (yes, billion) loss every year for US-based organizations. This tells me that in practice that we're not very good at getting the design right up front and/or the requirements do in fact change over time. It's incredibly clear that organizations need a safe way to fix existing database problems, and the approach presented in Refactoring Databases is the only viable one on the table. Traditional approaches to database design clearly aren't working out for us.

Reg Developer: Do you see database refactoring being applied to a fine-grained development approach such as TDD, so you'd start with a test, create a table, write another test, add a column, a constraint, etc?

Scott: Absolutely, and in fact I have an article in the May/June 2007 issue of IEEE Software which addresses this specific topic.

Reg Developer: I spoke to a few DBAs and developers prior to this interview; and the main concern that cropped up over and over again was the fact that you're putting a system into production, then making changes that require a schema update and/or data migration; and this may happen as often as every couple of weeks, depending on the release cycle.

Scott: This is why these people need to read Refactoring Databases. First, you develop and test your changes in a developers' sandbox and only when you believe that the changes are working do you promote into higher level sandboxes, including production. Developing directly against a live database is clearly an incredibly risky proposition, so don’t do that.

Second, you do in fact need a viable strategy for evolving an existing production database, as the TDWI figures show. The strategy is to have a transition window during which both the original schema and the fixed schema exist in parallel in production and the database automatically keeps everything up to date. This window may be months or even years long, and the purpose is to give the development teams responsible for the accessing systems time to update their systems and deploy them into production.

Once the systems are updated, the old schema is removed. This is basically the deprecation strategy which Sun Microsystems uses to evolve the Java Development Kit (JDK), and there’s hundreds of thousands of systems which use that. In the book, we provide solutions, including full source code, for over 60 refactorings which take this approach.

Reg Developer: A couple of issues spring to mind with this approach: First, there's the problem of duplicate data. At some point, the data from the old system will need to be migrated – merged in – with the data in the new system, which in the meantime has been accumulating new data. Some of the data in the old system may be stale.

Scott: In our book we show how to deal with this, so it isn’t a problem. Each refactoring includes the code to implement refactoring, including data migration code, as well as how to remove the old schema and scaffolding code to keep the data in sync.

Reg Developer: And the second issue: if you're deploying an updated schema with each new release, and each release takes place every couple of weeks or whatever, isn't that an awful lot of disparate schemas that pile up over time? Keeping track of which application – or which version of each application if it's a thick-client install – uses which version of the schema could end up being something of a maintenance nightmare.

Scott: In practice this doesn’t seem to be much of a problem. If you’re releasing new versions quickly then the changes are pretty small anyway. If one refactoring clobbers another one, for example you rename column X to Y then to Z, you can easily rework your transition schemas to reflect this. It isn’t perfect but if you’re smart, you can reduce the complexity. In the book, we also talk about taking an Agile Model Driven Development (AMDD) approach where you do a bit of up-front modelling. This identifies the overall landscape pretty early in the project and reduces the risk of major refactorings.

Existing database development tools could also do with significant improvement. Over the next few years as EDD gains in popularity, we’ll start seeing better tools out there both commercially and in open source.

As I said earlier, technically database refactoring is trivial and the real challenges are all cultural. The data community decided years ago that EDD is difficult to do and now struggle to understand how it can be done. It’s going to be a huge effort to get existing data professionals to recognize and then overcome many of the false assumptions which their “thought leadership” have made over the years. See this article.

Reg Developer: Some of the refactorings described in your book may have questionable applicability, when applied to a live system. For example, “Introduce Common Format” involves applying a consistent format to all the data values in an existing table column, its motivation being to simplify external program code. Given the issues with refactoring a live schema, isn't this what an OO data access layer is best suited to doing?

Scott: To implement this refactoring you would need to choose a format already supported by the column. For example, if there are 7 different formats, choose one of those formats because all of the accessing systems must already be able to handle it. The next step is to cleanse the existing values to put them all into the common format. You would also need to put code in place, likely a database trigger, which would ensure that any updates to the column are done in the common format. You’d leave this trigger in place during the transition period; while the accessing systems are updated to write only to the column in the common format.

Encapsulating access to a database definitely helps because it reduces coupling to your DB schema, this is also something we talk about in the book. However, what happens if some systems don’t go through your common access layer? What if you have some COBOL systems, some C++, some Java, and some APL systems which access the same database? Gut feel tells me that they’re not going to use the same OO data access layer.

3 Big data security analytics techniques

More from The Register

next story
OpenBSD founder wants to bin buggy OpenSSL library, launches fork
One Heartbleed vuln was too many for Theo de Raadt
Got Windows 8.1 Update yet? Get ready for YET ANOTHER ONE – rumor
Leaker claims big release due this fall as Microsoft herds us into the CLOUD
This time it's 'Personal': new Office 365 sub covers just two devices
Redmond also brings Office into Google's back yard
Ubuntu 14.04 LTS: Great changes, but sssh don't mention the...
Why HELLO Amazon! You weren't here last time
Patch iOS, OS X now: PDFs, JPEGs, URLs, web pages can pwn your kit
Plus: iThings and desktops at risk of NEW SSL attack flaw
Next Windows obsolescence panic is 450 days from … NOW!
The clock is ticking louder for Windows Server 2003 R2 users
Batten down the hatches, Ubuntu 14.04 LTS due in TWO DAYS
Admins dab straining server brows in advance of Trusty Tahr's long-term support landing
Red Hat to ship RHEL 7 release candidate with a taste of container tech
Grab 'near-final' version of next Enterprise Linux next week
Apple inaugurates free OS X beta program for world+dog
Prerelease software now open to anyone, not just developers – as long as you keep quiet
prev story


Securing web applications made simple and scalable
In this whitepaper learn how automated security testing can provide a simple and scalable way to protect your web applications.
3 Big data security analytics techniques
Applying these Big Data security analytics techniques can help you make your business safer by detecting attacks early, before significant damage is done.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Mainstay ROI - Does application security pay?
In this whitepaper learn how you and your enterprise might benefit from better software security.
Combat fraud and increase customer satisfaction
Based on their experience using HP ArcSight Enterprise Security Manager for IT security operations, Finansbank moved to HP ArcSight ESM for fraud management.