Multivalued datatypes considered harmful

Original URL: https://www.theregister.com/2006/07/18/multivalued_datatypes_access/

How dangerous can a data type be?

Posted in Software, 18th July 2006 11:10 GMT

Increasingly developers are required to write applications that interact with database engines – typically Oracle, SQL Server, DB2, MySQL or Access. In many ways the database engine is pretty much immaterial; no matter what the flavour it’s still simply a matter of tables, columns, rows and a variety of data types; text, memo, BLOB, numeric, whatever. However if you work with Access, a completely new data type is on the horizon for 2007 – multi-valued. Unfortunately, this isn’t just-another-data-type; this is a whole different ball game and a dangerous one – more like rollerball than baseball.

As the name suggests, a multi-valued field is one in which you can place more than one value. So, imagine that you design a table to store information about, say, customers.

CUSTOMER

CustID	FName	LName	Hobbies
1	Fred	Smith	Fishing, Rollerball, Hockey
2	Sally	Jones	Sailing
3	Brian	Wilson	Gliding, Sailing, Singing, Hockey

The hobbies column is a multi-valued field. In some ways this is very neat because we have created a many-to-many join (many different customers can have many different hobbies) using a single table. The alternative, and traditional way, is to use three tables.

CUSTOMER

CustID	FName	LName
1	Fred	Smith
2	Sally	Jones
3	Brian	Wilson

HOBBY

HobbyID	Hobbies
1	Fishing,
2	Rollerball
3	Hockey
4	Sailing
5	Gliding
6	Singing

CUSTOMER/HOBBY

CustID	HobbyID
1	1
1	2
1	3
2	4
3	5
3	4
3	6
3	3

Clearly, the solution using three tables is more complex and less intuitive; so what is wrong with the multi-valued data type solution? Well, in his initial set of rules defining relational databases, Ted Codd (the originator of the relational model) forbad their use.

Rule 2, the guaranteed access rule.

Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.

If we use the table name (Customer), Primary key value (1) and column name (Hobbies) we don’t get a single atomic value (such as ‘Fishing’); we get multiple pieces of data (Fishing, Rollerball, Hockey).

Now this is a killer argument if you are a database freak like me (“If Ted Codd forbad it, I want no further truck with your multi-valued data types.”) but I quite understand that, if you are an application developer, the finer points of relational database theory often sound like just so much academic nonsense. If a new feature makes life easier, who cares if it happens to break some arbitrary rule written over 20 years ago?

Fair enough. So let’s look at an intensely practical reason why multi-valued fields are so bad. We query databases using SQL. The design of SQL is based entirely on the assumption that each column contains atomic values. If we run a normal SQL query against our single table solution:

SELECT FName FROM CUSTOMER
WHERE Hobby = “Rollerball”

It will return zero rows; despite the fact that one of our customers plays rollerball, because there is no row with a field just containing “rollerball”.

If you want a challenge, try to construct the SQL necessary to find the names of the customers who both glide and sail. It is, of course, possible, but the solution is more complex than extracting the same information from the three table structure.

And if you are not convinced that we are plunging into deep water here, imagine that I store foreign key values (in the example shown, into a PRODUCT table) in a multi valued field:

ORDER

OrderID	ODate	Products
1	1/1/07	1,5,5,5,5,5,67,434,434,5654
2	1/1/07	45,67,454,454,454,65556
3	2/1/07	2,454,5677

Assuming that the PRODUCT table contains price information, try writing a query that finds the total value of each order.

One of the huge advantages of the relational model of data is that its behaviour under querying is completely predictable. Multi valued fields erode the certainty that a given query will return the canonically correct set of data as an answer.

Does the Access development team really know as little about relational database theory as this bizarre addition implies? At TechEd in Boston last month, I had a long talk with Suraj Poozhiyil, the Access Program Manager. The good news is that Suraj did, at least, convince me that the team was fully aware of the implications of introducing this new data type. So why has Microsoft done it?

According to Suraj, there were two main reasons.

The first is that Microsoft is keen for Access to be compatible with SharePoint (which already supports multi-value fields). Fine, but one has to ask whether this is for our benefit or Microsoft’s.

The second reason is that the company does seem to be genuinely interested in making the product easier for power users to drive. The development team feels that power users find the creation of many-to-many joins using three tables conceptually very difficult and will find multi-valued data types a much easier solution. Having taught Access to such users since Access 1.0 I cannot help but agree with this. Access power users will find this solution easier.

So we have a classic problem. Should we sacrifice accuracy for ease of use? Let’s reserve judgement for a moment because, in practice, we don’t have to do it in this case.

It would be possible for Microsoft to be very clever here. It could present the power user with what appears to be a multi-valued data type in the user-interface but, under the covers, it could store the data in the classic three table solution. I put this to Suraj and he agreed. “In fact”, he said “that is what we have done.”

And it is. Well, very nearly; tantalisingly nearly. But sadly, not quite. So let’s look at the precise way in which Microsoft has implemented multi-valued fields and the implications that carries for developers.

In Access 2007 users are presented with a wizard driven interface that allows them to construct and use multi-valued fields. If you were then to address the resulting database programmatically, you would see three tables. Fine so far. The really weird part is that if you actually use the GUI to look at the tables you see only one table. And if you use the query builder to create SQL to run against the single table that it sees, you find exactly the non-canonical behaviour that caused multi-valued data types to be reviled in the first place. Sigh.

Suraj tells me that ideally the development team would like the three table solution to appear in the GUI but that it won’t be possible to do this for the version currently under development (Access 2007). It may well appear in later versions.

My feeling is that this is a mistake and the ‘feature’ should either be fully implemented or not at all. However, both Suraj and I agree wholeheartedly that developers do not need to use multi-valued fields. People who understand databases already have a good way of implementing many to many relationships and will gain no benefit from multi-valued fields.

So, my clear and certain advice to developers is not to use multi-valued fields. They have nothing to offer us except potential pain.

Multivalued datatypes considered harmful

Related stories