The database abstraction framework strikes back

Original URL: https://www.theregister.com/2006/11/22/cplusplus_database_framework/

Part 2: C++ Generic Coding

Posted in Software, 22nd November 2006 10:33 GMT

In my last article, I looked at one of the differences between the C++ and Java communities: the availability of application development frameworks that have a profound effect on programmer productivity. I mentioned specifically the Java example of Hibernate and tried to identify reasons why the Java community is more innovative with this type of code reuse.

The resulting comments were interesting, particularly the point about the different histories of the two languages. C++ evolved from C, adding object-oriented and later generic paradigms; the goal was always to allow exploitation of new modelling techniques, not to find patterns in application development that could be exploited. It could be argued that the contrary applies to Java. Additionally, as the Java syntax is more limited, there are fewer ways to solve any given problem. Consequently, there is more commonality in solutions to problem classes with Java and, as a result, the process of factorising is easier.

Some people make the point that both Java and C++ have their place and that, for any given problem, your choice is based on which is more appropriate to the constraints you're operating under. This is true; however, there is a wide range of problems where no fundamental driver forces one choice over the other. In web servers, for example, Java wins out because of the frameworks that exist to support the application developer. How has Java colonised this space so effectively? If the same frameworks existed in C++ then wouldn't C++ be an equally effective technology choice? As a C++ developer and enthusiast, I don't want to give up this ground so easily!

Finally, there's the argument that it simply isn't possible to implement a tool such as Hibernate without language support such as Reflection. This is where the rest of this article comes in; let's see if we can find a design for such a tool in C++. I think the ideas transform pretty naturally and we can leverage the support in C++ for generic programming to compensate for the absence of Reflection. In fact, just to prove the point I implemented a version and the source is available on SourceForge.net here if anyone wants to play with it...

Hibernate, sort-of, for C++

So, starting from scratch, what's needed to provide an abstraction layer that presents an object oriented interface to a relational data model? Well, we already know this is possible as Hibernate does it for Java, so it's simply a question of providing these capabilities in C++. However, I'm not going to assume prior knowledge of Hibernate and for the sake of respectability, I'm not going to copy blindly what those good people have done before us.

To aid the discussion, Entity Relationship vocabulary is used when referring to the relational domain and Object Oriented vocabulary to refer to the C++ interface we want to provide. I'm also going to work with the simple and classic example of a salesperson and his/her customers. In this relational data model, I have two entities, the salesperson and the customer, and there is a one-to-many relationship that associates one salesperson to his/her many customers (see Figure 1).

Figure 1: Diagram showing the Entity-Relationship Model for customer and salesperson.

So what would a reasonable object-oriented view of this data be? Well, we can imagine that there will be an object representing each entity, one each for salesperson and customer. A second question is more difficult: how do we manage the relationships? We want to be able to ask a salesperson object "who are your customers?" even though in the relational model the question is more "which customers are associated with a given salesperson?"

Before any of that, though, we have to ask: how do we access the data model to generate the code? In theory, the data model is expressed in SQL and we can parse that to generate C++, but this has some issues. Firstly, how do we know which relationships exist? We could, in theory, interpret the constraints on each table, but this may be impractical as the constraints may not be expressed in the same SQL file as the tables.

Furthermore, in some cases the constraints may simply not be expressed in SQL. Secondly, we need a mapping from SQL types to C++ types. We could hard code the C++ type that corresponds with each SQL type, but this isn't very flexible, especially when one SQL type could be used with several C++ types (e.g. an SQL Number could be int, unsigned, long, float, double, etc.). We would also like to allow developers to use their own user-defined types, to maximise the flexibility of our framework.
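To make the type-mapping point concrete, here is a minimal sketch of one way the choice could be left open to the developer in C++ itself. It is not taken from the SourceForge project: the SqlType trait, the Money type, the columnDefinition helper and the particular SQL type names are all hypothetical, chosen only for illustration. The idea is that each C++ type declares the SQL column type it is stored as, and a user-defined type joins in by adding one specialisation.

```cpp
#include <cassert>
#include <string>

// Hypothetical trait: names the SQL column type used to store a C++ type.
// The primary template is left undefined, so an unmapped type is a
// compile-time error rather than a silent default.
template <typename T> struct SqlType;

template <> struct SqlType<int>         { static const char* name() { return "INTEGER"; } };
template <> struct SqlType<unsigned>    { static const char* name() { return "INTEGER UNSIGNED"; } };
template <> struct SqlType<double>      { static const char* name() { return "DOUBLE"; } };
template <> struct SqlType<std::string> { static const char* name() { return "VARCHAR(255)"; } };

// A user-defined type (assumed for this example) opts in the same way.
struct Money { long cents; };
template <> struct SqlType<Money>       { static const char* name() { return "DECIMAL(12,2)"; } };

// Builds the column part of a CREATE TABLE statement for one attribute.
template <typename T>
std::string columnDefinition(const std::string& column) {
    assert(!column.empty());  // a column needs a name
    return column + " " + SqlType<T>::name();
}
```

With these definitions, columnDefinition<Money>("credit_limit") yields "credit_limit DECIMAL(12,2)". Several C++ types can share one SQL type, which is the direction of flexibility the paragraph above asks for.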

For this reason, it makes sense to start with a model description that describes the entities and their relationships and from which we'll generate both the SQL and the C++ object oriented interface. In this file, we will define each of the entities, their attributes and the relationships they participate in. It will also define both the C++ and SQL types for each attribute. We want to keep the core discussion focused here; but there is clearly room in such a description to separate the relational and the object oriented aspects, to support indexes, constraints, how primary keys are handled, what namespace code is generated in, etc.
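As a rough illustration of that idea, the sketch below holds a model description in memory and generates both a SQL table definition and the outline of a C++ class from it. The real tool reads a description file whose format is not shown in this article, so the Attribute, Relationship and Entity structures, the two generator functions and the column types here are assumptions made for the example only. The sketch also ignores how a relationship is stored on the SQL side (for instance, as a foreign key on the customer table).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory form of the model description.
struct Attribute    { std::string name; std::string cppType; std::string sqlType; };
struct Relationship { std::string name; std::string target; };  // one-to-many
struct Entity {
    std::string name;
    std::vector<Attribute> attributes;
    std::vector<Relationship> relationships;
};

// Generates the SQL side: one table per entity, one column per attribute.
std::string generateSql(const Entity& e) {
    assert(!e.attributes.empty());  // a table needs at least one column
    std::string sql = "CREATE TABLE " + e.name + " (";
    for (std::size_t i = 0; i < e.attributes.size(); ++i) {
        if (i > 0) sql += ", ";
        sql += e.attributes[i].name + " " + e.attributes[i].sqlType;
    }
    return sql + ");";
}

// Generates the C++ side: an accessor per attribute and, for each
// relationship, a function returning a list of the related objects.
std::string generateHeader(const Entity& e) {
    std::string out = "class " + e.name + " {\npublic:\n";
    for (const Attribute& a : e.attributes)
        out += "    " + a.cppType + " " + a.name + "() const;\n";
    for (const Relationship& r : e.relationships)
        out += "    std::vector<" + r.target + "*> " + r.name + "() const;\n";
    return out + "};\n";
}

// The salesperson entity from Figure 1, with assumed attributes.
Entity salespersonModel() {
    return Entity{"salesperson",
                  {{"id", "int", "INTEGER"}, {"name", "std::string", "VARCHAR(64)"}},
                  {{"customers", "customer"}}};
}
```

Calling generateSql(salespersonModel()) gives "CREATE TABLE salesperson (id INTEGER, name VARCHAR(64));", while generateHeader produces a class outline with id(), name() and customers() members.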

From this file, we have enough information to generate our SQL schema and the object-oriented interface. Each entity will have a class file; each attribute will have its accessors; and each relationship will become a function that returns a list of objects. The code generator that turns all this information into SQL and C++ code is included in the package on SourceForge.net here. A link to the generated SQL schema is here, and to the generated Dbi object header code here.

OK, so now we've done the simple part: we've generated a C++ class hierarchy that will allow us to access the relational data. However, we haven't answered such questions as: how do we retrieve these objects from the database; how do we update the values of attributes in objects; how do we insert new objects and delete old ones; and how do we manage relationships between objects? The following sections describe how this could all work, and the generated code from the example is included at the end for the practical minded among us.

So, let's start with the insert and update; the easiest approach is to have a user create a new object, set its attributes, and tell us that s/he wants to store it in the database.

To simplify lifetime issues, we want these objects created on the heap, as this way the object can't go out of scope and be destructed while references to the objects exist in other places in the framework. For this reason, the generated constructors are declared private and object creation is by a factory method.
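A minimal sketch of that pattern follows. The class and member names are illustrative rather than the generated ones, and returning a std::shared_ptr is my assumption: it is one possible way to let several parts of a framework hold the same object, not necessarily what the project does.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

class Salesperson {
public:
    // Factory method: the only way to obtain an instance, so every
    // Salesperson lives on the heap. It is also a natural place to
    // check arguments before an object exists.
    static std::shared_ptr<Salesperson> create(std::string name) {
        assert(!name.empty());
        return std::shared_ptr<Salesperson>(new Salesperson(std::move(name)));
    }

    const std::string& name() const { return name_; }
    void setName(std::string name) { name_ = std::move(name); }

    // No copies: there is exactly one object per database row.
    Salesperson(const Salesperson&) = delete;
    Salesperson& operator=(const Salesperson&) = delete;

private:
    // Private constructor: "Salesperson s("Alice");" does not compile
    // outside the class.
    explicit Salesperson(std::string name) : name_(std::move(name)) {}

    std::string name_;
};

// Client code cannot construct or copy the class directly.
static_assert(!std::is_constructible<Salesperson, std::string>::value,
              "construction must go through Salesperson::create");
static_assert(!std::is_copy_constructible<Salesperson>::value,
              "Salesperson objects are not copyable");
```

Client code then writes auto alice = Salesperson::create("Alice"); and works with the object through the returned pointer.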

For the "store in database" behaviour, we are going to have a "store" member function that overrides a virtual function in the base class. The implementation of this store function will be the execution of an SQL query generated from the model description file, by the same class generator used above. Any resulting errors are wrapped in an exception and propagated.
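The shape of this could be sketched as follows. Everything here is hypothetical: the Connection class is a stand-in that records statements instead of talking to MySQL, and the names Persistent, DbError and Customer are mine. The constructor is public only to keep the example short, and a real implementation would use escaping or bound parameters instead of pasting values into the SQL text.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a database connection: logs statements, can simulate a failure.
class Connection {
public:
    void execute(const std::string& sql) {
        if (failNext) { failNext = false; throw std::runtime_error("connection lost"); }
        log.push_back(sql);
    }
    std::vector<std::string> log;
    bool failNext = false;
};

// Framework-level exception that wraps whatever the database layer reports.
class DbError : public std::runtime_error {
public:
    using std::runtime_error::runtime_error;
};

// Base class of every generated entity class.
class Persistent {
public:
    virtual ~Persistent() = default;
    virtual void store(Connection& db) = 0;
};

class Customer : public Persistent {
public:
    Customer(int id, std::string name) : id_(id), name_(std::move(name)) {}
    void setName(std::string name) { name_ = std::move(name); }

    // The first call inserts the row; later calls update it.
    void store(Connection& db) override {
        assert(id_ > 0);  // assumed convention: ids are positive
        const std::string sql = stored_
            ? "UPDATE customer SET name = '" + name_ + "' WHERE id = " + std::to_string(id_)
            : "INSERT INTO customer (id, name) VALUES (" + std::to_string(id_) + ", '" + name_ + "')";
        try {
            db.execute(sql);
        } catch (const std::exception& e) {
            throw DbError("store(customer) failed: " + std::string(e.what()));
        }
        stored_ = true;
    }

private:
    int id_;
    std::string name_;
    bool stored_ = false;
};
```

Because store() is virtual, framework code can hold a list of Persistent pointers and store each one without knowing its concrete type.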

Another option would be to use templates and generic programming - more interesting and much cooler - but here we are out to prove a concept, so simple is good. In any event, the speed gains made by statically bound functions as opposed to virtual calls are going to be hidden by the cost of network round trips to the database. We get some complexity because an object that is not yet stored may have been added into relationships; but you can see from the implementations below that this is manageable.

Retrieving objects is another challenge. A simple function that gets an object from a database is one option, but we also want to be able to retrieve a set of objects. Similarly, for delete, a common operation is to delete all the objects corresponding to some criteria so we'll want to provide an algorithm that can iterate over a list of objects calling delete where appropriate. An added complication is that when we delete an object we also need to remove that object from all the relationships it participates in.
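The two algorithms mentioned here might look something like the sketch below, which works on objects already loaded into memory. The names ObjectList, selectWhere and deleteWhere are hypothetical, and remove() is only a placeholder for the generated code that would issue the SQL DELETE and detach the object from its relationships.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Placeholder entity class for the example.
class Customer {
public:
    Customer(std::string name, std::string city)
        : name_(std::move(name)), city_(std::move(city)) {}
    const std::string& name() const { return name_; }
    const std::string& city() const { return city_; }
    bool removed() const { return removed_; }
    void remove() { removed_ = true; }  // stands in for DELETE plus relationship cleanup
private:
    std::string name_, city_;
    bool removed_ = false;
};

template <typename T>
using ObjectList = std::vector<std::shared_ptr<T>>;

// Returns every object for which matches(object) is true.
template <typename T, typename Pred>
ObjectList<T> selectWhere(const ObjectList<T>& all, Pred matches) {
    ObjectList<T> result;
    for (const auto& obj : all)
        if (matches(*obj)) result.push_back(obj);
    return result;
}

// Calls remove() on every matching object, drops it from the list,
// and returns how many objects were removed.
template <typename T, typename Pred>
std::size_t deleteWhere(ObjectList<T>& all, Pred matches) {
    ObjectList<T> kept;
    std::size_t removed = 0;
    for (const auto& obj : all) {
        assert(obj);  // the list is expected to hold no null pointers
        if (matches(*obj)) { obj->remove(); ++removed; }
        else kept.push_back(obj);
    }
    all.swap(kept);
    return removed;
}
```

A caller supplies the criterion as a function object, for example a lambda testing c.city() == "Leeds". In a real tool the same criterion would more likely be turned into a WHERE clause so that filtering happens in the database.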

Managing the relations of an object is a little harder; this is a key difference between an object-oriented model and a relational model. In the relational model for a one-to-many relation, the relation is normally expressed on the many side. In an object model the many side is often the child of a relation: if a salesperson is associated with many customers then we expect the salesperson object to contain a list of his customers.

There are a number of approaches; the easiest is to express the relation in only one object, either on the salesperson object or on the customer object, but not both. This is convenient and easier to implement, but eventually we want both objects to be able to access their associates easily.
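For comparison, here is a hedged sketch of what expressing the relation on both objects could involve. It is not the generated code: the class layout is an assumption, the pointers are non-owning, and object lifetime is left to the caller. All changes go through a single function, Customer::setSalesperson, so the two sides cannot drift apart.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

class Salesperson;

class Customer {
public:
    explicit Customer(std::string name) : name_(std::move(name)) {}
    ~Customer() { setSalesperson(nullptr); }  // leave the relationship on destruction
    Customer(const Customer&) = delete;
    Customer& operator=(const Customer&) = delete;

    const std::string& name() const { return name_; }
    Salesperson* salesperson() const { return salesperson_; }
    void setSalesperson(Salesperson* s);  // updates both sides of the relation

private:
    friend class Salesperson;
    std::string name_;
    Salesperson* salesperson_ = nullptr;  // the "many" side points at the "one"
};

class Salesperson {
public:
    explicit Salesperson(std::string name) : name_(std::move(name)) {}
    ~Salesperson() {
        for (Customer* c : customers_) c->salesperson_ = nullptr;  // orphan, don't delete
    }
    Salesperson(const Salesperson&) = delete;
    Salesperson& operator=(const Salesperson&) = delete;

    const std::string& name() const { return name_; }
    // "Who are your customers?"
    const std::vector<Customer*>& customers() const { return customers_; }

private:
    friend class Customer;
    std::string name_;
    std::vector<Customer*> customers_;
};

inline void Customer::setSalesperson(Salesperson* s) {
    if (salesperson_ == s) return;
    if (salesperson_) {  // detach from the previous salesperson's list
        auto& list = salesperson_->customers_;
        list.erase(std::remove(list.begin(), list.end(), this), list.end());
    }
    salesperson_ = s;
    if (s) s->customers_.push_back(this);
    assert(!s || s->customers_.back() == this);  // both sides now agree
}
```

Reassigning a customer from one salesperson to another removes it from the first list and adds it to the second in one call. Deleting either object leaves the other in a consistent state.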

The (potentially) finished framework starts here!

So, here's the code we've generated, complete with a makefile and a vc project file. If you're not convinced, then compile the associated code and run it against a database! If people are interested, this is an open source project (available on SourceForge.net here), as I've implied already; and if anyone wants to help with the transition from a prototype to a real tool then let me know.

The example implementations are in terms of the MySQL database, but they could equally well be implemented in terms of any other interface library, or a database-independent interface library. The dependency on MySQL is hidden in the generated cpp files, so the database dependency is localised. For the example project, we've submitted the generated code to CVS, but for a real project this would, of course, be questionable.

The link to the Dbi generated source files is here and to the library support code here.

Space constraints leave some questions unanswered here. If answers to any of those posed below would interest you, please comment or email the author and let us know (comments are usually published automatically; emails to the author are published only with explicit permission). We'll probably make them part of future articles. Some possible issues are:

How do we ensure that there are no memory leaks given that all the objects are allocated on the heap?
Can we do this generically instead of polymorphically?
How can we expand this approach to many-to-many relationships?
How can we support late binding of objects associated by relationships?
How can we support the notion of a strong relationship so that one object exists if and only if another corresponding object exists?
How can we expand this approach so as to catch as many breaches of database constraints as possible at compile time? [A "proper" relational database will, of course, enforce such constraints automatically at every access, although some (indeed many) implementations may not – Ed]