Data Analysis 2.0
What happens to traditional data analysis in the world of the web?
The World Wide Web can be thought of as one very large database, with information distributed in loosely-connected nodes across a wide array of systems. Compare this to the historically structured world of the relational database management system (RDBMS), where data is neatly managed in tables and columns in a relatively closed environment.
The relational data analysis community has developed a wide array of tools and techniques to help manage and understand the information stored in these systems: entity-relationship (ER) models, naming standards, data dictionaries, and corporate data review communities, to name a few. What happens to the rigour and procedures established by the “old school” data community with the arrival of newer technologies such as XML, HTML, and RDF? Has true data analysis been lost with these new arrivals? Or have the core principles of data management been carried over and transformed, so that these technologies give the business similar or improved access to accurate, relevant, and timely information?
The following sections compare traditional relational database modelling techniques with the newer web-based approaches, such as RDF (Resource Description Framework), which can be considered a data model for information on the World Wide Web. We’ll explore the similarities and differences between the technologies and discuss whether the net change is positive or negative for institutions seeking “reliable and understandable information”.
What’s the same?
In very simple terms, an RDF model for the web can be considered as the equivalent of the ER (Entity-Relationship) model for the RDBMS. Let’s look at a simple example. Consider the fact that “The book was written by Jane Doe.” In a traditional ER model, this information would be expressed as shown in Figure 1:
An RDF graph of the same concept would be represented as in Figure 2:
This RDF graph contains a node for the subject (the book), a node for the object (the author), and an arc/line for the predicate, the relationship between them. This should be fairly straightforward to anyone familiar with basic data modelling techniques (or basic English sentence structure, for that matter).
The goal of both an ER model and an RDF graph is to provide a human-readable picture that can then be translated into a machine-readable format. Where a physical model in the relational database world can generate DDL (data definition language) to execute on a database, an RDF graph can be translated into “triples” in which each node and predicate is identified by a URI (uniform resource identifier), which names the resource and often also gives the location of the information on the web. For example, the above graph would become the triple shown in Figure 3, below:
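As a rough sketch of the two machine-readable forms, the Python below first runs DDL for a small book/author schema in an in-memory SQLite database, then writes the same fact as a single triple. The table and column names, the book title, and the `http://example.org/` URIs are all assumptions made for illustration; none of them is taken from the article's figures.

```python
import sqlite3

# Relational side: DDL as it might be generated from a (hypothetical) physical model.
DDL = """
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE book (
    book_id   INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    author_id INTEGER NOT NULL REFERENCES author(author_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute("INSERT INTO author (author_id, name) VALUES (1, 'Jane Doe')")
conn.execute("INSERT INTO book (book_id, title, author_id) VALUES (1, 'The Book', 1)")

# The fact "the book was written by Jane Doe" is recovered with a join.
row = conn.execute(
    "SELECT b.title, a.name FROM book b JOIN author a ON a.author_id = b.author_id"
).fetchone()

# RDF side: the same fact as one (subject, predicate, object) triple.
# All three URIs are placeholders under example.org.
EX = "http://example.org/"
triple = (EX + "book/1", EX + "terms/writtenBy", EX + "person/jane-doe")


def to_ntriples(t):
    """Serialize a URI-only triple as one line of N-Triples."""
    s, p, o = t
    return f"<{s}> <{p}> <{o}> ."


line = to_ntriples(triple)
```

In the relational form the relationship lives in a foreign-key column, while in the triple it is a named resource in its own right.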
As RDF is based upon the same core concepts of set theory and relational algebra, there are some foundational similarities in the technologies that an individual versed in the fundamentals of data management should easily understand.
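One way to see the overlap is to treat a collection of triples as a plain set of three-part tuples and query it with the familiar selection and projection operations. The sketch below assumes short prefixed names (`ex:book1`, `ex:writtenBy`, and so on) in place of full URIs; the names and the `match` helper are hypothetical, not part of any RDF library.

```python
# A tiny in-memory "graph": a set of (subject, predicate, object) tuples.
# The prefixed names stand in for full URIs and are invented for this example.
graph = {
    ("ex:book1", "ex:writtenBy", "ex:janeDoe"),
    ("ex:book2", "ex:writtenBy", "ex:janeDoe"),
    ("ex:book3", "ex:writtenBy", "ex:johnRoe"),
}


def match(triples, s=None, p=None, o=None):
    """Selection: keep the triples that agree with every non-None position."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


# Selection followed by projection onto the subject position, comparable to
# SELECT book FROM written_by WHERE author = 'janeDoe' in a relational system.
books_by_jane = {s for (s, _, _) in match(graph, p="ex:writtenBy", o="ex:janeDoe")}
```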
While there are many similarities between ER modelling and RDF, there are some core differences as well—both technical and philosophical. In the traditional world of data modelling, the ER model can be a nearly sacred artefact. It is closely controlled by a group of data architects who carefully analyze each relationship for its correct syntax, cardinality, business definition, etc. On the logical or business side, interviews are conducted with a variety of stakeholders to provide the definition and context for each data construct and the relationships between data. On the physical side, changes are closely controlled, as the addition of a single column/field in a database can affect a large number of downstream operational systems.
The World Wide Web has a looser structure, again both technically and philosophically. Its physical structure consists of an “anything-to-anything” network of nodes, rather than the structured format of tables and columns in the RDBMS. The creation of links to relevant information is the core driver here. As such, relationships are considered first-class constructs, in that each is identified by its own URI rather than depending solely on the parent/child data objects as in the ER world. And, in the spirit of Web 2.0, they can be created by anyone: developers and architects no longer need to wait for the model or schema to be created or approved by a central team. This flexibility allows additional relationships to be created by those with different needs, viewpoints, and perspectives. For example, the shipping department may be more concerned about the book’s weight than its author, and a “has weight of” predicate can easily be added to record this additional data about the book.
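The shipping example can be sketched in a few lines. In the triple form, recording a weight is one more statement; in the relational form, it is a change to the shared table definition. The prefixed names, the weight value, and the table layout are assumptions chosen for illustration.

```python
import sqlite3

# RDF-style: a new kind of statement is just one more triple in the set.
graph = {("ex:book1", "ex:writtenBy", "ex:janeDoe")}
graph.add(("ex:book1", "ex:hasWeightOf", "0.45 kg"))  # literal value, invented

predicates = {p for (_, p, _) in graph}

# Relational-style: the same addition requires a schema change.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (book_id INTEGER PRIMARY KEY, author TEXT)")
conn.execute("INSERT INTO book VALUES (1, 'Jane Doe')")
conn.execute("ALTER TABLE book ADD COLUMN weight_kg REAL")  # alters the shared table definition
conn.execute("UPDATE book SET weight_kg = 0.45 WHERE book_id = 1")

# Column names after the change, read back from SQLite's table metadata.
columns = [row[1] for row in conn.execute("PRAGMA table_info(book)")]
```

The triple was added without touching anything the author-focused consumers rely on, whereas every system reading the `book` table now sees a new column.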
The Net Result?
A very simplistic comparison between the RDBMS and RDF worlds (and it is simplistic; this article is by no means intended as a deep or comprehensive technical comparison) is that of flexibility versus control. The traditional rigour of extensive analysis and discussion around a data model brought with it increased confidence in the accuracy and clarity of an organization’s data. Or did it? How many traditional data analysis projects have failed after years of effort? Why is it so difficult to get consensus and buy-in on common data definitions? Perhaps a better solution is the “2.0” approach of allowing users to create their own data relationship definitions, as in the RDF method. This is the “Wikipedia problem” [possibly (I hope not) exemplified by some of the references I’ve added to this article – Ed]. Will allowing data consumers to “speak for themselves” create chaos, or a more accurate and complete definition?
As the semantic web effort gains buy-in and acceptance, there will be more case histories on which to base our decision. One problem to date is that much of the technology still remains in the hands of technical architects and developers. Until RDF becomes more easily accessible to business users, it will be hard to determine whether the “power to the people” approach to data management is a successful one. We have certainly made information accessible to the average business consumer—can we now make it more relevant and understandable? This really does remain to be seen.
Donna Burbank is Director, Enterprise Modelling and Architecture Solutions at Embarcadero Technologies.