Data Analysis 2.0

What happens to traditional data analysis in the world of the web?

The World Wide Web can be thought of as one very large database, with information distributed in loosely connected nodes across a wide array of systems. Compare this to the historically structured world of the relational database management system (RDBMS), where data is neatly managed in tables and columns in a relatively closed environment.

The relational data analysis community has developed a wide array of tools and techniques to help manage and understand the information stored in these systems: entity-relationship (ER) models, naming standards, data dictionaries, and corporate data review communities, to name just a few. What happens to the rigour and procedures established by the “old school” data community with the arrival of new technologies such as XML, HTML, and RDF? Has true data analysis been lost with these new arrivals? Or have the core principles of data management been carried over into these new technologies, so that the business still receives accurate, relevant, and timely information at a similar or improved level of service?

The following sections compare traditional relational database modelling techniques with the newer web-based approaches, such as RDF (Resource Description Framework), which can be considered a data model for information on the World Wide Web. We’ll explore the similarities and differences between the technologies and discuss whether the net change is positive or negative for institutions seeking “reliable and understandable information”.

What’s the same?

In very simple terms, an RDF model for the web can be considered as the equivalent of the ER (Entity-Relationship) model for the RDBMS. Let’s look at a simple example. Consider the fact that “The book was written by Jane Doe.” In a traditional ER model, this information would be expressed as shown in Figure 1:

Figure 1: Diagram showing an Entity Relationship Model.
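
To make the relational side concrete, here is a rough sketch of the kind of DDL a physical model for this example might generate, run from Python against an in-memory SQLite database. The table and column names (`book`, `author`, `author_id`, `title`) and the book title are assumptions made for illustration; they are not taken from Figure 1.

```python
import sqlite3

# Hypothetical DDL for the "book was written by Jane Doe" example.
# The schema is an illustrative assumption, not the article's actual model.
DDL = """
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE book (
    book_id   INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    author_id INTEGER NOT NULL REFERENCES author(author_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# The fact itself is stored as rows that satisfy the schema.
conn.execute("INSERT INTO author (author_id, name) VALUES (1, 'Jane Doe')")
conn.execute("INSERT INTO book (book_id, title, author_id) VALUES (1, 'The Book', 1)")

# "Written by" is implied by the foreign key and recovered with a join.
row = conn.execute(
    "SELECT book.title, author.name FROM book JOIN author USING (author_id)"
).fetchone()
print(row)
```

Note that the relationship has no identity of its own in this sketch: it exists only as the `author_id` column on `book`.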

An RDF graph of the same concept would be represented as in Figure 2:

Figure 2: Diagram showing an RDF Graph.

This RDF graph has a node for the subject “book”, a node for the object “author”, and an arc (line) for the predicate that relates them. This should be fairly straightforward to anyone familiar with basic data modelling techniques (or, for that matter, basic English sentence structure).

The goal of both an ER model and an RDF graph is to provide a human-readable picture that can then be translated into a machine-readable format. Where a physical model in the relational database world can generate DDL (data definition language) to execute on a database, an RDF graph can be translated into “triples”, in which each node and predicate is represented by a URI (uniform resource identifier) that identifies the resource on the web. For example, the above graph would become the triple shown in Figure 3, below:

Figure 3: Diagram showing an RDF Triple.
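
As a minimal sketch of what such a triple looks like in machine-readable form, the snippet below models one statement as three URIs and writes it out as a line of N-Triples, a simple standard text format for RDF. The `example.org` URIs and the `writtenBy` term are hypothetical placeholders, not the ones used in Figure 3.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs chosen for illustration only.
BOOK = "http://example.org/books/book-1"
WRITTEN_BY = "http://example.org/terms/writtenBy"
JANE_DOE = "http://example.org/people/jane-doe"

triple = Triple(BOOK, WRITTEN_BY, JANE_DOE)


def to_ntriples(t: Triple) -> str:
    """Serialise a triple whose three terms are all URIs as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


print(to_ntriples(triple))
```

The sketch only handles the case where all three terms are URIs; real RDF also allows literal values (such as a plain string or number) in the object position.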

RDF rests on some of the same core concepts as the relational model, such as set theory, so an individual versed in the fundamentals of data management should find its foundations easy to understand.

What’s different?

While there are many similarities between ER modelling and RDF, there are some core differences as well—both technical and philosophical. In the traditional world of data modelling, the ER model can be a nearly sacred artefact. It is closely controlled by a group of data architects who carefully analyze each relationship for its correct syntax, cardinality, business definition, etc. On the logical or business side, interviews are conducted with a variety of stakeholders to provide the definition and context for each data construct and the relationships between data. On the physical side, changes are closely controlled, as the addition of a single column/field in a database can affect a large number of downstream operational systems.

The World Wide Web has a looser structure, again both technically and philosophically. Its physical structure consists of an “anything-to-anything” networked relationship of nodes, rather than the structured format of tables and columns in the RDBMS. The creation of links to relevant information is the core driver here. As such, relationships are considered first-class constructs, in that each is identified by its own URI rather than depending solely on the parent/child data objects as in the ER world. And, in the spirit of Web 2.0, they can be created by anyone: developers and architects no longer need to wait for the model or schema to be created or approved by a central team. This flexibility allows additional relationships to be created by those with different needs, viewpoints, and perspectives. For example, the shipping department may be more concerned about the book’s weight than its author, and a “has weight of” predicate can easily be added to refer to this additional data about the object.
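
The shipping example can be sketched in a few lines, treating a graph as nothing more than a set of triples. All URIs, the `hasWeightOf` term, and the weight value are hypothetical; the point is only that a new predicate is added as data, with no central schema change.

```python
# A graph modelled as a plain set of (subject, predicate, object) tuples.
# Every URI and value here is an illustrative assumption.
BOOK = "http://example.org/books/book-1"

graph = {
    (BOOK, "http://example.org/terms/writtenBy", "http://example.org/people/jane-doe"),
}


def add_statement(graph, subject, predicate, obj):
    """Add one statement; no table or column has to exist beforehand."""
    graph.add((subject, predicate, obj))


def describe(graph, subject):
    """Return every (predicate, object) pair recorded for a subject, sorted."""
    return sorted((p, o) for s, p, o in graph if s == subject)


# The shipping department introduces its own predicate, defined under its own URI.
add_statement(graph, BOOK, "http://example.org/shipping/hasWeightOf", "0.45 kg")

for predicate, obj in describe(graph, BOOK):
    print(predicate, "->", obj)
```

In the relational world the equivalent change would typically mean altering a table definition, which is exactly the kind of closely controlled schema change described earlier.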

The Net Result?

A very simplistic comparison then, between the RDBMS and RDF worlds (and this is simplistic—this article is by no means intended as a deep or comprehensive technical comparison) is that of flexibility over control. The traditional rigour of extensive analysis and discussion around a data model brought with it an increased confidence in the accuracy and clarity of an organization’s data. Or did it? How many traditional data analysis projects have failed after years of effort? Why is it so difficult to get consensus and buy-in on common data definitions? Perhaps a better solution is the “2.0” approach of allowing users to create their own data relationship definitions, as in the RDF method. This is the “Wikipedia problem” [possibly – I hope not - exemplified by some of the references I’ve added to this article – Ed]. Will allowing data consumers to “speak for themselves” create chaos or a more accurate and complete definition?

As the semantic web effort gains buy-in and acceptance, there will be more case histories on which to base our decision. One problem to date is that much of the technology still remains in the hands of technical architects and developers. Until RDF becomes more easily accessible to business users, it will be hard to determine whether the “power to the people” approach to data management is a successful one. We have certainly made information accessible to the average business consumer—can we now make it more relevant and understandable? This really does remain to be seen.

Donna Burbank is Director, Enterprise Modelling and Architecture Solutions at Embarcadero Technologies.
