Gold in the BI hills
Savvy developers are keeping beady eyes on the BI job market
Comment Organisations are rarely short of data, but the information it contains is often elusive.
Business Intelligence (BI) gives the business user an amazing tool: it turns data into information, making BI an area of huge growth and one where skilled developers are in short supply. It is worth knowing the job roles in a standard BI project so that you can see where you fit in and can secure a lucrative (sorry, useful and effective) role on a BI project.
The first task in any BI system is pulling data from a variety of disparate source systems (Finance, HR, Sales etc), cleaning it, standardising it, and finally loading it into a central repository (often known as a data warehouse). This whole process is known as ETL, Extract, Transform and Load. A new and crucial role, therefore, is that of ETL Developer.
There are many tools for this work: Integration Services within Microsoft’s SQL Server 2005, ETI Solution and Cognos DecisionStream to name but three. Apart from knowing how to drive a particular ETL application (and once you’re familiar with one, conversion to the next is relatively easy) you need to be good at devising the algorithms that will effectively clean a particular set of data.
Proficiency with SQL and the ability to use it imaginatively are vital as you tease out inconsistencies, duplications and other horrors from source system data. It’s a fascinating challenge, especially to anyone with a penchant for problem-solving: I love it. There is no step-by-step ‘do this, this and this and the data’ll all be fine’ approach available for data cleaning; instead you build up a collection of techniques which you apply as and when seems appropriate. If this all sounds a tad informal and lacking rigour, that’s because data cleaning is as much an art as a science.
Before the source data can enter the central repository, someone (or probably a committee of someones) has to agree a set of terminology and business rules to which the cleaned data must comply. It’s common for different systems within an organisation to use the same term to mean different things, and the differences can be fundamental. What is a ‘product’ within the organisation? One that’s manufactured in-house? One you buy in? Both? Neither?
Business rules describe limits or processes: for instance, orders for less than £500 can never attract more than five per cent discount. Without an all-pervasive set of definitions to which all data must conform, information derived from the central data store will lack accuracy and consistency.
Once a set of rules is agreed, cleaned data from the sources are ‘transformed’ in order to meet those rules: achieving this consistency is likely to involve tweaking data types, extensive use of look-up tables and domain checking. Finally the data is ready for loading into the data warehouse.
Once the data warehouse has been constructed, it requires maintenance. The data in the data warehouse is refreshed on a scheduled basis, nightly in many cases, to ensure that analysis is performed against the most up-to-date set of data possible. As a business evolves, changes are inevitable in the source data so new or re-written ETL routines will be required from time to time.
Ultimately the data warehouse is a just complex database and the clues for the next job roles are right there. All databases need to be developed (so BI systems need DBDs) and, once they have been developed, databases need to be administered (DBAs).
Another role is that of deploying BI solutions as OLAP (On-Line Analytical Processing) cubes. The data in OLAP cubes is not held in familiar relational tables but based on a way of structuring data that involves denormalised data held in dimensional tables which are grouped around a central fact table. The structure is often described as a star schema. The resulting cube of multi-dimensional data is ideal for analysis: it’s dedicated to servicing analysis and rapid response times make it possible for the user to follow ‘train of thought’ analysis paths. Data can be inspected from as many angles as there are dimensions: you might look at sales over time, or by geographical region, by product (now you know what one of those is) or outlet or sales personnel, or indeed all of those or any subset thereof.
Think about DBDs for a minute. They need to understand the relational structure, normalisation, indexing, referential integrity and so on, and also the language for manipulating relational data, SQL.
Developers who are designing and creating OLAP cubes will often need to extract data from relational tables, so they need to have all of the skills of a DBD. In addition they have to be able to think about, and manipulate data, as multi-dimensional sets.
The fundamental difference is that relational data is unordered while multi-dimensional data is highly ordered. This doesn't sound like much of a difference, but it is huge. Think about a very simple problem. A user asks to see the sales figures for two different product groups over a year but displayed as a monthly running total.
A simple (and perfectly reasonable request) that happens to be very difficult to deliver using pure SQL because SQL knows nothing about the order of the data. When we put data into a multi-dimensional cube, it is perfectly ordered so the problem is trivial to solve, as long as we have a language (essentially an equivalent of SQL) that allows us to manipulate multi-dimensional data.
That language is MDX (Multi-Dimensional eXpressions) – and knowledge of MDX is a very saleable skill. With MDX you can manipulate data (for tasks such as calculating sums, averages and so on) and also query it with commands that ‘understand’ the hierarchically-ordered data in the cube. For example, the MDX to create the running total shown above is:
So for developers, the business intelligence is suggesting that BI is not only here to stay but is also presenting a host of opportunities. Now is the time to take the plunge. ®