Original URL: http://www.theregister.co.uk/2006/01/26/sql_server_bi/
SQL Server 2005 Business Intelligence
Would Sir like BI with his database?
Business Intelligence (BI) is an umbrella term for systems and processes that turn a mass of opaque data into useful business information. SQL Server 2005 incorporates radical changes into its BI capabilities.
BI typically involves the construction of a data warehouse which pulls together disparate data held in different data sources within (and even outside) an organisation. Constructing a warehouseful of clean, coherent data is not easy. Not only is it hard to build, it’s also hard to maintain. Instead of adding further bolt-on tools, for 2005 the SQL Server development team stood back and attempted to solve the fundamental problems of data warehouse design and development. The UDM (Unified Dimensional Model) and Pro-active Caching are the results.
The UDM is essentially a layer that sits between the users and the source systems. This layer models the analyses that users wish to perform on the data.
The traditional approach to warehousing is to perform nightly extracts from the source systems and restructure that data into OLAP (On Line Analytical Processing) cubes. The cubes can be accessed very rapidly but the data is never exactly current. In SQL Server 2005 the caching system has been totally redesigned and is much more flexible. OLAP cubes can now act as a ‘pro-active’ cache - the data therein can be refreshed from the source systems much more frequently than every 24 hours. The initial structure of the pro-active cache is automatically determined by the UDM and can be tuned to a desirable balance between query performance and the latency of data in the cache. The cache itself has a degree of intelligence built in, which enables it to learn from the queries that users run against it and respond by modifying the structure of the cache.
The UDM and Pro-active Caching bring the goal of real-time data warehousing into the realms of the possible from the badlands of the difficult.
When the DTS (Data Transformation Services) came out in SQL Server 7 in 1998, it was a good tool with a usable UI. Over the years data volumes have increased and latterly the DTS’s scalability has been seriously called into question. In addition, as we have continued to learn more about how to extract, transform and load data, its ETL abilities started to look long in the tooth. By the year 2005 it was positively walrus-like.
For SQL Server 2005, Microsoft scrapped the DTS altogether and started again. One major change is the new name - Integration Services (more information here). Another is that the new product explicitly separates the data control from data flow. There are separate environments for designing tasks that determine data control and those that determine data flow, improving both the design process and later maintenance. Package variables can now be defined, and can be inspected and scoped to a package, task or loop. All packages are stored in SQL Server or as XML in the file system, so that they can be managed under source control.
A DTS 2000 package migration wizard can help move existing packages to the new version, issuing warnings where problems are encountered. For those hard-to-migrate packages the DTS runtime (part of SQL Server 2005) runs old-style packages without upgrading.
SQL Server 2000 came with two data mining algorithms (Decision Trees and Clustering) as part of the deal. Way, way back then it was astonishing not to have to hand over large wodges of extra cash for such data mining tools. The algorithms themselves were good, although the UI for visualising the results they produced was crude and scalability was something of an issue. However, it was hard to complain overmuch given that you had access to two algorithms for nothing when other manufacturers were charging megabucks.
Joining these two in SQL Server 2005 are five further flavours of data mining:
Association rules: find rules in data that say This and That lead to The Other. One major application is analysis of web site usage.
Naïve Bayes: for classifying data and highlighting differences between separate groups, for instance between house owners and house renters.
Neural Nets: finds complex patterns and relationships in data. It runs relatively slowly but may turn up patterns that are not identified by other algorithmic approaches.
Sequence Clustering: for analysing a series of events and predicting the likely next event in the series.
Time Series: for investigating time-based data. Sales data is a classic example: it can tell whether the purchase of Product X will lead to a purchase of Product Z.
When Reporting Services was released during the beta programme it was greeted with such rapture that a version was released for SQL Server 2000. The 2005 version is a major enhancement to the BI toolset with support for ad hoc querying.
WOW! It all sounds great!
The UDM, Integration Services, new mining algorithms – it all sounds too good to be true. Well, it isn’t, Microsoft is genuinely including world class BI software with certain versions of SQL Server. But there is a problem, particularly for those who have already deployed SQL Server 2000 BI solutions.
BI in SQL Server 2005 is like a foreign country, they do things differently there. By which I mean that Microsoft hasn’t just provided a better ETL tool and a new set of mining algorithms. By introducing the UDM and pro-active caching it has totally re-written the way in which we solve BI problems.
Why didn’t the big M just leave well enough alone? The answer is that the old ways were developed during the late 1980 and ‘90s. Since then we have learnt a great deal more about BI. In addition, the concept of proactive caching has been around since that time. It has remained unimplemented commercially up until now because the algorithms required to update a MOLAP (Multidimensional Online Analytical Processing - traditional OLAP) cube in real time are necessarily very complex and, in addition, proactive caching soaks up huge resources in terms of CPU cycles and memory.
Indeed, cynics might suggest that it is just as well that Microsoft has taken five years to produce this version of SQL Server – the huge delay has allowed CPU development to catch up and RAM costs to drop to the point where we can just about afford to use it.
So, bear in mind that if you are an existing BI developer you have a great deal of conceptual material to absorb before you can deploy an effective BI solution using AQL Server 2005. In addition, if you take an existing BI solution and upgrade it without thought, you will end up with a SQL Server 2000 BI solution that happens to run under SQL Server 2005. You will have undergone all of the pain of a migration and you will reap essentially zero reward. An effective upgrade of an existing application will require a total redesign (and probably a new, bigger, box to provide all of those extra cycles).
The BI capabilities of SQL Server 2005 form the most important part of Microsoft’s strategy to take over the world, at least in database terms. Ten years ago BI was an incredibly costly game: everything was expensive - the ETL tools, the data mining algorithms and the multi-dimensional database engine. Now, if you happen to be buying SQL Server 2005 as a database engine, the BI tools are free in the box (depending on the version you buy). For many companies this is likely to be a very tempting offer. ®