Microsoft vs. Teradata
Data Warehousing – there really isn't just one answer
Microsoft and Teradata are both significant players in the BI market but they have wildly different approaches to the challenges of extracting information from data. The reason lies in the fact that the two companies elected to solve two very different, but equally intractable, computational problems in order to get their BI systems to perform well.
Two different approaches
Business Intelligence is a complex area and generalisations are notoriously imprecise, but without generalisations discussions become book length, so let’s generalise a little.
BI is about extracting information from data. The data in most enterprises is distributed across multiple transactional systems (Finance, Sales, HR etc.) so we have to pull it into one place before we can analyse it.
A wish list for an efficient BI system looks something like this:
- Rapid movement of data from source systems to analytical system
- Easy auditing of data
- Minimum number of copies of the data
- Rapid analytical queries (2-3 seconds)
- Users presented with an ‘analytical view’ of data
As a general rule, the more copies of the data there are, the more difficult it is to audit, so points 2 and 3 are linked. Despite its position at number 4, rapid analytical querying is very important: an analytical query may be showing information that results from the aggregation of five billion rows in a source system, yet it must return the answer in two to three seconds. Point 5 covers the requirements of the end users: they certainly don’t want to see tables in a relational database; they want to work with dimensions and measures, or some near equivalent.
Given this common wish list, how did Microsoft and Teradata end up with such different strategies?
In Teradata’s world (shown on the left of Figure 1, above), the extracted and cleaned data is placed in a central store, known as an Enterprise Data Store or (these days) Enterprise Data Warehouse (EDW). There it is held as a relational structure and all the analytical queries are run directly against the data in the EDW.
In the Microsoft world (the right-hand side of Figure 1), data is placed in a central store or data warehouse, which is also typically structured as relational tables. However, subsets of the data are then moved from the warehouse into data marts and restructured as multi-dimensional data, and it is against these data marts that queries are run.
These two approaches are radically different because the two companies have chosen to solve the overall problem of BI by solving two different computational problems – both of which have been serious thorns in the side of commercial computing since the mid-1980s.
The age-old problem Teradata addressed is simple to express – it is very difficult to run fast analytical queries against a relational structure.
Teradata solved this problem using a mix of parallel hardware and innovative software, not only solving the problem for small data sets but providing a solution that scales to truly massive data sets.
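As a purely generic illustration of why parallel hardware helps with this kind of query, the sketch below splits a table of rows across several workers, lets each one aggregate its own slice, and then merges the partial totals. It is not a description of Teradata's architecture: the function names, the round-robin partitioning and the use of threads as stand-ins for separate processing nodes are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_workers):
    """Deal rows out round-robin, one slice per (hypothetical) worker node."""
    parts = [[] for _ in range(n_workers)]
    for i, row in enumerate(rows):
        parts[i % n_workers].append(row)
    return parts


def local_totals(rows):
    """Aggregate one slice: total amount per region."""
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals


def parallel_totals(rows, n_workers=4):
    """Aggregate every slice independently, then merge the partial results."""
    merged = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(local_totals, partition(rows, n_workers)):
            merged.update(partial)  # Counter.update adds the values together
    return dict(merged)


# Illustrative data only: (region, amount) pairs.
rows = [("EMEA", 100), ("Americas", 50), ("EMEA", 70), ("EMEA", 30), ("Americas", 5)]
totals = parallel_totals(rows)
```

The point of the sketch is only that each slice can be summarised without reference to the others, so the work divides cleanly as the data grows; in a real system the slices would live on separate hardware rather than in threads.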
Once you solve this problem, then a side effect is that you can keep the BI structure very simple. In turn, that means that the majority of the wish list is automatically satisfied; indeed points 1 - 4 are natural side effects of the solution.
The data only moves once, so the delays are minimised. Only two copies of the data are held, one in the originating source systems and one in the EDW, so auditing is about as easy as it is going to get.
And the final wish list point? In order to hide the complexity of the relational store, Teradata has placed a logical layer between the user and the EDW or EDS data structure (see Figure 2, below). This translates the relational views of the data into analytical views so the users never have to see the relational structure.
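One simple way to picture such a logical layer is a database view that joins the relational tables and exposes only dimension names and a measure. The following is a minimal sketch using SQLite from Python; the table, column and view names are hypothetical, and it makes no claim to reflect how Teradata's own layer is built.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Relational structure: one fact table and two lookup tables.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, region_name  TEXT);
    CREATE TABLE sales   (product_id INTEGER, region_id INTEGER, amount REAL);

    INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO region  VALUES (1, 'EMEA'), (2, 'Americas');
    INSERT INTO sales   VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 70.0), (1, 1, 30.0);

    -- Logical layer: users see dimensions (product, region) and a measure.
    CREATE VIEW sales_by_product_region AS
    SELECT p.product_name AS product,
           r.region_name  AS region,
           SUM(s.amount)  AS total_sales
    FROM sales s
    JOIN product p ON p.product_id = s.product_id
    JOIN region  r ON r.region_id  = s.region_id
    GROUP BY p.product_name, r.region_name;
""")


def analytical_view(connection):
    """Query the view only; the underlying tables and joins stay hidden."""
    query = ("SELECT product, region, total_sales "
             "FROM sales_by_product_region ORDER BY product, region")
    return connection.execute(query).fetchall()


result = analytical_view(conn)
```

Note that the view adds no extra copy of the data: every query against it is still answered from the relational tables, which is why the speed of the underlying engine matters so much in this approach.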
The problem that Microsoft elected to solve was that of producing an efficient multi-dimensional database engine that was fast and also cured the OLAP data explosion problem. This is another non-trivial problem but solving it, and using the resulting technology in the data marts, automatically solves Points 4 & 5 in our wish list. The data can be aggregated and that gives the blistering speed that’s required. In addition, multi-dimensional data means that users automatically get a hierarchical, dimensional and measured view of the data.
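The trade-off behind pre-aggregation can be sketched in a few lines. The example below naively materialises every possible roll-up of a tiny fact table, so that any aggregate query becomes a single lookup, and also computes the upper bound on the number of stored cells. The data, the names and the `ALL` marker are assumptions for illustration; this is the brute-force approach that causes data explosion, not the technique Microsoft's engine uses to avoid it.

```python
from collections import defaultdict
from itertools import product
from math import prod

ALL = "*"  # marker meaning "rolled up over every member of this dimension"

# Illustrative fact rows: (product, region, year, sales amount).
facts = [
    ("Widget", "EMEA", 2006, 100.0),
    ("Widget", "Americas", 2006, 50.0),
    ("Gadget", "EMEA", 2007, 70.0),
    ("Widget", "EMEA", 2007, 30.0),
]


def build_cube(rows):
    """Precompute the total for every combination of dimension values and roll-ups."""
    cube = defaultdict(float)
    for *dims, measure in rows:
        # Each dimension either keeps its value or is rolled up to ALL.
        for key in product(*[(d, ALL) for d in dims]):
            cube[key] += measure
    return dict(cube)


def max_cells(cardinalities):
    """Upper bound on cube size: each dimension contributes its members plus ALL."""
    return prod(c + 1 for c in cardinalities)


cube = build_cube(facts)
widget_total = cube[("Widget", ALL, ALL)]  # answered by lookup, not by scanning facts
emea_total = cube[(ALL, "EMEA", ALL)]
```

Even here the cube holds several times more cells than there are fact rows, and `max_cells` grows multiplicatively with each added dimension: three dimensions of 1,000 members each already allow for roughly a billion cells. That growth is the explosion an efficient multi-dimensional engine has to tame.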
On the other hand, Microsoft’s approach means that you essentially accept that load times will be slower and auditing more of a challenge because of the proliferation of extra copies of data in the data marts. You also accept that the process will burn up more disk space.
However, supporters of this approach argue that the first three wish list points are not, in practice, much of an issue. They point out that disk space and CPU cycles are cheap, that auditing can be automated, and that Microsoft is developing techniques such as proactive caching that essentially compensate for the delays in organising the data, bringing real-time analysis ever closer.
So, which is better?
One point is reasonably clear. If you have a need for a BI system that holds an awesomely large set of data, you will certainly be talking to Teradata. The company can field an impressive list of customers in the ‘monstrously, overwhelmingly, huge’ category. So, if we are simply going to rate the two strategies on ‘My BI system can be bigger than yours’ then Teradata wins.
But such a rating is nonsense for most enterprises. By definition, the average enterprise has an average BI requirement and both Microsoft and Teradata can provide a solution here. (Actually, assuming the skewed distribution that probably exists, we could even say that the modal company has a below-average requirement, but let’s not get picky). So both of these BI vendors have an appropriate technical solution for most companies and in practice, there seems genuinely to be very little overlap. Hermann Wimmer (Teradata’s Vice President of EMEA) told me that Teradata tends to focus only on the largest companies. Microsoft’s mantra has, for years, been “BI for the masses”.
In terms of the technologies, it is tempting to extrapolate that Microsoft couldn’t solve the problem of analytical access to relational data and therefore chose to ‘work around’ it. This is doubtless an oversimplification because, whilst it is true that this particular problem is known to be difficult to solve, it was also known to be soluble by the time Microsoft took a serious commercial interest in BI (Teradata had already solved it). So, given its huge resources, Microsoft could have cracked the problem. In the same way, I have no doubt that Teradata could ‘do’ a multi-dimensional database engine if it elected to address the problem.
In addition, Teradata’s systems have always been ‘reassuringly expensive’. So Microsoft may well have rejected the highly specialised solution (that works for all conceivable sizes of data) and elected to pursue a line that offers a much more cost-effective solution for the majority of potential customers.
The bottom line is that while Teradata’s solution fits all, and may sometimes be the only feasible one, Microsoft’s is likely to be much more cost-effective for the majority.
I am quite well aware that the relational model is a logical model and that it is therefore nonsense to imply that relational structures are inherently slow for the simple reason that the model says nothing about implementation on disk. The reason for the poor analytical performance of relational systems lies in the way that most RDBMS engine designers have elected to store their data structures on disk; it doesn’t lie with the relational model itself. Nevertheless, it remains true that on comparable hardware, analytical access to multi-dimensional data is usually orders of magnitude faster than the same access to data stored in the current crop of mainstream relational engines.