Building a data warehouse on parallel lines

Original URL: https://www.theregister.com/2006/12/28/kognitio_data_warehouse/

Kognitio ergo something-for-nothing?

Posted in Software, 28th December 2006 13:00 GMT

Never look a gift horse in the mouth, especially if there are many of them running in parallel…

There are various structures we can use in a data warehouse – each with its pros and cons. For example, if you use a relational structure for the core of the warehouse then you gain very high flexibility but lose out on speed. Flexibility ensures that you can ask any question of the data and that you can drill down to the leaf-level data, but the potentially poor performance is always a pain. You can index the structure, but time and disk space usually limit the number of indexes you can apply, which in turn reduces the flexibility. As the users’ analytical requirements change, you need to update the indexing strategy, which is often complex and expensive. You can, of course, elect to use a different structure, perhaps a dimensional one, whereupon you gain speed but lose more flexibility.

Speed or flexibility, flexibility or speed? It’s often a difficult call because most of the time we need both. If you find yourself in this situation then some charming guys at Kognitio are amazingly, mind-bogglingly eager to talk to you because they believe that this is precisely what their product WX2 promises. You, on the other hand, are cynically aware that promises are easy and that if there were a simple solution, someone would have thought of it years ago.

In fact, they did. We’ve known for years that parallel processing and in-memory data processing are both mind-bogglingly fast; the problem has always been one of cost-effective implementation. WX2 is an RDBMS (Relational DataBase Management System) implemented as an MPP (Massively Parallel Processing) system built out of commodity servers, typically blades. These blades form the nodes in what is called a VDA (Virtual Data Appliance). (Well, you didn’t expect to get to grips with a whole new technology without having to learn a whole new abbreviation, did you?) Each node consists of one or more CPUs, a block of memory and some disks. The nodes don’t share resources, so this is a shared-nothing architecture.

How does it work? Well, imagine a VDA with eight nodes. The data for analysis is distributed evenly (and randomly) across the disks in all eight nodes. Data can be loaded and then queried in parallel but, happily, the software handles all of this automatically, so developers working with Kognitio are not required to think in parallel. As soon as the load completes, the data is available for querying; there is no pause while indexes are created, for the simple reason that WX2 doesn’t use any. Instead, it manages to perform all of the queries in memory.
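To make the "evenly (and randomly)" idea concrete, here is a minimal Python sketch. It is not Kognitio's placement algorithm, which the article does not describe; the `distribute` function, the row layout and the fixed seed are all assumptions made purely for illustration. It simply shows that assigning each row to a random node out of eight leaves every node holding close to one eighth of the data.

```python
import random

NUM_NODES = 8  # the eight-node VDA imagined in the text


def distribute(rows, num_nodes=NUM_NODES, seed=0):
    """Assign each row to a randomly chosen node (hypothetical placement rule)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[rng.randrange(num_nodes)].append(row)
    return nodes


# Illustrative data: 80,000 made-up rows, so about 10,000 per node is expected.
rows = [{"id": i, "amount": i % 100} for i in range(80_000)]
nodes = distribute(rows)
sizes = [len(node) for node in nodes]
print(sizes)
```

Random placement needs no knowledge of the data or of the queries to come, which fits the article's point that nothing has to be indexed or tuned before the data can be queried.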

If a simple query comes in that only touches data from one node, then that node handles the query. Now imagine that a query comes in that requires (as most are likely to) data from several nodes. The data is read from the appropriate nodes and copied to the memory on one of the nodes, which then processes the data and returns the answer set. As we said above, this isn’t a new idea either; everyone knows that RAM is much, much faster than disk. The problem has always been to find an effective algorithm that can balance the massive storage capacity of disks against the speed of RAM, ensuring that the data is available for ad hoc querying as rapidly as possible. The trick is not the overall idea, it is the implementation. The Devil, as they say, is in the detail.
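The two cases described above, a query answered by a single node and a query whose data is gathered from several nodes into one node's memory, can be sketched in Python as follows. This is an illustration only, not WX2's actual query engine: the `NODES` table, the `scan` and `total_sales` helpers and the sample figures are all hypothetical, and threads stand in for what would be separate servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for three shared-nothing nodes; each owns its own rows.
NODES = {
    0: [{"region": "north", "sales": 10}, {"region": "south", "sales": 5}],
    1: [{"region": "north", "sales": 7}],
    2: [{"region": "east", "sales": 3}, {"region": "north", "sales": 1}],
}


def scan(node_id, region):
    """A node filters only the rows it holds locally."""
    return [row for row in NODES[node_id] if row["region"] == region]


def total_sales(region, node_ids):
    """Scan the given nodes in parallel, gather the rows in one place, then aggregate."""
    with ThreadPoolExecutor(max_workers=len(node_ids)) as pool:
        partials = list(pool.map(lambda n: scan(n, region), node_ids))
    # The matching rows are copied into one node's memory and processed there.
    gathered = [row for part in partials for row in part]
    return sum(row["sales"] for row in gathered)


print(total_sales("east", [2]))         # single-node query
print(total_sales("north", [0, 1, 2]))  # query spanning several nodes
```

The hard part the article alludes to, deciding what to keep in RAM and what to leave on disk, is deliberately left out of this sketch.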

In addition, the architecture that Kognitio has elected to use has a very desirable side-effect: scalability. The company claims, for example, that “the query performance of a 100-server WX2 system with 10TB of data will be the same as that of a 10-server system with 1TB of data.” To put that another way, if you have a 20-node system which performs well with 100 users, then a 40-node system will perform equally well with 200 users. If you want a third way of looking at this, you can simply add nodes to compensate for more data, more users, or to gain performance. The company claims that its architecture means that there is no measurable overhead as nodes are added, because the “WX’s fully parallel architecture produces true linear scalability.”
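One way to read the linear-scalability claim is that performance depends on the load each node carries, so configurations with the same load per node should behave alike. That reading is an interpretation rather than something Kognitio spells out, and the `load_per_node` helper below is hypothetical, but the arithmetic on the quoted figures is easy to check in Python:

```python
from fractions import Fraction


def load_per_node(total, nodes):
    """Share of the total (TB of data, or users) that each node carries."""
    return Fraction(total, nodes)


# Figures quoted in the text; the per-node reading is an assumption.
small_data = load_per_node(1, 10)     # 1TB over 10 servers
large_data = load_per_node(10, 100)   # 10TB over 100 servers
small_users = load_per_node(100, 20)  # 100 users on 20 nodes
large_users = load_per_node(200, 40)  # 200 users on 40 nodes

print(small_data == large_data, small_users == large_users)
```

Both pairs come out equal: a tenth of a terabyte per server in the first example and five users per node in the second. Whether real systems hold to that as nodes are added is exactly the claim that has to be tested.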

There are, of course, already ways of achieving both speed and flexibility. We can, for example, create a relational data warehouse and a set of dimensional data marts. Kognitio argues that this is fine in some cases but that many companies find the solution too baroque. For a start, they need to employ developers for both relational and dimensional databases and, in addition, this solution involves multiple copies of the analytical data, which makes auditing a nightmare.

And Kognitio, of course, isn’t the only company that is offering a novel implementation of data warehousing. Check out, for example, Netezza and DATAllegro [but remember that Kognitio is available as software-only – Ed].

Ultimately, all of these products break the "traditional" way in which data warehouses are built. Kognitio is aware that it can preach as much as it likes, but developers are always (and quite rightly) sceptical. So it has created testing facilities where it “will build you a data warehouse for free and let you analyse your data in days” – which you can find here. So, in this case at least, thinking outside the box doesn’t have to cost you anything but time.®
