Feeds

Building a data warehouse on parallel lines

Kognitio ergo something-for-nothing?

Choosing a cloud hosting partner with confidence

Never look a gift horse in the mouth, especially if there are many of them running in parallel…

There are various structures we can use in a data warehouse – each with its pros and cons. For example, if you use a relational structure for the core of the warehouse then you gain very high flexibility but lose out on speed. Flexibility ensures that you can ask any question of the data and that you can drill down to the leaf level data - but the potentially poor performance is always a pain. You can index the structure but time and disk space usually limit the number of indexes you can apply which in turn reduces the flexibility. As the users’ analytical requirements change, so you need to update the indexing strategy which is often complex and expensive. You can, of course, elect to use a different structure, perhaps a dimensional one, whereupon you gain speed but lose more flexibility.

Speed or flexibility, flexibility or speed? It’s often a difficult call because most of the time we need both. If you find yourself in this situation then some charming guys at Kognitio are amazingly, mind-bogglingly eager to talk to you because they believe that this is precisely what their product WX2 promises. You, on the other hand, are cynically aware that promises are easy and that if there were a simple solution, someone would have thought of it years ago.

In fact, they did. We’ve know for years that parallel processing and in-memory data processing are both mind-boggling fast; the problem has always been one of cost-effective implementation. WX2 is an RDBMS (Relational DataBase Management System) implemented as a MPP (Massively Parallel Processing) system built out of commodity servers, typically blades. These blades form the nodes in what is called a VDA (Virtual Data Appliance). (Well, you didn’t expect to get to grips with a whole new technology without having to learn a whole new abbreviation did you?) Each node consists of one or more CPUs, a block of memory and some disks. The nodes don’t share resources so this is a shared-nothing architecture.

How does it work? Well, imagine a VDA with eight nodes. The data for analysis is distributed evenly (and randomly) across the disks in all eight nodes. Data can be loaded and then queried in parallel but happily, the software handles all of this automatically, so developers working with Kognitio are not required to think in parallel. As soon as the load completes, the data is available for querying; there is no pause while indexes are created for the simple reason that WX2 doesn’t use any. Instead, it manages to perform all of the queries in memory.

If a simple query comes in that only touches data from one node, then that node handles the query. Now imagine that a query comes in that requires (as most are likely to) data from several nodes. The data is read from the appropriate nodes and copied to the memory on one of the nodes, which then processes the data and returns the answer set. As we said above, this isn’t a new idea either; everyone knows that RAM is much, much faster than disk. The problem has always been to find an effective algorithm that can balance the massive storage capacity of disks against the speed of RAM, ensuring that the data is available for ad hoc querying as rapidly as possible. The trick is not the overall idea, it is the implementation. The Devil, as they say, is in the detail.

In addition, the architecture that Kognitio has elected to use has a very desirable side-effect: scalability. The company claims, for example, that “the query performance of a 100-server WX2 system with 10TB of data will be the same as that of a 10-server system with 1TB of data.” To put that another way, if you have a 20 node system which performs well with 100 users, then a 40 node system will perform equally well with 200 users. If you want a third way of looking at this, you can simply add nodes to compensate for more data, more users, or to gain performance. The company claims that its architecture means that there is no measurable overhead as nodes are added, because the “WX’s fully parallel architecture produces true linear scalability.”

There are, of course, already ways of achieving both speed and flexibility. We can, for example, create a relational data warehouse and a set of dimensional data marts. Kognitio argues that this is fine in some cases but that many companies find the solution too baroque. For a start, they need to employ developers for both relational and dimensional databases and in addition, this solution involves multiple copies of the analytical data, which makes auditing a nightmare.

And Kognitio, of course, isn’t the only company that is offering a novel implementation of data warehousing. Check out, for example Netezza and DATallegro [but remember that Kognitio is available as software-only – Ed].

Ultimately, all of these products break the "traditional" way in which data warehouses are built. Kognitio is aware that it can preach as much as it likes, but developers are always (and quite rightly) sceptical. So it has created testing facilities where it “will build you a data warehouse for free and let you analyse your data in days” – which you can find here. So, in this case at least, thinking outside the box doesn’t have to cost you anything but time.®

Intelligent flash storage arrays

More from The Register

next story
Netscape Navigator - the browser that started it all - turns 20
It was 20 years ago today, Marc Andreeesen taught the band to play
UNIX greybeards threaten Debian fork over systemd plan
'Veteran Unix Admins' fear desktop emphasis is betraying open source
Sign off my IT project or I’ll PHONE your MUM
Honestly, it’s a piece of piss
Return of the Jedi – Apache reclaims web server crown
.london, .hamburg and .公司 - that's .com in Chinese - storm the web server charts
Chrome 38's new HTML tag support makes fatties FIT and SKINNIER
First browser to protect networks' bandwith using official spec
Admins! Never mind POODLE, there're NEW OpenSSL bugs to splat
Four new patches for open-source crypto libraries
Torvalds CONFESSES: 'I'm pretty good at alienating devs'
Admits to 'a metric ****load' of mistakes during work with Linux collaborators
prev story

Whitepapers

Forging a new future with identity relationship management
Learn about ForgeRock's next generation IRM platform and how it is designed to empower CEOS's and enterprises to engage with consumers.
Cloud and hybrid-cloud data protection for VMware
Learn how quick and easy it is to configure backups and perform restores for VMware environments.
Three 1TB solid state scorchers up for grabs
Big SSDs can be expensive but think big and think free because you could be the lucky winner of one of three 1TB Samsung SSD 840 EVO drives that we’re giving away worth over £300 apiece.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.