Building a data warehouse on parallel lines

Kognitio ergo something-for-nothing?

Never look a gift horse in the mouth, especially if there are many of them running in parallel…

There are various structures we can use in a data warehouse – each with its pros and cons. For example, if you use a relational structure for the core of the warehouse then you gain very high flexibility but lose out on speed. Flexibility ensures that you can ask any question of the data and that you can drill down to the leaf-level data, but the potentially poor performance is always a pain. You can index the structure, but time and disk space usually limit the number of indexes you can apply, which in turn reduces the flexibility. As the users’ analytical requirements change, you need to update the indexing strategy, which is often complex and expensive. You can, of course, elect to use a different structure, perhaps a dimensional one, whereupon you gain speed but lose more flexibility.

Speed or flexibility, flexibility or speed? It’s often a difficult call because most of the time we need both. If you find yourself in this situation then some charming guys at Kognitio are amazingly, mind-bogglingly eager to talk to you because they believe that this is precisely what their product WX2 promises. You, on the other hand, are cynically aware that promises are easy and that if there were a simple solution, someone would have thought of it years ago.

In fact, they did. We’ve known for years that parallel processing and in-memory data processing are both mind-bogglingly fast; the problem has always been one of cost-effective implementation. WX2 is an RDBMS (Relational DataBase Management System) implemented as an MPP (Massively Parallel Processing) system built out of commodity servers, typically blades. These blades form the nodes in what is called a VDA (Virtual Data Appliance). (Well, you didn’t expect to get to grips with a whole new technology without having to learn a whole new abbreviation, did you?) Each node consists of one or more CPUs, a block of memory and some disks. The nodes don’t share resources, so this is a shared-nothing architecture.

How does it work? Well, imagine a VDA with eight nodes. The data for analysis is distributed evenly (and randomly) across the disks in all eight nodes. Data can be loaded and then queried in parallel but happily, the software handles all of this automatically, so developers working with Kognitio are not required to think in parallel. As soon as the load completes, the data is available for querying; there is no pause while indexes are created for the simple reason that WX2 doesn’t use any. Instead, it manages to perform all of the queries in memory.
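To make the eight-node picture concrete, here is a minimal Python sketch of the idea: rows are scattered randomly but evenly across the nodes, and a query is answered by scanning every node's partition in parallel with no index involved. This is an illustration only, not Kognitio's code; the names `distribute`, `scan_node` and `parallel_query` are invented for the example, and threads stand in for separate servers.

```python
import random
from concurrent.futures import ThreadPoolExecutor

NODE_COUNT = 8  # the eight-node VDA from the example in the text


def distribute(rows, node_count=NODE_COUNT, seed=0):
    """Scatter rows across nodes at random, keeping node sizes within one row of each other."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)  # random placement
    nodes = [[] for _ in range(node_count)]
    for position, row in enumerate(shuffled):
        nodes[position % node_count].append(row)  # even placement
    return nodes


def scan_node(partition, predicate):
    """Full scan of one node's in-memory partition; no index is consulted."""
    return [row for row in partition if predicate(row)]


def parallel_query(nodes, predicate):
    """Run the same scan on every node at once and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(scan_node, nodes, [predicate] * len(nodes))
        return [row for partial in partials for row in partial]


if __name__ == "__main__":
    nodes = distribute(range(1000))
    print([len(partition) for partition in nodes])
    print(len(parallel_query(nodes, lambda value: value % 2 == 0)))
```

The caller writes one ordinary query (the predicate) and never mentions the nodes, which is the sense in which the developer is "not required to think in parallel".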

If a simple query comes in that only touches data from one node, then that node handles the query. Now imagine that a query comes in that requires (as most are likely to) data from several nodes. The data is read from the appropriate nodes and copied to the memory on one of the nodes, which then processes the data and returns the answer set. As we said above, this isn’t a new idea either; everyone knows that RAM is much, much faster than disk. The problem has always been to find an effective algorithm that can balance the massive storage capacity of disks against the speed of RAM, ensuring that the data is available for ad hoc querying as rapidly as possible. The trick is not the overall idea, it is the implementation. The Devil, as they say, is in the detail.
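The two cases described above (a query served by a single node, and a query whose data is copied into one node's memory for processing) can be sketched roughly as follows. This is a toy model under stated assumptions rather than WX2's actual algorithm: the `(region, amount)` row layout, the `run_query` function and the choice of node 0 as the default processing node are all hypothetical.

```python
from collections import defaultdict


def run_query(nodes, wanted_regions, coordinator=0):
    """Total the amounts per region, returning (processing_node_id, totals).

    nodes maps a node id to that node's list of (region, amount) rows.
    """
    # Find which nodes hold any of the rows the query needs.
    touched = [
        node_id
        for node_id, rows in nodes.items()
        if any(region in wanted_regions for region, _ in rows)
    ]

    # A query that touches a single node is handled by that node;
    # otherwise the matching rows are copied to one processing node.
    worker = touched[0] if len(touched) == 1 else coordinator

    # "Copy" the relevant rows into the processing node's memory.
    working_set = []
    for node_id in touched:
        working_set.extend(row for row in nodes[node_id] if row[0] in wanted_regions)

    # Process the gathered rows entirely in memory.
    totals = defaultdict(int)
    for region, amount in working_set:
        totals[region] += amount
    return worker, dict(totals)


if __name__ == "__main__":
    nodes = {
        0: [("north", 10)],
        1: [("south", 5), ("north", 2)],
        2: [("east", 7)],
    }
    print(run_query(nodes, {"east"}))   # only node 2 holds "east" rows
    print(run_query(nodes, {"north"}))  # rows come from nodes 0 and 1
```

The hard part the article alludes to, deciding what to keep in RAM and what to leave on disk, is exactly what this sketch leaves out.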

In addition, the architecture that Kognitio has elected to use has a very desirable side-effect: scalability. The company claims, for example, that “the query performance of a 100-server WX2 system with 10TB of data will be the same as that of a 10-server system with 1TB of data.” To put that another way, if you have a 20-node system which performs well with 100 users, then a 40-node system will perform equally well with 200 users. If you want a third way of looking at this, you can simply add nodes to compensate for more data, more users, or to gain performance. The company claims that its architecture means that there is no measurable overhead as nodes are added, because the “WX’s fully parallel architecture produces true linear scalability.”
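The arithmetic behind the claim is simply that the load per node stays constant. The short Python check below works through the figures quoted above. It models only the idealised case the vendor describes, in which adding nodes costs nothing in coordination overhead; it is not a measurement, and the `per_node` helper is invented for the example.

```python
from fractions import Fraction


def per_node(total, node_count):
    """Share of the total (terabytes or users) that each node carries."""
    return Fraction(total, node_count)


# 10TB on 100 servers versus 1TB on 10 servers: 1/10 TB per server either way.
assert per_node(10, 100) == per_node(1, 10) == Fraction(1, 10)

# 100 users on 20 nodes versus 200 users on 40 nodes: 5 users per node either way.
assert per_node(100, 20) == per_node(200, 40) == 5

# Doubling the nodes for the same data halves each node's share,
# which is the "add nodes to gain performance" case.
assert per_node(10, 200) == per_node(10, 100) / 2
```

Linear scalability in this sense holds only as long as the per-node share is the sole thing that determines query time.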

There are, of course, already ways of achieving both speed and flexibility. We can, for example, create a relational data warehouse and a set of dimensional data marts. Kognitio argues that this is fine in some cases but that many companies find the solution too baroque. For a start, they need to employ developers for both relational and dimensional databases and in addition, this solution involves multiple copies of the analytical data, which makes auditing a nightmare.

And Kognitio, of course, isn’t the only company that is offering a novel implementation of data warehousing. Check out, for example, Netezza and DATAllegro [but remember that Kognitio is available as software-only – Ed].

Ultimately, all of these products break the "traditional" way in which data warehouses are built. Kognitio is aware that it can preach as much as it likes, but developers are always (and quite rightly) sceptical. So it has created testing facilities where it “will build you a data warehouse for free and let you analyse your data in days” – which you can find here. So, in this case at least, thinking outside the box doesn’t have to cost you anything but time.®
