Hey, software snobs: Hardware love can set your code free
The two go hand in hand when scaling data-crunching systems
Comment In computing there are many, many different ways to run down other people’s work, not the least of which is: “OK, so they removed the bottleneck, but only by throwing faster hardware at it.”
The implication is that tackling an issue just with software is intrinsically better. When did you ever hear anyone say: “OK, so they removed the bottleneck, but only by better coding?"
The truth is that solving computing problems always involves both hardware and software; the trick is not to look for specific kinds of solution, but to find the most effective one. Of course "effective" in analytical terms can be defined in a host of different ways – cost, speed of implementation, reliability and so on.
And to make matters more complex the most effective solution to a given problem will vary with time. For example when Relational Online Analytical Processing (ROLAP) was all the rage, many star schemas were usually woefully poorly indexed. So the software fix of applying sensible indices was very effective – and much better than merely throwing hardware at the problem. These days, a more effective solution might be to chuck in a solid-state drive (SSD).
There can be no doubt that SSD technology will speed up the I/O of a system, many times over in some cases, and can be spectacularly better than spinning disk in terms of random reads and writes. And as SSDs continue to plummet in price and get faster, we expect to see them used more and more to replace delicate software hand-tuning such as indices, calculated redundant columns, horizontally and/or vertically splitting of tables and so on.
And, to switch to another valid measure of efficiency, we would argue that SSDs generally require less maintenance than software and fewer design tweaks like these and are therefore better.
This is not to suggest you should abandon software/design changes and use SSDs in all analytical systems, but it does illustrate the point that “throwing hardware” at a problem can be the correct solution.
Quick wins for accelerating the performance of software can be found by increasing the speed and capacity of the system memory. True, it’s volatile, so if the plug is pulled you lose all that lovely data, but analytical systems work almost exclusively on copies of the data. Memory capacity continues to rise as prices continue to drop and systems with many gigabytes of main memory are now common.
On the other hand, the volume of data continues to grow exponentially and we know that a great deal of data is accessed only rarely, so memory, while very useful, is never going to be the only answer.
For years we have got away with relying on CPUs acquiring ever-increasing numbers of transistors (just ask Gordon Moore) and becoming faster and faster. Recently however we have hit a hard limit in the speed of CPU cores and are resorting to using more of them to allow parallel processing. Resourceful types are finding good use cases for GPUs, chips architected originally for graphics processing, as these are designed to perform mathematical operations at high speed across many cores.
Hitting the ceiling - and punching through it
What happens when you have reached the limit of a system in terms of scaling up? All your CPU and memory slots are full and your spinning disks have been replaced with SSDs or RAM? You need to look elsewhere.
A farmer with a heavy cart can only increase the size of a single horse up to a point before resorting to multiple horses. In the same way, at some point, we have to stop trying to scale up an existing single server system and scale out to include many nodes.
This may seem like an obvious choice and, yes, we are talking about NoSQL solutions like (but absolutely not exclusively) Hadoop and MapReduce. The real challenge here is not finding or building the nodes, it is making the software run in parallel.
Sometimes the algorithmic solution will parallelise very easily; in other cases a complete redesign of the algorithms is required which in turn means a total rewrite of the code.
As an example of the former, think about a simple summation of values. The summation operation has two key properties in that it is associative (meaning that the individual elements can be operated on in any order) and it is commutative (the operations themselves can be performed in any order).
An operation that fulfils these two criteria can be run in parallel with little change in a shared nothing environment such as Hadoop and MapReduce. So there may be millions of values to be summed but they can be split into subsets which can be summed. Then the totals from these subsets can be summed and so on.
Now think about creating a top 10 sales report; here the problem is very different. The initial data set can be split into smaller sets and distributed around a cluster of nodes. The distribution key will need to be worked out beforehand so that any data that needs to be aggregated ends up at the same node (customer identifier for example) then each node can work out the top 10 customers from its own data set.
So far, so parallel. However, at some point these "local Top 10s" must be brought together as a single task to decide the actual Top 10. This final operation cannot be done in parallel and whether you have 10 nodes or 4,000 you will be using a single node out of your cluster to do this final reduce step. And, of course, we haven’t even begun to think about complexities such as data skew.
Data mining can involve very complex algorithms to fully utilise a massively parallel processing system. An example of this is a collaborative filter (or recommendation engine), which requires a good knowledge of matrix manipulation and several Map and Reduce tasks chained together to perform the complete calculation in parallel. There are many good examples of this particular algorithm available on the internet, examples well worth researching if you want a detailed description.
Problems are relative, as are the solutions: improved running times, lower cost, trying to escape vendor lock-in, need for high-availability? The answer to these will depend.
No matter what, though, the days of just buying a faster server may be over for many of us and we must look at the hardware in conjunction with the software for a better future. ®
Mark Whitehorn is a consultant in databases, data analysis, data modeling, data warehousing and BI and holds chair of analytics at the University of Dundee where he teaches, conducts research and runs a masters programme in BI. Chris Hillman is principal data scientist in the international advanced analytics team at Teradata and studying part-time for a PhD in data science at the University of Dundee, applying big-data analytics to the data produced from experimentation into the human proteome.