Hey, software snobs: Hardware love can set your code free
The two go hand in hand when scaling data-crunching systems
Hitting the ceiling - and punching through it
What happens when you have reached the limit of a system in terms of scaling up? All your CPU and memory slots are full and your spinning disks have been replaced with SSDs or RAM? You need to look elsewhere.
A farmer with a heavy cart can only increase the size of a single horse up to a point before resorting to multiple horses. In the same way, at some point, we have to stop trying to scale up an existing single server system and scale out to include many nodes.
This may seem like an obvious choice and, yes, we are talking about NoSQL solutions like (but absolutely not exclusively) Hadoop and MapReduce. The real challenge here is not finding or building the nodes, it is making the software run in parallel.
Sometimes the algorithmic solution will parallelise very easily; in other cases a complete redesign of the algorithms is required which in turn means a total rewrite of the code.
As an example of the former, think about a simple summation of values. The summation operation has two key properties in that it is associative (meaning that the individual elements can be operated on in any order) and it is commutative (the operations themselves can be performed in any order).
An operation that fulfils these two criteria can be run in parallel with little change in a shared nothing environment such as Hadoop and MapReduce. So there may be millions of values to be summed but they can be split into subsets which can be summed. Then the totals from these subsets can be summed and so on.
Now think about creating a top 10 sales report; here the problem is very different. The initial data set can be split into smaller sets and distributed around a cluster of nodes. The distribution key will need to be worked out beforehand so that any data that needs to be aggregated ends up at the same node (customer identifier for example) then each node can work out the top 10 customers from its own data set.
So far, so parallel. However, at some point these "local Top 10s" must be brought together as a single task to decide the actual Top 10. This final operation cannot be done in parallel and whether you have 10 nodes or 4,000 you will be using a single node out of your cluster to do this final reduce step. And, of course, we haven’t even begun to think about complexities such as data skew.
Data mining can involve very complex algorithms to fully utilise a massively parallel processing system. An example of this is a collaborative filter (or recommendation engine), which requires a good knowledge of matrix manipulation and several Map and Reduce tasks chained together to perform the complete calculation in parallel. There are many good examples of this particular algorithm available on the internet, examples well worth researching if you want a detailed description.
Problems are relative, as are the solutions: improved running times, lower cost, trying to escape vendor lock-in, need for high-availability? The answer to these will depend.
No matter what, though, the days of just buying a faster server may be over for many of us and we must look at the hardware in conjunction with the software for a better future. ®
Mark Whitehorn is a consultant in databases, data analysis, data modeling, data warehousing and BI and holds chair of analytics at the University of Dundee where he teaches, conducts research and runs a masters programme in BI. Chris Hillman is principal data scientist in the international advanced analytics team at Teradata and studying part-time for a PhD in data science at the University of Dundee, applying big-data analytics to the data produced from experimentation into the human proteome.
Nowadays, the farmer first glues on some horns and sells it to the abbatoir as a cow.
There is a reason for Software Smugness
You haven't heard of the reverse boast "only by throwing software at it" because of a very simple fact: If I can get more performance out of the same hardware, by designing an O(N) algorithm to replace an O(N^2) I am being smart. Throwing more hardware at a problem when a better algorithmic solution exists is stupid.
I have seen people use weeks of wall-clock time on a 512 core segment of a big machine, simply because their code was bad. My colleague coded the thing properly in C++ and had the code running on his desktop and finishing in a few minutes (O(2^N) vs O(N log N) if I recall correctly). Only throwing hardware at a problem is often wrong. Thinking about better algorithms is never a bad idea.
Once you have really thought about the algorithmics, then you can start throwing more hardware at it (and once you do that, you must rethink the algorithmics again, especially when doing parallel stuff). So in our massive image processing stuff (Gpixel and Tpixel), we first minimize communication and disk-access overhead, and then move to SSD or Fusion-IO stuff.
"The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry. "
— Henry Petroski
What a lot of waffle
that just boils down to:
1. Faster hardware can make stuff faster, to a point.
2. You might need to think about the algorithms you use.
Well, thanks for that wonderful insight.
As someone more or less said above, no point writing a load of code to parallelise a really inefficient algorithm and then chucking lots of hardware at it if you could replace it with a non-parallel but much more efficient algorithm.
Re: There is a reason for Software Smugness
Thinking about better algorithms is never a bad idea.
Of course it is, what is really dumb is presenting that kind of moronic statement as an axiom when in reality it is merely justification for one's own vanity. Smart people use dumb methods on occasion simply because of an appreciation of real world factors and having the common sense to employ some form of cost-benefit analysis. How many chunks of code are only ever used a small number of times, possibly even once? Think about "code" in the broadest sense of the word - it could be some one-off data manipulation job or even a simple for loop at the shell prompt.
Consider a job that we know in advance is a one-off. You use your "better" algorithms and take two hours to devise a solution that does the task in two seconds. I spend two minutes knocking something up that does the same job in another two minutes. Your solution may be "better" purely in a vanity sense but is that really the best use of resources? Remember your assertion: "Thinking about better algorithms is never a bad idea."