IS there anything sexier than hyper-converged systems?
First Map Reduce, then Hive/Impala, but what’s next?
Comment Everyone likes hyper-converged systems. They are cool, dense, fast, energy-saving, agile, scalable, manageable, easy-to-use, and whatever else you want. But you know what? They have their limits too.
They are good for average workloads, a large range of workloads indeed, but not for those workloads that need huge amounts of a specific resource to the detriment of others, such as Big Data, for example.
Data grows (steadily ... and exponentially) and nothing gets thrown away. Since data adds up, the concept of the “data lake” has taken shape. Even systems created for big data are starting to sense this problem and system architects are beginning to think differently about storage.
I’m going to look at Hadoop because it gives a good example of a hyper-converged infrastructure.
Today, most Hadoop clusters are built on top of an HDFS (Hadoop Distributed File System). HDFS characteristics make this filesystem much cheaper, reliable, and more scalable than many other solutions but, at the same time, it’s limited by the cluster design itself.
CPU/RAM/network/capacity ratios are important to design the best balanced systems, but things change so rapidly that what you have implemented today could become very inefficient tomorrow. I know that we are living in a very commodity-hardware-world right now, but despite the short lifespan of modern hardware I’m not convinced that enterprises are willing to change their infrastructures (and spend boatloads of money) very often.
Look at what's happening. Two years ago it was all about Map Reduce, then it was all about Hive/Impala and the like, now it’s all about Spark and other in-memory technologies. What’s next?
Whatever, my first question is: “Can they run on the same cluster?”
Yes, of course, because the underlying infrastructure, now Hadoop 2.6, has evolved as well.
But the real question is: “Can they run with the same level of efficiency on the same two-year-old cluster?” Mmmm, probably not.
And then another question arises: “Can you update that cluster to meet the new requirements?”
Well, this is a tough one to answer. Capacity grows but you don’t normally need to process all the data at the same time while, on the other hand, applications, business needs, and workloads change very quickly, making it difficult to build a hyper-converged cluster and serve them all efficiently.
Things get even more complicated if the big data analytics cluster becomes an enterprise-wide utility. Classic Hadoop tools are not the only ones, and many departments in your organisation have different views and need to make different analyses on different data sets (which often come from the same raw data); it’s one of the advantages of a data lake.