Hyperconverged solutions can't live without flash

Change is in the air


In the world of hyperconverged virtualization, flash is important. It forms a big part of the hyperconvergence value proposition as vendors create distributed hybrid storage arrays from local resources.

But hyperconvergence is moving away from every node in the cluster having an identical storage/compute ratio, and this means the role of flash is changing.

As often happens in IT, exactly how hyperconverged solutions work is starting to evolve into something unfamiliar even before most of the industry has figured out the first iteration.

To understand where it is all going, let's pause to look at where we are...

Broadly defined, hyperconvergence consists of commodity servers clustered together to act as a single storage source, with virtual machine workloads running alongside the storage workloads as part of the cluster. Nearly always the storage inside the servers is partly, and sometimes entirely, flash.

Winds of change

While it is possible to create hyperconverged solutions without flash, it is generally considered impractical. Modern servers can cram a truly crazy amount of compute power into a single node.

I recently reviewed a very modest four-node hyperconverged cluster with 192 threads and 1TB of RAM across all four nodes, all of which fits into 4U. You can get more compute and way more RAM into those boxes than that if you try.

That is more than enough to run a good 150 quite demanding application workloads. There are eight drive slots per node, which makes 32 across the cluster. Even with 15K drives, the thought of trying to run 150 applications off a measly 32 spindles is depressing.
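To put rough numbers on that, here is a back-of-the-envelope sketch. The node, slot and workload counts come from the cluster described above; the figure of about 180 random IOPS per 15K drive is an assumed rule of thumb, not a measurement from the review.

```python
# Back-of-the-envelope spindle maths for the four-node cluster above.
NODES = 4
DRIVE_SLOTS_PER_NODE = 8
WORKLOADS = 150

# ASSUMPTION: rule-of-thumb random IOPS for a single 15K drive; real
# figures vary with the drive model and the access pattern.
ASSUMED_IOPS_PER_15K_DRIVE = 180

spindles = NODES * DRIVE_SLOTS_PER_NODE
raw_iops = spindles * ASSUMED_IOPS_PER_15K_DRIVE
iops_per_workload = raw_iops / WORKLOADS

print(f"{spindles} spindles, ~{raw_iops} raw IOPS, "
      f"~{iops_per_workload:.0f} IOPS per workload")
```

Under that assumption each workload gets a few dozen IOPS before any replication or RAID overhead is counted, which is thin for anything demanding.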

How long would that take to boot from cold and bring up all the virtual machines? Do we even have benchmarks that can measure time in geological ages?

It is thanks to this compute density that SSDs have become something of a necessity. The higher up the stack you go, the better the SSDs you can expect to find. The really expensive appliance-based hyperconverged solutions use PCI-E SSDs as well as SATA SSDs and 15K SAS drives to make a hybrid storage solution.

Hot blocks are migrated between the different tiers of storage based on algorithms proprietary to the hyperconvergence vendor. Write-intensive blocks, for example, might live on the PCI-E SSDs, as these have the highest write life. The SATA SSDs would have frequently read but infrequently written blocks, while the "cold" blocks get demoted to the magnetic disks.
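As a minimal sketch of the idea, the rule below sorts blocks into the three tiers just described. The real algorithms are proprietary, so the thresholds, the `BlockStats` fields and the `choose_tier` function are all hypothetical and exist only to illustrate the write-hot, read-hot and cold split.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PCIE_SSD = "PCI-E SSD"      # highest write life
    SATA_SSD = "SATA SSD"       # frequently read, infrequently written
    MAGNETIC = "15K SAS disk"   # cold blocks


@dataclass
class BlockStats:
    reads_per_hour: float
    writes_per_hour: float


# ASSUMPTION: illustrative thresholds only; vendors tune their own.
HOT_WRITE_THRESHOLD = 100.0
HOT_READ_THRESHOLD = 100.0


def choose_tier(stats: BlockStats) -> Tier:
    """Pick a storage tier for a block from its recent access counts."""
    if stats.writes_per_hour >= HOT_WRITE_THRESHOLD:
        return Tier.PCIE_SSD
    if stats.reads_per_hour >= HOT_READ_THRESHOLD:
        return Tier.SATA_SSD
    return Tier.MAGNETIC


print(choose_tier(BlockStats(reads_per_hour=5, writes_per_hour=500)).value)
print(choose_tier(BlockStats(reads_per_hour=900, writes_per_hour=2)).value)
print(choose_tier(BlockStats(reads_per_hour=1, writes_per_hour=0)).value)
```

A production tiering engine would also weigh block age, tier capacity and the cost of moving data, none of which is modelled here.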

Variations exist. Some solutions have only SATA/SAS SSDs instead of PCI-E. Some will not demote or promote individual blocks, choosing instead to move whole virtual machines to different tiers of storage.

Some vendors are even running trials of completely ridiculous (and drool-worthy) solutions involving memory channel storage as the "fast" tier and hot-swap-capable PCI-E attached NVMe drives. When paired with top-of-the-line CPUs and crammed to the gills with RAM, the resulting 4U four-node cluster is capable of defeating multiple racks’ worth of traditional server equipment.

The constant in today's conventional hyperconverged solutions, however, is that all the nodes in a cluster – from three to 64 and beyond – are the same: same server with the same CPUs, the same amount of RAM and the same load out of drives. This is changing.

Doubling up

In your typical hyperconvergence setup, data is replicated to two nodes: the node the workload is running on and one other. There can be more replicas, but let's stick with two to keep it simple.

If the node (or disk) the workload is running on fails, the node that has the replica will attempt to fire up the virtual machine in a typical virtualisation high-availability/failover fashion.

If the node containing the replica data has the resources to start that virtual machine it will do so and begin replicating a second copy elsewhere in the cluster.

If the node that contains the replica does not have enough resources then another node in the cluster will start up the virtual machine, streaming the data remotely from the node with the replica data. Eventually, the node running the virtual machine will build its own local copy.
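The failover decision described above can be sketched in a few lines. This is a simplified model rather than any vendor's scheduler: it treats free RAM as the only resource that matters, and the `Node`, `Placement` and `place_after_failure` names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_ram_gb: int


@dataclass
class Placement:
    run_on: str        # node that starts the virtual machine
    data_from: str     # node holding the surviving replica
    local_data: bool   # False means storage is streamed over the network


def place_after_failure(vm_ram_gb: int, replica_node: Node,
                        other_nodes: List[Node]) -> Optional[Placement]:
    """Decide where a VM restarts once its original node has failed.

    ASSUMPTION: free RAM stands in for "enough resources".
    """
    # Preferred case: the node holding the replica runs the VM itself.
    if replica_node.free_ram_gb >= vm_ram_gb:
        return Placement(replica_node.name, replica_node.name, True)
    # Otherwise another node runs it and reads from the replica remotely
    # until it has built its own local copy.
    for node in other_nodes:
        if node.free_ram_gb >= vm_ram_gb:
            return Placement(node.name, replica_node.name, False)
    return None  # nowhere in the cluster has room


others = [Node("node-c", 16), Node("node-d", 128)]
print(place_after_failure(32, Node("node-b", 64), others))
print(place_after_failure(32, Node("node-b", 8), others))
```

In the second call the replica holder is too full, so the virtual machine lands on node-d and streams its data from node-b.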

This means that the performance of a hyperconvergence cluster is bounded not only by the speed of the storage local to the node the virtual machine is running on, but also by the speed of its replica partner and of the network that binds them.
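One way to picture that bound is as the slowest of the three links. The sketch assumes a write is acknowledged only once both copies hold it, which is a simplification; the function name and the throughput figures are hypothetical.

```python
def sustained_write_mb_s(local_mb_s: float, replica_mb_s: float,
                         network_mb_s: float) -> float:
    """Upper bound on sustained write throughput for one replicated workload.

    ASSUMPTION: replication is synchronous, so the slowest of local
    storage, replica storage and the network sets the pace.
    """
    return min(local_mb_s, replica_mb_s, network_mb_s)


# Hypothetical figures: fast local flash, a slower replica partner and a
# 10GbE link (roughly 1,250 MB/s at line rate).
bound = sustained_write_mb_s(local_mb_s=2000, replica_mb_s=450,
                             network_mb_s=1250)
print(f"Sustained write bound: {bound} MB/s")
```

With these numbers the replica partner, not the local flash, is what the workload ends up waiting on.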

It also means that if workloads on one node in the cluster become more write-intensive, this can have knock-on effects on other nodes in the cluster, since they must now cope with additional replica traffic.

To compound this, workloads are increasingly designed to expect SSD-class storage performance. The role of flash in the data centre is changing. Instead of being merely an aid to consolidation – providing the ability to cram more workloads onto your beefier nodes – it is now a requirement for individual workloads.

The result is that hybrid is not enough for the next generation of hyperconverged solutions. Unfortunately, all-flash is an expensive option. Worse: workload demands keep changing.

Room for growth

One of hyperconvergence's key selling points is that it gives you the ability to grow as you need instead of over-buying from the start. This is rendered useless if customers' needs change part way through the lifecycle of a cluster and adding more of the existing node type won't meet requirements.

Enter adaptive hyperconvergence. Instead of every node in the cluster having to be the same, you can add "specialist" nodes if you know what you are doing. In an adaptive hyperconverged cluster, nodes can be anything.

Some are classic "balanced" hyperconverged nodes. Some are all storage. Some are all compute and still more are compute plus flash caching, all mediated by the hyperconverged software stack.

Consider for a moment a cluster of 32 nodes. Two nodes are MCS+NVMe "godlike" nodes where the really sexy workloads live. Four are storage-heavy nodes that have heaps of storage and a bunch of really fast PCI-E or NVMe storage. The other 26 nodes are balanced.

If set up properly, the four storage-heavy nodes could serve as the replica servers for the whole of the cluster. They don't need a lot of RAM or much in the way of CPU. They are really just a place where a whole lot of IOPS and storage space live and they spend almost their entire existence writing change blocks to their replica copies of the virtual machines running on the other nodes in the cluster.

On the rare occasion where a disk drops out or a server dies, the storage-heavy nodes are specced high enough to stream one server's worth of workloads back to the cluster while still absorbing their required writes.
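Whether the storage-heavy nodes really are specced high enough comes down to a simple headroom check. The sketch below uses the node counts from the 32-node example; every IOPS figure is made up for illustration, and `replica_tier_has_headroom` is a hypothetical helper rather than part of any product.

```python
def replica_tier_has_headroom(storage_nodes: int,
                              iops_per_storage_node: int,
                              replica_write_iops: int,
                              failed_node_read_iops: int) -> bool:
    """Can the replica tier absorb its writes and stream a failed node?"""
    capacity = storage_nodes * iops_per_storage_node
    demand = replica_write_iops + failed_node_read_iops
    return capacity >= demand


# Node counts from the example: two "godlike" plus 26 balanced nodes send
# replica writes to four storage-heavy nodes.
WRITING_NODES = 2 + 26
STORAGE_NODES = 4

# ASSUMPTION: invented figures, chosen only to make the sums easy to follow.
WRITE_IOPS_PER_NODE = 5_000
IOPS_PER_STORAGE_NODE = 50_000
FAILED_NODE_READ_IOPS = 20_000

ok = replica_tier_has_headroom(
    STORAGE_NODES,
    IOPS_PER_STORAGE_NODE,
    WRITING_NODES * WRITE_IOPS_PER_NODE,
    FAILED_NODE_READ_IOPS,
)
print("Replica tier has headroom:", ok)
```

Halve the number of storage-heavy nodes with the same figures and the check fails, which is the point at which the rest of the cluster starts to feel a failure.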

How much the cluster as a whole feels the impact depends on how the IOPS and network balance is handled. If you lost an entire godlike node, you might see those workloads operate more slowly for the period of the maintenance window as some or all are moved onto balanced nodes.

Depending on how truly IOPS-bound those workloads are, they may also see an impact when migrating back to the repaired node when it returns to service.

This is a fairly simple example but it is hopefully enough to convey that hyperconverged solutions don't need to be identical throughout the cluster.

Use with care

On the SMB and mid-market side this could lead to interesting combinations. A 32-node cluster with two all-flash replica servers would leave almost all of each local node's IOPS for the workloads running on that node.

Two all-flash replica servers could provide a destination for those replicas that is entirely able to absorb the traffic from 30 conventional servers and still stream an entire node's worth of workloads in the event of a failure.

For the mid-market and the enterprise it probably means more flash in more nodes. More to the point, it probably means more flash in different form factors: MCS for the ultra-low-latency stuff, PCI-E for high write life, NVMe for raw throughput and SAS/SATA for lower-tier workloads.

In turn, this will create pressure on hyperconvergence vendors to up their game regarding deduplication, compression and tiering algorithms. Flash is not cheap, and all the fabs put together can't produce anywhere near enough of it to meet global demand.

Not only do we need to get the most out of flash, we need to be careful that we are not using it when we don't have to. There is a heap of data in any company that is write once, read never. This never needs to pass through flash.

Better algorithms can help here but better awareness is also required. Ultimately, nothing will help so much as the ability to put an agent into individual virtual machines or talk to applications living in containers.

The key is automation. The goal is to be able to add more of a given resource (bulk storage, flash, RAM, CPU and so on) and see an improvement in the whole of the cluster.

If the hyperconvergence software can find out from the horse's mouth (the application) what the blocks being written are likely to be used for, we can make even better use of our flash resources and create more efficient and custom-tailored hyperconverged solutions.

The storage automation that is possible with hyperconvergence has barely begun to be explored. What the end result will be is anyone's guess.

For the time being – and until someone comes up with something better – flash looks to be an addiction hyperconvergence is not likely to cure. ®

Biting the hand that feeds IT © 1998–2020