Hyperconvergence: Designing for failure
I, for one, welcome your ship of FAIL
Hyperconvergence is one of those relatively new names for something that many of us have been doing for years: consolidating sprawling infrastructures into tight, largely virtualized setups that vastly reduce the number of devices one has to manage (not to mention the number of things to spend maintenance fees on, and the number of things that can go wrong).
But are all infrastructures suited to hyperconvergence? What size organisation should be looking to do it? And how do you design it to ensure it's as robust and resilient as you can achieve at an acceptable cost?
How big do I need to be?
The easy answer is that as long as you're not minuscule, you can probably benefit from hyperconvergence.
Why not the real minnows? Imagine you're a small company with a typical setup: a fileserver, a mail server and maybe an application/database server – all of which are individual physical boxes. If one of your three machines goes "pop" you lose one of your functions.
If you were to virtualize these functions onto a single, reasonably chunky physical host running your hypervisor of choice then a physical fault would cause all three machines to die. The only way to be resilient would be to have dual chunky servers and resilient storage arrays, and to buy the resilience option for the virtualization layer. This would have a price premium that almost certainly wouldn't hold water when compared to the cost of the original all-physical approach.
As long as you're big enough to reasonably afford even an entry-level virtual setup with resilient storage, you're most definitely a candidate for a hyperconverged approach.
Are some infrastructures better suited than others?
In a word: yes. However, the majority are pretty well suited, because the main criterion is commonality of platforms. A collection of Mac, PC, HPE, SPARC and Power kit is pretty tricky to bring under a single virtualized roof, after all, and in recent years I've come across a surprising number of heterogeneous infrastructures (particularly where Power and Intel come together with a modest scattering of SPARC).
That said, there's nothing preventing you from converging (say) all the Intel stuff into a common virtual platform. And regardless of the architecture you should be able to consolidate pretty easily on common storage (iSCSI is iSCSI, and Fibre Channel is Fibre Channel) and of course a common network (you can't tell me that there's much in the network that can't hang off a Gig port of a switch).
Designing for failure
So having established that it's a good thing to do, how do you construct it to maximise resilience? By following some fairly basic steps:
1. Glass half empty
I sometimes worry that my colleagues see me as a bit of a harbinger of doom. They'll show me some funky new technology or system and once I've spent the obligatory couple of minutes drooling over its flashing lights I'll immediately be asking them: “What happens if it blows a power supply?”, or “Have you tested the failover?”, or “Where's the power patch diagram – is everything independently dual-powered?”. When you're designing an infrastructure you need to consider every conceivable failure condition, including the wacky ones, because it's the weird esoteric stuff that's hardest to figure out unless you've seen it before.
2. Don't have one of anything if you can help it
If you're hyperconverging your world, you must eliminate single points of failure. A single physical host for your virtual infrastructure is a non-starter, for instance, and why would you have a single firewall? (Hint: check out item 4, below). You may be able to justify a single storage array, but only if your single premises is suitably protected against disaster and it's dual-powered and has no single component that can cause a failure on its own.
3. Dual-connect everything you can
No server should ever be designed with a single LAN connection. Enterprise-class switches are competitively priced, so even if you can only afford modest brand name stackables there's no excuse for not having LACP-trunked multiple interconnects between the modules.
Augment your server's built-in Ethernet ports with a secondary card (there's no point dual-connecting a pair of motherboard-resident LAN ports if they share a controller) and connect them to different switches. If you can use LACP between the server and the switches then great, but at the very least you should use the NIC vendor's teaming software to provide resilience.
Sometimes you can't physically dual-connect every device: you might have a firewall that only has a single power supply, for example. In those cases, have dual devices in a cluster and connect each to a separate power supply. And you do have N+1 UPS connectivity to the power bars, don't you?
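Items 2 and 3 boil down to a question you can ask of any wiring diagram: does the loss of any one device, or any one cable, cut a server off from where it needs to get to? Here is a minimal Python sketch of that check. The topology, the device names and the function names are all made up for illustration, and it only models connectivity; power feeds, shared controllers and the like would need adding as extra devices and links.

```python
from collections import deque

# Hypothetical topology: each pair is one physical link between two devices.
LINKS = [
    ("server", "switch-a"),
    ("server", "switch-b"),
    ("switch-a", "switch-b"),
    ("switch-a", "firewall"),
    ("switch-b", "firewall"),
    ("firewall", "internet"),
]


def reachable(links, start, goal, dead_node=None, dead_link=None):
    """Breadth-first search that ignores one failed device or one failed link."""
    if dead_node in (start, goal):
        return False
    neighbours = {}
    for index, (a, b) in enumerate(links):
        # Skip the failed link, and any link attached to the failed device.
        if index == dead_link or dead_node in (a, b):
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start, goal):
    """Return the devices and links whose individual loss cuts start off from goal."""
    devices = sorted({d for link in links for d in link} - {start, goal})
    bad_devices = [d for d in devices
                   if not reachable(links, start, goal, dead_node=d)]
    bad_links = [links[i] for i in range(len(links))
                 if not reachable(links, start, goal, dead_link=i)]
    return bad_devices, bad_links


if __name__ == "__main__":
    bad_devices, bad_links = single_points_of_failure(LINKS, "server", "internet")
    print("Single-device failures that cut the path:", bad_devices)
    print("Single-link failures that cut the path:", bad_links)
```

In this made-up example the dual-connected server and switches pass, while the lone firewall and its single uplink are flagged. A check like this only tells you where to look; it is no substitute for the physical testing described in item 5.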
4. Establish the value of the risk
I qualified my last two headings: you shouldn't have just one of anything if you can help it and dual-connect everything you can. Unless you're very big you're not made of money, and so there is a line that must be drawn somewhere (and it's usually the bean-counters that very kindly draw it for you).
Where you have a single point of failure, calculate the risk – which may be financial in terms of inability to trade, but equally it may be reputational in terms of you looking stupid if your Web site were to be offline for days. Put the risk calculations to the board and the bean-counters, and get them to make some grown-up decisions. Oh, and document the decisions they hand down: you may well need to remind them (=cover your backside) when one of the single devices dies.
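As a rough illustration of the financial side of that calculation, here is a small Python sketch that compares the expected yearly loss from a single device failing with the yearly cost of doubling it up. Every figure, the function names and the `margin` parameter are assumptions invented for the example; real numbers have to come from your own business, and reputational damage still needs a human judgment on top.

```python
def expected_annual_loss(years_between_failures, outage_hours, loss_per_hour):
    """Average yearly loss if the single item fails once every N years."""
    return outage_hours * loss_per_hour / years_between_failures


def can_accept_single_item(annual_loss, annual_cost_of_doubling, margin=2.0):
    """True when doubling up costs at least `margin` times the expected loss.

    `margin` is a hypothetical stand-in for "far outweighed"; pick your own.
    """
    return annual_cost_of_doubling >= margin * annual_loss


if __name__ == "__main__":
    # Hypothetical single firewall: fails once in 5 years, 48 hours to replace,
    # and the business loses 2,000 per hour while it is down.
    firewall_loss = expected_annual_loss(5, 48, 2000)
    print(f"Firewall expected loss per year: {firewall_loss:,.0f}")
    print("Accept a single firewall?",
          can_accept_single_item(firewall_loss, annual_cost_of_doubling=4000))

    # Hypothetical test-lab switch: fails once in 10 years, 8 hours of outage,
    # costing 50 per hour in lost engineer time.
    lab_loss = expected_annual_loss(10, 8, 50)
    print(f"Lab switch expected loss per year: {lab_loss:,.0f}")
    print("Accept a single lab switch?",
          can_accept_single_item(lab_loss, annual_cost_of_doubling=1500))
```

With these invented figures the firewall is worth doubling up and the lab switch is not. Whatever the numbers say, the output is something to put in front of the board, not a decision in itself.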
5. Test it to death
Never, ever scrimp on testing, and test the real failure wherever you can. If you want to test for the physical loss of a connection, physically break the connection instead of trying to simulate it (shutting down a port from the router's command line will sometimes cause different behaviour from a physical break). Write down what you see clearly and in a logical way so that you can easily look things up in the event of a problem.
Sometimes it may not be possible to carry out a physical test (I've had storage vendors advise me in the past against pulling a working disk from an array to validate the RAID setup, for instance, and you really don't want to be red-buttoning a big blade-based chassis server for test purposes). In that case you need to research the likely effect thoroughly, even asking the vendor for advice and/or to let you talk to other customers who've had failures.
Most installations lend themselves to hyperconvergence in some way. But whatever the size of the installation, the primary design rule is very simple:
Design the infrastructure end-to-end with at least two of everything, and at least two connections between each pair of everythings, except where the financial or reputational penalty of a single item failing is far outweighed by the cost of doubling it up.