CIOs, what does your nightmare before Christmas look like?
Graveyards are full of IT pros once thought irreplaceable
CIO Manifesto We gathered 14 of the UK’s finest IT leaders in a secure bunker (elegant room in the Soho Hotel -Ed.) for the last Register Round Table of 2015 to hear their tales of when good IT goes bad.
The short version is the thing they fear most is you, dear reader, your screw-ups, your documentation, your thefts, your dodgy code, your hysterical demands for even more money... as well as the mess you leave behind when you quit.
An early example of leaving chaos in your wake was the highly specialist programmer who had built a critical system. Being a good little developer, he had not only commented the code, but provided full documentation.
Yes, before you ask, the variable names were in Polish as well, else it would have been silly.
Apparently the documentation had been weighed rather than read.
No single point of failure… and other jokes
We met in the shadow of Telecity going titsup, taking out VOIP, hosting and Amazon Web Services for a large bunch of customers. The cloud behind the silver lining is that Amazon or any other cloud vendor can be as fault tolerant, distributed and well supported as you like, but if a service like Akamai or Cloudflare were to die, you still stop.
That’s not a single point of failure in the classical sense of a standalone database server but it’s really hard to manage unless you go for full cloud agnosticism, which pushes up costs of development and delivery times. This is hard to justify when their failure rate is so low, so the irony is that the reliability of the content delivery networks means fewer businesses work out what to do if they fail.
Oh, and no one seems to test their mission-critical data centre properly, because it’s mission critical. Our IT execs shared a good laugh about the idea that any CTO would really “see what happens if this fails” if he had any doubt that the power/aircon/network would actually fail over, since crashing it would be a career-changing event. So they just over-specify where they can and cross their fingers.
This means that you pay twice for some things and get half the coverage for other vulnerabilities.
It doesn’t mean no testing happens, as one anonymous attendee explained. They wanted to test the diesel backup generators that I have mentioned with fear in the past. The story would not be as good if they had failed, because the vents were hidden in attractive shrubbery. So attractive, in fact, that the company decided to move the executive car park amongst them. Industrial grade exhausts make even the most doctored diesel Volkswagen look like a fag (cigarette, to our American readers). The paint was blasted off the front of the BMWs.
At this point I have to emphasise that we operate the Round Tables under a strict Chatham House rule so names have been omitted to protect the guilty. This allowed another IT exec to share how his previous boss had a serious bromance with IBM.
Big Blue is, of course, a firm with luminous levels of integrity, so his behaviour was no doubt wholly based upon an admiration for their quality and value. The firm had reached a point where after three years on a cycle of development and deployment, it was time to take stock and look at what to do next and which resources would be needed.
Much time was spent on working out business opportunities and needs and how it would all fit together and then their boss came back with the good news that he’d met up with his IBM chums and ordered 1.5 million beer tokens' worth of hardware – which he was sure would come in useful at some point.
Ten of the twelve IT staff left within weeks
So when our IT execs talked of key man risk, they weren’t just talking of the holes left by the departure of a BOFH, but the unbounded harm that one blundering individual can cause. The interesting consensus was the way crap senior people survive.
It appears that board members can get away with pretty much any screwup, once, or if it has been a while since the last one. Partly this is because, although Reg readers mostly see tech disasters, the rest of the business isn’t any more competent, so what seems tragic in IT often gets lost in the noise.
It ought not to shock you that Freeform Dynamics research shows a strong correlation between major failures in IT and other parts of the business. Starve any part of the business of resources, employ dodgy managers or try to pretend that the web doesn’t exist, as we heard last month, and the cracks will be all over the place.
Bog standard key man risk is increasing, because of three main factors. Firstly, in the last recession we had to prune back ‘spare’ staff, having to make tough decisions on who to keep and lose and so ‘redundancy’ got used in the cost cutting rather than the reliability sense.
To make key man risk worse, this cutting correlates with allowing pay to fall behind the market, doubling up on the risk that your critical people are actively looking to quit, even if they don’t fall under a bus. HR departments fixate on “industry averages”, which is great for all IT execs who want teams that half their competitors don’t want.
The second factor is the flipside of the great wave of startups, disruptors and new cool firms: they’ve mostly grown without any thought at all about the effect of critical staff quitting, or even working out who might be a fatal loss. Even when it is spotted, it is hard to mitigate, though our lone DevOps-driven CTO eloquently explained how having everyone able to do each other's jobs would solve this. You can believe as much of that as you like.
The final factor is that we’re in a time of wonderful new languages, frameworks, open source projects and technologies so new and cool we struggle to even pronounce their names correctly. One of the CIOs whined about good London Java devs costing £98k, but at least they are as common as muck. Some skills are so rare that everyone who has them knows each other, which is itself another risk. That’s probably an acceptable risk with 30 people charging hard at a single goal. At 150, you are in a different place.
The days when you could screw over your IT staff and think that chucking them out ended the downside are behind us. There are Meetup.com groups, contractor forums and social media by the tonne (or ton - you choose), which means that your reputation will haunt you when trying to get good people at the price HR will let you pay.
It’s not just staff jumping ship that causes fear and loathing in the data centre. Apparently these damned Eurocrats are forcing employers to give IT pros holidays – and the reason one of our tech leaders isn’t at one VOIP firm any more is that it was killed when the only BOFH who really understood it was so far away that by the time he’d managed to get back, the firm had haemorrhaged too many customers to survive.
Even if you don’t have individuals whose loss will cause disaster, there is a reason airline pilots eat different dishes on flights, as one of our ITDMs shared: he had a whole six-man team get spectacular food poisoning, which took them out for three weeks. They were specialist enough that he couldn’t even try to get contractors in, and the whole function shut down until they got well.