Sysadmins: Poor capacity planning is not our fault
Let us explain why things are so crap
Our latest reader survey was a little different to usual. Normally we research new stuff like the latest hot technologies and ideas. On this occasion, though, we looked at a discipline that's been around for decades – capacity planning.
The aim was actually to investigate how well existing processes, techniques and tools in this area were coping with some of the latest dynamics and growth-rates that readers have told us about in other studies. What we discovered, however, was that many don't even have what we might arguably describe as ‘the basics’ properly covered. Indeed, the most common approaches to capacity planning remain overprovisioning and "winging it", ie, relying on instinct/vigilance and the odd spreadsheet.
Of course some argue that the informal, ad hoc approach is perfectly adequate, and that getting too ‘procedural’ is more trouble than it's worth. Fair enough if you have a modest and relatively slow-moving IT environment, and a small IT team in which everyone always knows what everyone else is up to. But with nearly 60 per cent of respondents reporting downtime and/or service degradation as a result of capacity-related issues, and around 50 per cent talking about costly and disruptive emergency procurements when resource limits are unexpectedly reached, the approach being taken by most clearly isn't working.
As with all surveys, however, the numbers only tell part of the story. We can get a little more insight into the real-world practicalities if we look at how readers express the issues and challenges in their own words. Something that comes through very strongly here is a suggestion that it's often not the fault of sysadmins when capacity problems arise. When we asked for examples of preventable situations, thoughtlessness and lack of discipline on the part of architects, developers and others in IT featured prominently:
“A decision to centralize most of our servers, without first looking at what network changes would need to happen to make it work.”
“Developers who don't think they need to do performance testing, just throw the biggest VM at it and forget about it.”
“Lack of timely communication from application teams about upcoming projects.”
“Team was complaining that the job scheduling system was breaking their job. Their own logs stated they had used all 100GB of their shared storage, and 3 escalation teams never bothered to check.”
“Snapshots that consume the entire storage a VM is hosted on, and not just once.”
“Oh this is just a test system for one user...”
But while it's easy to point the finger within IT, the reality is that capacity planning and management is actually a business issue when it comes right down to it. Sure, most tell us that it's IT staff that get it in the neck when capacity-related incidents occur, but some other ‘preventable situations’ described by readers suggest that bad business outcomes can often be traced back to thoughtlessness and lack of communication on the part of business people themselves:
“The Business Intelligence folk not understanding data cubes, and enabling them for the whole data warehouse 'to see what would happen'.”
“The same community of users running multiple resource-intensive activities at the same time. In some cases it's the same person, who then asks 'what's wrong with the network, server and/or storage'.”
“Failure to delete data for expunged users, many over 3 years gone from the premises. But, hey, that's irrelevant to the fact that you now don't have space for your active project, and the tape backup keeps stopping mid-job because you are backing up so much expunged user data.”
“CEO video being a 'must see' on a Citrix platform grinding the entire business to a halt because the rendering hadn't been set to client side.”
Some readers alluded to poor training within the business, and certainly raising awareness and trying to get people to help themselves will be useful. Experience in other areas such as security and data protection suggests, however, that you have to be realistic about the ability of the average user to appreciate the practicalities and become motivated to act appropriately.
Whichever way you cut it, no matter how much some in both IT and the business are in denial, it all comes back to putting robust policies and processes in place, and making sure you are spending the right amount of money on the right things. In larger and more complex environments especially, this includes good planning, analytics and monitoring tools so IT teams can maintain the visibility and insight needed to manage resources effectively.
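For a flavour of what even very basic analytics can offer over instinct and the odd spreadsheet, here is a minimal sketch in Python. It is purely illustrative and not something taken from the survey: the function name `days_until_full`, the one-sample-per-day assumption and the sample figures are all invented for the example. It fits a straight line to recent storage usage and extrapolates to the day the volume would fill up, on the assumption that growth stays roughly linear.

```python
from typing import Optional, Sequence


def days_until_full(used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until a volume fills, from daily usage samples.

    Hypothetical helper: assumes one sample per day, oldest first, and
    roughly linear growth. Returns None if usage is flat or shrinking.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")

    # Ordinary least-squares fit of usage against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(used_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_gb))
    slope = sxy / sxx  # growth in GB per day

    if slope <= 0:
        return None  # no growth trend, so no projected exhaustion date

    # Extrapolate from the fitted value for the most recent day.
    fitted_latest = mean_y + slope * ((n - 1) - mean_x)
    return max(0.0, (capacity_gb - fitted_latest) / slope)


# Invented sample: a 100GB share growing by 2GB a day.
print(days_until_full([70, 72, 74, 76, 78], 100))  # prints 11.0
```

A straight-line fit is about the crudest model available and will mislead when growth is bursty or seasonal, but a projected exhaustion date is usually an easier thing to put in front of a budget holder than a raw utilisation percentage.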
And this brings us to the final aspect of capacity management that we'll touch on here, that of senior air cover. Without this, it is extremely hard to both instil discipline and secure funding for infrastructure and tooling. But getting execs to even listen, let alone understand what matters and provide the necessary support, comes through strongly as a common frustration:
“It's all about cost - no capacity management or appreciation.”
“Repeatedly informing management of system running out of resources until it eventually has exhausted resources resulting in panic procurement.”
“Managers who think the cloud is a means to reduce cost over co-lo, without actually thinking it through.”
“Management, business and 3rd party provider taking 6 months to approve and implement additional storage so production POS went down.”
That last comment underlines the notion that it really is a mistake to regard capacity planning and management as purely an ‘IT thing’ or simply a chore for sysadmins to take care of. With IT systems being so fundamental to so many processes and functions across most organisations today, it really needs to be thought of as an aspect of business risk management.
If you are interested in more ammunition to get senior management to take the issues seriously, then you'll find some compelling and thought-provoking stats in our full research summary, which can be downloaded from here [PDF]. ®