Red rag, meet bull: The software resilience gamble
This topic really got you going
New research alert: You, the fine Reg readers, recently regaled us with the gory details of your application failures - and it ain't pretty. It turns out that a large majority of you find business is disrupted by app failure way too often.
Of the 1200+ readers who took part in the research, a whopping 84 per cent said their business suffered disruption caused by application failure at least once a quarter, with 33 per cent complaining of the same once a month and 24 per cent once a week.
But not every application failure is catastrophic, and it’s important to look at impact as well as frequency. The majority of failures simply result in a degree of user inconvenience. However, incidents with more serious consequences occur more frequently than most people might imagine. For example, one in five organisations confess to suffering tangible business damage from application failure on at least a quarterly basis.
Size of business has little impact on frequency, though different availability hotspots do exist. It was interesting to see that email availability is the bugbear for smaller businesses. That's not funny if, as at Freeform, email is a crucial lubricant for doing business.
So from a business perspective, it makes for dismal reading. No business would consciously sign up to this level of application failure.
Judging by the number and length of open comments we received from you, the extra work being generated in this area fosters a fair degree of frustration among IT professionals.
So how is this burden on the IT department being generated?
The first and obvious culprit is that 'stuff just fails occasionally'. Is resiliency, then, a well-worn topic within the software development lifecycle? Nope. It doesn't get a look-in in most organisations when scope and budgets are laid out. Essentially, 'insurance' (think of things like automatic failover) isn't given much thought until after things have fallen over.
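That 'insurance' needn't be exotic, either. Here's a minimal sketch in Python - the function names, retry count and back-off figures are our own invention, not anything from the research - of a caller that retries a primary service and fails over to a standby:

```python
import time

def call_with_failover(primary, fallback, retries=2, delay=0.5):
    """Try the primary service a few times; on repeated failure, use the fallback."""
    for _ in range(retries):
        try:
            return primary()
        except Exception:
            time.sleep(delay)  # brief pause before retrying
    return fallback()  # primary looks dead: fail over to the standby

# Usage: a primary that is down, with a healthy standby.
def primary():
    raise RuntimeError("primary down")

def standby():
    return "served by standby"

print(call_with_failover(primary, standby, delay=0.0))  # → served by standby
```

A dozen lines, in other words - cheap to write up front, and a lot cheaper than explaining to the business why the app was down all morning.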
What else is adding to the frustration? We picked up a lot of anecdotal evidence that operational IT feels largely ignored during the software development lifecycle. So it ends up managing applications which are not ‘designed for operations’ and represent a risk, both to the company and to their own workload.
At the coal face, we got the impression that the SLAs in play in many organisations aren't worth the PDF they're written on. Neither is the monitoring. Why else would 76 per cent of respondents tell us they don't get enough warning of problems?
On the upside, the data showed very clearly that exposure to failure can be minimised through a combination of good process and appropriate technology providing system resiliency and/or rapid recovery. Three things stand out: a structured approach to defining and specifying application software projects, with input from the right people (ie you lot); consideration of resiliency and availability early in the project lifecycle; and explicit investment in appropriate fault-tolerance and recovery solutions. Together, these can significantly reduce the frequency of disruption due to system failure.
Sadly, there’s a significant gap between where most organisations are and their ideal position. The evidence, however, points to a need for some practical but fundamental changes which IT can drive to help businesses take the gamble out of software resilience.
Get your mitts on the full report right here.
As usual, feedback very welcome. ®
Awww, diddums, didda nasty net admin delete all your pr0n again?
The management wants the problem solved at minimum cost. However, for a little more, you could build the optimal solution. Optimal solution is denied.
2-3 years later, it's time to upgrade. Poor solution requires twice as much work as the optimal solution to upgrade, but for a little more, it could be upgraded to the optimal solution. Optimal solution is denied.
3-4 years later, it's time to upgrade again. Poor solution requires 4x as much work as the optimal solution to upgrade, but for a little more, it could be upgraded to the optimal solution. Optimal solution is denied.
See the pattern here? Crap-code snowballs over time, until the point when it becomes totally unworkable, and the management throws up its hands and says "we need to spend a huge amount of time and money to get XYZ software and train our people, because we can't make our software work anymore".
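The snowball is easy to put numbers on. A back-of-envelope sketch in Python, using invented effort figures that follow the commenter's 2x-then-4x pattern:

```python
def cumulative_cost(initial, upgrade_costs):
    """Total spend: the initial build plus each upgrade cycle."""
    return initial + sum(upgrade_costs)

# Hypothetical units of effort. The poor solution is slightly cheaper
# up front, but each upgrade costs 2x, then 4x the optimal solution's
# upgrade effort - exactly the pattern described above.
optimal = cumulative_cost(initial=12, upgrade_costs=[10, 10])
poor = cumulative_cost(initial=10, upgrade_costs=[20, 40])

print(optimal, poor)  # → 32 70
```

On these (made-up but representative) numbers, the "little more" up front has paid for itself several times over by the second upgrade.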
If you do things right the first time, the code lasts a loooooooooooooong time. And you get what you pay for. Anything less, and you pay more for it over time.
Every car ever built is made from bits that are "about the right size" - it's called tolerancing. You do it to every single bit, then work out whether or not it's likely that all the bits will still fit together. The Japanese dominated by working out which expensive bits didn't need close tolerances and which cheap bits did.
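For the curious, the "will all the bits still fit together" calculation the commenter alludes to is a tolerance stack-up, and can be sketched in a few lines of Python. The part tolerances here are invented purely for illustration:

```python
import math

def stackup(tolerances):
    """Worst-case vs statistical (root-sum-square) tolerance stack-up."""
    worst_case = sum(tolerances)  # every part at its limit simultaneously
    rss = math.sqrt(sum(t * t for t in tolerances))  # likely spread if
    return worst_case, rss                           # variation is independent

# Hypothetical +/- tolerances (mm) for four parts in an assembly:
worst, rss = stackup([0.1, 0.05, 0.2, 0.05])
print(round(worst, 3))  # → 0.4
print(round(rss, 3))    # → 0.235
```

The gap between the two numbers is where the money is: the statistical figure says you can often loosen (ie cheapen) individual tolerances and still have assemblies fit - which is roughly the trade the commenter credits the Japanese with getting right.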