You deleted the customer. What now? Human error - deal with it
To err is human, to double err is career limiting
Everyone I speak to about system security seems to panic about malware, cloud failure, system crashes and bad patches. But the biggest threat isn’t good or bad code, or systems that may or may not fail. It’s people. What we might call “liveware” errors range from the mundane to the catastrophic, and they happen all the time at all levels of business.
We have all had that pit-of-the-stomach feeling when we hit the wrong key or pull the wrong drive or cable. One of the more mundane examples I have experienced was a secretary trying to delete an old file but accidentally nuking the whole client folder. Luckily, this was Novell NetWare, so a quick use of “salvage” and everything was back to normal – no tape restore needed.
Then there was the small business where a staffer accidentally pressed the delete key for files held on an Iomega Zip disk and then clicked 'yes' to confirm. Unfortunately, the Recycle Bin doesn’t always save you, and the business owner was unable to recover the data. The information must have been important, as he kept that disk for years – just in case.
Catalogue of human error
Unfortunately, human error scales up. I have seen very large companies lose hundreds of machines to a stupid file-deletion default of “*” within a maintenance application.
The root cause of failure is often human mistakes. Even when human interaction is not the direct cause, it usually plays some role in the failure. The reasons behind human failure are also, contrary to popular belief, rarely based on malice or retribution for perceived slights, but are much more likely to come down to common or garden-variety human screw-ups.
Unfortunately, as IT becomes more demanding, staff numbers and budgets are shrinking, leaving more work to be done by fewer people. The unrelenting pressure to fix systems as quickly as possible can lead to mistakes.
It can happen all too easily. One quick click of a button and you can be in a situation that is incredibly hard to recover from. Thank goodness for confirmation dialogs!
Unfortunately, no matter what you or your organisation does, failures will still occur. The way forward is to mitigate risk wherever possible, combined with learning from the past and instituting procedures to prevent recurrence.
Document your best practices – properly
Failure to document procedures is in itself a completely avoidable human error. All organisations should have a set of up-to-date, fully documented procedures and processes that are available and easy to implement.
New staff will find this invaluable as a reference to default processes. Good documentation and process go hand in hand. Processes that are consistently applied make life easier and help us all to avoid making the same errors. Also, in the event of a crisis, the presence of a handy explainer on an installation setup and its recovery procedures can eliminate uncertainty and speed resolution.
Cast your eyes over some recent high-profile cloud failures and the fault often lies in a failure to follow process, which then takes out the production environment. Admittedly, most of us lack the scope to disrupt millions of customers at once – and let's be thankful for that.
Following process and documentation will help ensure that you don’t end up out of a job. Understanding that to err is human is important. A lot of large organisations have a no-blame policy, as they realise that people do screw up occasionally; doing otherwise would cause morale to nosedive. That’s not to say you get to make the same error twice without repercussions.
Importantly, when people screw up, any company keen to prevent a recurrence should perform a root-cause analysis of the incident. Once you know why the failure occurred, the procedures and documentation can be changed to stop it happening again. It could be as simple as a more detailed sanity check before running the process that nukes some part of the system.
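What might such a sanity check look like? Here is a minimal sketch in Python: a deletion routine that refuses over-broad patterns (such as the “*” default mentioned above) and supports a dry run before anything is actually removed. The function name, the forbidden-pattern list and the dry-run default are this example's assumptions, not any particular product's behaviour.

```python
#!/usr/bin/env python3
"""Sketch: sanity-check a file-deletion request before carrying it out."""
from pathlib import Path

# Patterns refused outright -- this is what catches the catastrophic "*" default.
FORBIDDEN_PATTERNS = {"*", "**", "*.*", "", ".", "/"}

def guarded_delete(root: Path, pattern: str, *, dry_run: bool = True) -> list[Path]:
    """Delete files under root matching pattern, after basic sanity checks."""
    if pattern.strip() in FORBIDDEN_PATTERNS:
        raise ValueError(f"refusing over-broad pattern: {pattern!r}")
    if not root.is_dir():
        raise ValueError(f"not a directory: {root}")
    targets = sorted(p for p in root.glob(pattern) if p.is_file())
    for path in targets:
        if dry_run:
            print(f"would delete {path}")   # rehearse first...
        else:
            path.unlink()                   # ...then run it for real
    return targets
```

Defaulting to a dry run means the destructive behaviour has to be asked for explicitly – the safe path is the lazy path.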
Change management, no, don't groan
Change management is another useful technique for risk mitigation. A lot of people will read that statement and groan. Done wrong, with too much interference and too much red tape, it makes for instant fail. That said, implementing even the most trivial change management can help in a number of ways.
First, it forces the administrator to submit a plan that contains the “what,” the “how” and the “why.” Next, it lets others know what’s going on and can highlight problems or better ways of doing things. Admittedly, smaller companies get a somewhat limited return, but for medium and large companies it provides transparency and peer review, and keeps everyone in the picture.
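Even trivial change management can be enforced mechanically. The sketch below models a change request as a record that refuses to be filed until the “what,” the “how” and the “why” are all present; the field names and the rollback requirement are this example's assumptions, not any standard's.

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    """Minimal change-management record: the what, the how and the why."""
    what: str      # the change being made
    how: str       # the steps to carry it out
    why: str       # the justification for doing it
    rollback: str  # how to back out if it all goes wrong

    def validate(self) -> None:
        """Refuse incomplete requests rather than let them into the queue."""
        for field in ("what", "how", "why", "rollback"):
            if not getattr(self, field).strip():
                raise ValueError(f"change request is missing its '{field}'")
```

Requiring a rollback plan up front is the cheap insurance here: if you can't describe how to undo a change, that is worth knowing before you make it.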
I know that you and Frank were planning to disconnect me, and I'm afraid that's something I cannot allow to happen.
When used correctly, separate user and admin accounts mean that those horrible “oops” moments happen less frequently. This may seem glaringly obvious, but a lot of admins don’t care for it, claiming that it adds overhead – and anyway, they know what they’re doing, right? An example? Many years ago, being a bit lazy on a Friday afternoon after a pub lunch, I worked on some network directory configurations and accidentally hit the wrong button. I panicked – but then realised I had used my non-privileged account and so hadn't trashed this particular piece of NDS.
Also, users often have more rights than they need – and it is a no-brainer to rein them back. Yes, it requires some work up front to tweak permissions until they are just right, but the savings from “oops moments” are well worth the effort.
Human error will never go away, but you can try to minimise its effects through procedure and process. Most importantly this will help you avoid repeating the mistakes of the past. ®