Sysadmin's three-line 'annoyance-buster' busts painstakingly crafted, crucial policy
Whoops! I've broken the internet... but hey, everyone gets a coffee break
Who, Me? Monday, bloody Monday. But fear not – Who, Me? has a suitably stressful story to remind you things can always be worse.
This time El Reg's weekly column of tech catastrophes comes from "Todd", who worked as an operations sysadmin for a medium-sized regional ISP.
His work was consistently plagued by an irritating alert that one day he decided to "fix".
"The alert said the security policy on our DNS headends was denying write access to a particular file whenever a particular system daemon would start," Todd explained.
"This file was not important, so we ignored the alert, until one day when I happened to be in the right frame of mind to deal with the annoyance."
So Todd wrote a security policy. "It was a three-line policy that basically said 'this file can be accessed by this process in this way'," he said.
"I gave it a name following our general practice, something like 'foobar_keepalive', only 'foobar' was something else."
Unfortunately for Todd, the 'keepalive' part was accurate – and it is about to become crucial to the story.
"Seeing as how this was a simple and harmless policy change, I went ahead and pushed it out to all our DNS headends, then started thinking about what to do during lunch."
As Todd was musing, noise levels in the office rose perceptibly and the internal chat got busy, displaying various errors.
"It only took a few seconds to figure out that the problem was with DNS, and that nothing (at all) was resolving for us," he said.
"And judging from the howl coming from the building next door – where the tech support call centre was located – for all of our customers as well."
After a great deal of hollering "and one pointed 'WTF' from the boss", Todd rolled back the changes. But the problem didn't go away.
The team turned off the security policy engine on one DNS server, and at once DNS began to resolve again. "Thus propped up, we investigated the problem."
The problem was that somebody else had "painstakingly crafted another security policy named 'foobar_keepalive', which, when restored from backups, amounted to 9,000 various directives to the security policy engine".
Which Todd had overwritten with his three-line pre-lunch "annoyance-buster".
"So the service named 'keepalive' suddenly couldn't do the things it needed to do for our DNS headend servers to assure one another that they were in fact alive and withdrew themselves from routing. All of them."
Todd said that, although service was restored within a few minutes and the root cause was resolved within about 20, he had knocked everyone offline for a noticeable period during the middle of the day.
"Or, as my partner succinctly put it, I gave everyone a coffee break."
However, as his boss made abundantly clear – and Todd acknowledged – "we don't stay in business by giving our customers coffee breaks on a whim".
When was the last time you gave your colleagues and customers an unrequested break? Tell Who, Me? your story and it might feature on these pages next week. ®