BOFH: The Great Patch Mismatch
Halon, the noblest of gases
"It's just a minor ROM patch," the service engineer bleats. "It'll only take five minutes."
"Yeah... Nah," the PFY says.
"It's minor - just addresses a couple of memory leaks and cookie issues in the web interface."
"Yeah. Nah," I repeat.
"It's just the interface - the UPS will be completely unaffected!"
"Nope," the PFY says.
"My assistant and I subscribe to the belief that if it ain't broke it don't need fixing," I add.
"But it's a mandatory fix!"
"We've got a mandatory change freeze on server room gear."
"Hang on, I'll just check the thermometer..."
"To see if Hell's frozen over..."
"But if you don't do the upgrade you won't get the advanced diagnostics!"
"Which would uncover hitherto undetectable problems?"
"Provide us with enhanced predictive failure?"
"Report error conditions in a different manner?"
"Yes, instead of the light going orange when there's an error, it now goes red. And the code on the diagnostic display changes from 000 to :-["
"Well that seems to be worthwhile having!" the Boss says, having slipped into Mission Control unnoticed.
"No it's not!" I chip in.
"It is," the service droid says. "And each time you reboot and the self test passes, the status changes to :-]"
"We should get that," the Boss says, with not even a hint of sarcasm!
"Nah, we shouldn't," the PFY says.
"It's completely transparent, and it can all be done online!" the engineer says, looking to the Boss for papal intercession.
"It's one of those shiny bead situations," I say angrily to the PFY a few minutes later when the Boss has vetoed our veto and told the engineer to go ahead. "Like those RAM fail front panel LEDs."
"?" the PFY asks.
"RAM fail LEDs on the front panel instead of on the motherboard. Your RAM craps out and the corresponding LED lights up on the front panel. Everyone loved them until they found out you had to disconnect the front panel to open the lid. The lamps would go out and then you'd realise that the front panel LEDs were numbered 1 to 8 whilst the RAM modules were labelled A1 to A4 and B1 to B4, so then you'd have to guess whether faulty module 5 was module B1 or A3. But they bloody looked cool when your server fell over - and that's the important thing."
"Well if the UPS upgrade is transparent then what diff..."
"What's that?" the Boss asks, obviously hovering outside Mission Control with conscience pangs.
"That's Nagios reporting that the UPS interface is offline."
"So it's down?"
"The interface, yes. But not UPS1. If it was UPS1, my desk would be down and the mains fail buzzer on the PFY's desk would sound. Unless UPS2 went down too, and then there would just be silence."
. . .
"I... need to restart the interface," the engineer says, leaning around the doorway to Mission Control a few minutes of tapping and prodding later.
"Because you mixed up the terms transparency and opacity?" I ask.
"It'll only take a minute."
"I think we'll schedule that for another time," I say.
"But I've already put the service key in."
"Then take it out."
"If I take it out the interface will restart!"
"Then leave it in."
"If I leave it in the UPS will stay in Bypass mode."
"YOU PUT IT IN BYPASS MODE!!" the PFY snaps.
"Shouldn't it warn you that it's in bypass?" the Boss asks.
"NOT IF THE BLOODY INTERFACE ISN'T WORKING!" I seethe.
"Well, we need to get it back online, don't we?" the engineer hints.
"No, we'll use the external mechanical bypass and then restart the UPS," I say.
"It's in internal battery bypass," the engineer explains. "It's still converting to DC and back out to AC, so if you put it in external bypass without telling it what you're doing it could short out the inverters."
"Then tell it what you're doing!" the Boss says, predictably.
"I can't - the interface is down!"
"So if you restart the interface, will it shut down the UPS?" the Boss asks.
"No, the interface and the UPS aren't connected like that. It's just a serial connection."
... 10 minutes later...
>click! Whhrrrrrrr..< >buzz buzz< >buzz buzz< >buzz buzz<...
"Not connected like that," I say to the Boss as my desk descends into dark silence.
"So the servers are down?"
"No, 98 per cent of them will still be up because we have dual UPS units."
"Don't worry!" the engineer says. "It's coming up now. Be back in a jiffy!"
. . .
"Mmm?" the PFY says as the engineer pops his head back into Mission Control cautiously.
"It's detected a ROM mismatch with the other UPS."
"How?" the Boss asks.
"They're connected. So the updated UPS has seen the other UPS is back-rev and won't complete start-up until the other one is up to date. We need to update the other UPS."
"And have the whole room power down? I don't THINK so!" the PFY snaps.
"No, all we need to do is update the firmware. It doesn't have to boot from the new firmware, it just has to have it on board."
"Can't we just disconnect the cable so that the UPS thinks it's standalone?" the PFY suggests.
"No, it's in a cluster."
"Can't we uncluster it?"
"Sure, we can do that from the interface once it's booted."
"So... we're going to have to update the other UPS - but NOT reboot it?"
"Yes, it's just a patch, not a restart."
... Five minutes later ...
"And this," the PFY says to the Boss, "is what silence is like. Notice the absence of phone calls - because the phone system hangs off the UPS units."
"Notice," I say, "the sudden lack of fresh air - because the aircon system has noted a power anomaly and shut down all the large chiller and fresh-air-fan motors."
"Notice," the PFY says, "the sound of sirens in the distance - because parts of our fire system which SHOULD have been replaced use normally open relays to detect a fire condition."
"Notice," I say, "the complete lack of sound from the server room. Not a single whirr, no buzzes, no clicks. Not even the sound of an engineer trying to explain something - because the halon system*'s connected to the crappy old relays."
"Wasn't it all brand new sensors in there?"
"Who can tell? Perhaps you can go and check?"
"Uh.. no, I'm sure you're right," the Boss says nervously.
"No seriously, go and check!"
"No, it's fine - I'm sure I can see perfectly through the viewing window."
"Ah good, so at least we salvaged something from today!" the PFY comments.
"You know what transparent means now..." ®
* A fire suppression system that works with halon gas.