Vibrating walls shafted servers at a time the SUN couldn't shine
All-night vigil discovered the dirty deeds behind constant re-boots
On-Call If it's Friday – and absent some weird time/space slippage we're pretty sure that's the case – that means it is time for another instalment of On‑Call, The Register's Friday column in which readers recount their stories of the ups and downs of doing tech support.
This week, meet “Rick”, who told us he spent the late 1990s “working as a UNIX/Solaris consultant for a SUN value-added reseller.”
One of Rick's customers was a large bank that ordered a pair of SUN E6500 servers. Oracle may have hosed out its hardware teams, but still has this whopping PDF Reference Manual for the machines. What bruisers they were! Each needed a full rack all to itself to house a 16‑slot card cage, something called a “quad fan tray”, memory module, UltraSPARC II module, media tray, a pair of power/cooling modules, an AC power sequencer and even a peripheral power supply!
Rick told us that the bank put the two servers on the top floor of its building, where they hummed away happily until one morning they were discovered to have rebooted overnight.
And not just rebooted once: they'd been up and down all night like someone who'd topped off a few beers with a salmonella-tainted kebab.
“The customer called and was furious,” Rick told us. And stayed that way for days, because the first technician to visit couldn't figure out what had gone wrong. Nor could other experts over the next week.
Eventually Rick and a colleague were given the job of sitting at the bank all night if needs be, in the hope they'd spot the servers crashing and figure out what was going wrong.
During said vigil, Rick's colleague was on watch when the screen of a VT‑100 terminal connected to the server over a serial port suddenly froze.
At this point Rick's mate asked “Why is the wall vibrating?”
The pair investigated.
Remember how these servers were on the top floor? That's why Rick and friend's “quick look around the data centre found that the wall separated the electric motors for the service elevator from the computer room.”
A bit of detective work later and Rick figured out that the only users of that elevator, at least late at night, were the bank's cleaners. And when those worthies loaded up an evening's worth of detritus, “the lift motors generated a large enough field to flush the RAM.”
Rick offered the bank the choice of a computer room reshuffle or new shielding on the wall, and the latter won out.
“It was cheaper, faster and a permanent fix,” Rick recalled. “Once it was done, it worked.”
Unlike the twin power modules, AC sequencer and extra power supply in those SUN boxen!
What weird causes of infrastructure failure have you detected? Write to let us know and your story might just make it into a future edition of On‑Call. ®