Sysadmins: Your best tale of woe wins a PRIZE
Hey users, it's Sysadmin Day next month. We want cake
Competition
I'd like to share with you the plight of a good friend of mine. He's a systems administrator for a mid-sized American accounting firm. His story isn't particularly remarkable, and all the more important for it.
Bob has a bachelor's in computer science, and a lot more experience than I'd normally expect someone clutching their newly-minted university degree to have. Bob put himself through school working as a junior systems administrator, and has sort of stuck to the job since.
Bob is now a Windows server admin, also charged with fixing the phone system and poking the routers when they blow up. As the server guy, he is responsible for out-of-hours maintenance. Documentation also lands in his lap: a project he puts a great deal of effort into, but which never quite gets finished.
Corporate belt tightening being what it is, the arrival of additional work is no longer accompanied by additional bodies. Instead, departments get reorganised and now Bob has to do "that little bit more".
Bob's new "logistics" assignment is a nice way of saying he is now also responsible for cleaning the phone room. Bob is now a janitor. The new "service desk" task means that he has inherited the role of first responder on the corporate ticket system. Our friend Bob is now asked to keep the users happy in real time and also has to stay late to fix the servers when they need fixing. In between crises, there is an increasing push to create documentation for everything from the Hyper-V servers to Exchange that allow common tasks to be delegated to end users in a branch office.
Bob doesn't get much sleep anymore.
Every now and again Bob pops onto IRC with tales of some cabling nightmare he's encountered, a PEBKAC or some bit of glorious faceplanting on the part of the folks higher up the food chain. The chance to get it all off his chest seems to help poor Bob; it knits together the tattered fragments of his sanity long enough to resume the daily grind of whacking moles. While it goes without saying that sysadmins have to be adaptable, each and every one of us has at times felt underappreciated, burnt out and under-resourced. We've all been where Bob is, and we've the horror stories about cabling, applications, vendors and pointy-haired bosses to prove it. (There may possibly be a few user stories in there, who knows.)
July 27th is Systems Administrator's Day. Those of you who aren't sysadmins have a month to find a way to thank those who keep the internet running. Cake is appreciated, and coffee is always nice. A coastal villa is a handy substitute, if you've the time.
For those of you who are sysadmins, take the time to vent in the comments below. If you've got a real doozy – and you're based in the US – then submit it to SolarWinds' Systems Administrator's Day contest (rules here*). There's an iPad, a ThinkGeek gift card and an original piece of artwork depicting your story up for grabs. Oh, and apparently I'm one of the judges. So there's that. Since I'm ineligible, I'm counting on you lot to keep the contest properly supplied with a soul-rending collection of sysadmin nightmares. Good luck to you all. ®
* One of these is that entrants have to be US residents
We once purchased a server with Windows Server pre-installed. By the time it was delivered, plans had changed and it was reformatted to run Linux. Three months later it broke down. It looked like a simple failed PSU to me, but it was still under warranty, so we called for an engineer.
Some hours later he told us it was OK again and left at a trot. We were surprised that he hadn't left it powered up, and dispatched someone to the machine room to boot it. But it was booted ...
and once again running Windows! The muppet thought GRUB was a hardware error, so he reformatted the disk array and reinstalled Windows. Thank heaven for backups, and that it wasn't desperately mission-critical.
You don't have to outsource to India to get muppets.
Re: Too many to choose from...
Years back, we had a 'Unix consultant' who decided that he'd empty the user trash can on the server - the directory called /usr/bin - then wondered why the server stopped working.
I once was caught by a network outage whose ultimate cause was:
An X509 certificate, in a Java package, had expired. The Java package was part of some UPS monitoring software (and nothing you'd ever normally play with). The UPS monitoring software went absolutely crazy and just hung the entire machine (as in, your mouse couldn't move across the screen in under a minute). The machines affected were pretty important and all of them had the problem. Because they still responded to some queries, pings, etc. (just about), nothing bothered to fail over, but even the failovers had failed anyway. And you couldn't log into the damn things or use any of them directly.
And obviously, the entire network slowed to an absolute crawl, affecting the entire place it was in. Sure, we could force failover to a clean machine, but those servers had apps running and we needed those apps (data - pish, we had that; but if you can't submit the payroll without that heavily-locked-down smartcard tied only to those machines, you're stuffed).
Sounds like nothing, but go through your standard procedures: everything responds, everything responds to ping, just some things are very slow. Can't narrow it down to one server. Try to log in to them remotely and it times out. Go find the physical machines. Try to log into them. LITERALLY could not get the mouse to move. Suspect hung servers. Wait a suitable length of time. Force a reboot. Server boots up at normal speed, does its disk check, nothing wrong, gets to the login screen, hangs like hell.
Okay. Weird. Safe mode. Same thing. Okay. Disconnect from network. Same thing. Restore from latest backup (which takes HOURS). Same thing. Okay, restore from Known Good backup (more hours!). Same thing. Confirm the machine is fully operational with an install from a clean Windows CD (but too much software on there to just replace every server in the space of a day or so from nothing). Perfectly working, no problems at all. Okay, restore from an ancient backup taken just after the Stone Age which DEFINITELY, DEFINITELY, DEFINITELY works. Same thing. WTH? Try other machines. Same thing.
Restore a backup to an identical machine acting as tepid-spare. Same thing. Bugger. Complete reinstall looking like the only way to get it back and working. Clean Windows install on a clean, isolated machine. Start reinstalling software from disks (known-good versions of everything that worked for YEARS). The reboot AFTER the UPS software is installed starts displaying the same symptoms. Damn! (But at least we can narrow it down.) After a couple of installation retries (and rebooting after literally every step), narrow it down to the UPS software installation. Update available? No. Reported problems? Yes, with the same symptoms. No solutions.
Fall back onto the last resort of IT: Phone random IT people and see if they've had the same thing. Couple of them experiencing it, one of them in the advanced stages of diagnosis. After much jiggery-pokery and literally just turning the software on under heavy debugging/monitoring (and LOTS of rebooting) and trying to identify cause, narrow it down to accessing a certain Java package and then, inside that, find an expired certificate. How do we know it's the cert? Can you fool it by putting the clocks back? Actually, yes. Before the expiry time, it works, after it doesn't. Can you replace the cert? Yes, but the software refuses to work even if it doesn't hang. Can you fake it? No. Can you remove it? No, the software won't work. Can you run your processes with an inaccurate clock? No. Can you run your server without UPS monitoring? Not really.
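(As an aside: if you ever need to check for yourself whether a cert buried in a Java keystore has expired, a few lines of Java will do it. This is only a sketch; the keystore path, password handling and output are my own invention, not anything from the vendor's package:)

    import java.io.FileInputStream;
    import java.security.KeyStore;
    import java.security.cert.CertificateExpiredException;
    import java.security.cert.CertificateNotYetValidException;
    import java.security.cert.X509Certificate;
    import java.util.Enumeration;

    public class CertExpiryCheck {
        public static void main(String[] args) throws Exception {
            if (args.length < 1) {
                System.err.println("usage: java CertExpiryCheck <keystore> [password]");
                return;
            }
            // Keystore path and optional password come from the command line;
            // what the vendor actually called its file is anyone's guess.
            KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
            try (FileInputStream in = new FileInputStream(args[0])) {
                ks.load(in, args.length > 1 ? args[1].toCharArray() : null);
            }
            Enumeration<String> aliases = ks.aliases();
            while (aliases.hasMoreElements()) {
                String alias = aliases.nextElement();
                java.security.cert.Certificate c = ks.getCertificate(alias);
                if (!(c instanceof X509Certificate)) continue;
                X509Certificate x = (X509Certificate) c;
                try {
                    // Throws if the cert is expired or not yet valid.
                    x.checkValidity();
                    System.out.println(alias + ": OK until " + x.getNotAfter());
                } catch (CertificateExpiredException e) {
                    System.out.println(alias + ": EXPIRED " + x.getNotAfter());
                } catch (CertificateNotYetValidException e) {
                    System.out.println(alias + ": not valid until " + x.getNotBefore());
                }
            }
        }
    }

(And if the cert is a JAR signature rather than a bundled keystore, keytool -printcert -jarfile yourfile.jar will print its validity dates from the command line.)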
So that's how long it takes just to work out what's happened and what you can do about it. The affected servers could have their filesystem accessed by something else (a Linux boot disk) and the UPS software disabled, because there was literally no other way to get the software turned off once Windows had started (and actually, I moved the UPS monitoring to a Linux machine that would then issue shutdown notices to the Windows servers instead!).
Eventually, after about a week, the UPS manufacturer issued an update and pretended it never happened (and NOBODY installed that update without first installing clean or working out what the problem was). But you try diagnosing that bugger with a whole company breathing down your neck and silence in the server rooms.
What saved the day? Linux, contacts, careful investigation, no blind reinstalling, and not relying on the UPS manufacturer to actually DO anything about it.
The last thing you expect when your whole network goes down (and only the servers, not the clients) is some UPS software that's been working FOREVER, on known-good versions, on known-good servers to suddenly stop working even on restoration of older, working backups (whose restoration had ALWAYS worked before that point when testing backups) and then hang the machine to the extent that you literally could not do ANYTHING on it at all. I was just about ready to kill Microsoft until we saw that no updates had taken place in the weeks leading up to it and the servers had been rebooted and backed up and test-restored since then.