Label your cables: A cautionary tale from the server room
A SCSI-brained 'friend' in need is a pain indeed
This Damn War
Friends of friends can be a pain in the nuts when you're in IT: you find yourself invited to someone's house “for a beer” and then somehow end up kneeling in a dark, dusty under-stair cupboard trying to coax a six-year-old router back into life.
The only thing potentially more daunting is former employees of suppliers of clients. Which is who phoned me one day and asked whether I could set up a SQL Server cluster for his client.
Yes, the client of the former employee of the supplier of one of my clients.
Being a responsible consultant and having never actually done a proper SQL Server cluster before, the answer was, of course… errr … well … actually it was “yes.” I figured it couldn't be all that hard and so dug out the cheat sheet from the Microsoft web site and set off.
On arriving in the London data centre, I checked out the spanky new Dell server pair and storage array, as supplied by Dell in a kind of “server-storage-cluster” bundle. Each server had a pair of RAIDed boot drives, so in no time I had the operating system and SQL Server installed. Next was to get the RAID array configured.
It was pretty obvious how to plumb it: a SCSI port in each end of the array and a single port on each server's SCSI card. After some playing, though, it became clear that the setup just didn't want to work with both servers powered up at once.
In the end I swallowed my pride and called Dell's support line. The very helpful chap walked me through various tests and checks, and it soon became clear that the problem was that there should have been a RAID adaptor in each server, not just the SCSI adaptor, and that they'd been missed from the kit that was shipped. The RAID adaptors would talk to each other through the SCSI bus but would be polite and co-operative with each other rather than exhibiting the downright hostility of the SCSI cards.
“We'll courier you the right parts overnight”, they said. Sure enough they did, and by midday the next day everything was rocking and rolling, we'd tested failover loads of times, and the client was happy.
A year or so passed. Then one Sunday the phone rang. It was the end client (you know, the client of the former employee of… ) and he said, “The server won't come back up”. This rang an alarm bell: it wasn't “won't come up” but “won't come back up.” I embarked on the usual line of enquiry: “Has anything changed?” Answer: “I don't think so.”
After a while, a reasonably important fact emerged: “Well, we've moved to a new data centre, but I'm sure I've set it up as it should be.”
Poor guy was knackered, as he'd basically spent the previous 24 hours moving the entire installation across London single-handed and he absolutely had to get the server up ready for business the next morning. I told him to try a few things (check out the RAID card BIOS to see whether it could refresh its settings, and so on) and then get some kip, and to call me at lunchtime if it was still down; I had a commitment until 2pm but could hop in the car for a trip to London after that.
Sure enough, all was still broken at 2pm so I arrived at the new data centre at around five.
Powering up the servers wasn't pretty. They whined like spoilt children about their missing disks, though since the boot volumes were on-board they would at least wake up into Windows. Not that this helped at all, of course. So I checked out the pre-boot menus on the RAID adaptors and they weren't seeing any storage – which was odd, because it was definitely there and claiming to be healthy.
I figured I'd check out the pre-boot menus on the SCSI adaptors, and got a lot more complaining about “I can see other crap on this SCSI bus, please don't make me talk to it”.
Eh? There's other crap on the SCSI bus?
The penny dropped and I powered everything down, went round the back, and moved the cables on the servers from the SCSI cards into the RAID cards. Which, to be fair to the client, looked almost exactly like the SCSI cards.
Crossing everything firmly, I powered everything back up and there was a whole lot less whinging. Still not happy, though, and still some moans about: “Hey, I can see some disks… but something doesn't look right”.
Another power down, and this time I swapped the cables over between the machines – so the one from server 1 was moved to server 2, and vice versa. Again, a lot of crossed digits and a push on the power button: no moaning at all … but Windows not showing any storage. Another restart and into the pre-boot menu of the RAID array.
When I asked it about what storage it thought it had, it basically said: “Hmmm, there's some weird shit going on here and I think my config is a bit screwy; would you like me to re-read it from the copy I kept on the disk array?”
That'll be a “yes please”, then.
Save-and-Exit … wait … voila! One perfectly functional cluster, and with a bit of failover testing I was back on the road less than an hour after I'd arrived.
The moral of the story for the client: if you don't know anything about it, and you really must do something risky with it, spend a quid on a pack of labels.
The moral of the story for the techie: never believe a user when they tell you that nothing's changed.
Or, if you want a more generally useful moral: never believe a user. ®
Got war stories of your own you’d like to share? Drop us an email to email@example.com with the subject This Damn War.