Fear and Loathing in the co-lo cabinets
Is that a frog in your pocket or are The Reg's servers down again?
Nightclub shootings by Reg Towers, as happened last week, certainly didn't impress this staffer. I live in Hackney and frankly, if I want nightclub shootings there's absolutely no need for me to travel into the West End to get them. Plus if the psychotic Hackney Yardie gangster with a spray and pray gets narked in a nightclub queue, he's going to do a heap more damage than just wing a bouncer. Basically, I'm underwhelmed. And there are other jolly features of living in Hackney too — for instance, it's horribly close to the physical location of The Reg servers in Aldgate. Horribly, horribly...
There's really no way out. The techies are in Bradford, Cullen's conveniently out in the wilds of Kent, and when Birtles isn't out not quite clinching vast money-spinning deals with the movers and shakers over the odd chablis he's in flipping Merseyside. Or on a train. Or a plane.
Face it Lettice, you're IT. Which, given the state of The Register's Cisco gear over the past week*, has meant many happy hours down at Level 3 watching support engineers at play. Friends, it's been an education. But before I continue I should just pay due credit to Mike Banahan of GBdirect, who was IT for the day on Sunday of last week: he hared down from Bradford, patched things up into the wee small hours, then got to sit at Kings Cross station while winos hurled abuse at one another until dawn, it being too late for him to get a hotel or catch the last train back. Well done, Mike.
Naturally, it then fell over again on Monday night, twice. After the second we left it down and called support, and at about 1am clinched an appointment with an expensive professional at 8am on Tuesday. At the time we seem not to have noticed that this didn't entirely fit with the four hour response time guaranteed in our support contract, but there you go. This is where the handily-based IT comes in. I am authorised to wander into Level 3 at all hours of the day and night, obtain a cabinet key and sit morosely on the floor in the Arctic chill of the server room while security gaily shouts "put that mobile phone out!"
I can also email Level 3 support with the name of a contractor and get them on the door list at Level 3, but if I don't have the name of the engineer who's going to show up, I can't do this, and in general the support people don't seem to know who's going to show up until, er, they actually show up. Never mind, I can go down there and sign them in, which is what happens on Tuesday at 8am.
Or not. Episode One of the mighty Cisco support machine cranks into action approximately on cue at 8am, when I overhear a bloke from DHL asking for somebody called Simon Myers at Level 3 reception. They've never heard of him, of course, but he's in Bradford anyway, and despite him always stressing this and telling the support people to ask for John Lettice, his representative on earth, they always ask for Simon Myers. And it might be worth us just doing some kind of Vulcan mind-meld job, given the extent to which we're starting to overlap. Last night, for example, a Cisco rep who's clearly losing the plot entirely emailed some useful data for Simon to me, and left a gung ho message for him on my home phone line.
Anyway, I get the net over the DHL guy before Level 3 shoves the replacement LocalDirector behind the desk then starts denying it exists (if it's not on the list it doesn't, even if it's sitting next to you — one of our Compaqs may or may not currently be in this limbo there, but that's another story), and go back to waiting for the engineer. Who arrives at 9am, this having been the time his people had actually booked him for.
This triggers my first observation about how support works, or doesn't work. Note that the hardware arrives to schedule, on greased ball-bearings, and things only start to wobble when the humans get into the picture. The picture was exactly the same regarding the Compaq I just mentioned — the swapout parts were on the lorry (albeit to the wrong place) nanoseconds after we'd put the phone down, whereas the engineer... But as I say, that's another story.
Slipping our disks
Naturally, this being The Register, it's a little more complicated than just swapping out the LocalDirector, and it's this teensie complication that turns out to be fatal. We need to get the configuration off the faulty kit in order to put it back onto the replacement, but we can't do this without the password, and we don't have the password. This is sort of our fault — the company that put the system in place (did I tell you about the backup admin server that wasn't actually connected to anything else until we noticed? But that's another story as well) is now no longer with us, and the staff who'd done it were no longer with the company well before that happened. We know somebody who thinks he might know somebody who might know where the bloke who might know the password went, but for chrissakes do you really have to involve Interpol to nail down the password for your LocalDirector?
Resetting the password is actually dead simple, because all you need is a Cisco floppy disk that you shove into the floppy drive cunningly concealed under the plate on the front of the box. It's dead simple if you have the floppy disk, that is.
The engineer doesn't but never mind, Cisco's emailed it to him, he just has to get to his email. But... By happy coincidence the hard disk in his portable died that very morning. Fortunately Mr Lettice has a portable about his person, but unfortunately the floppy drive is elsewhere, Mr Lettice having more or less abandoned floppies quite some while ago. Never mind, the engineer is from NCR, so in a twinkling we can have a replacement portable shipped over from his base in Marylebone Road. This turns out to be one of life's more sustained twinklings.
Over a period of approximately five hours we wait while totting up the number of times Mr Lettice could have gone back to Hackney and got his floppy drive, the number of times the engineer could have gone to Marylebone Road and got his email, and the number of times we could have gone to whichever safe Cisco keeps the ruddy thing in and blown it.
Engineer's boss eventually elects to drive a replacement portable down himself, claiming he'll be there in 15 minutes. No, I think, that's how long it's going to take you to park.
Eventually, we have a total of three portables with no floppy drives (it's catching), and one geriatric Tosh running Win95 with — result! — a floppy drive. Boss instructs file to be emailed to him, and he can then collect it on one of the machines with, er, no floppy drive. Now all we need is a phone point.
You guessed it. Considering that Level 3 is hosting a giant pile of servers all of them connected to the Internet already, why the blazes would you want to bother with boring old analogue phone points? You could actually just connect the portable at the cabinet, filling in the right IPs etc (as the saintly Mike Banahan had done on Sunday), but the engineers elect to wheedle access to the only available analogue line, at the front desk fax machine, instead. I discover later that Level 3 reception doesn't care about the fax line anyway, given that the fax is out of toner.
Nokia 7110s, incidentally, seem to be general issue to NCR, but setting portables up for mobile data doesn't seem to be. Mr Lettice's 7110 is set up for mobile data, but as I do it via a Psion netBook which the Win2k portable obstinately refuses to talk to, that gets us nowhere. There's a 1.5 meg download on the Nokia site that lets you use the 7110 with Win2k, and I got it the following day, just in case. I bet the NCR guys didn't, though.
Numerous "will you send that bloody email!" phone calls to NCR later, we have the file on the machine with no floppy. So we just need to set up an IR connection to the old Tosh with the floppy, and we're in business (it is now 3pm, and the site is still down). Privately boggling over the optimism of people who think an IR connection on a Win95 machine will actually work, I decide to press the reset button on the LocalDirector and see what happens. The site comes back up, and of course the IR connection doesn't work.
I now have a dilemma. I know that pushing the button brings the site up, but I know it'll go down again at some random point. Could be a couple of hours, could be 16, could be you never know. So I can actually keep us alive so long as I camp in the cabinet room (no mobile phones, no data points, no food or drink, no chairs, Arctic microclimate) for the rest of my life. It's actually quite attractive compared to the day of futile tedium I've just been through, but it really is very cold in there, and although it's warmer by the Starlabs cabinet (what a lot of Compaqs they've got), it's still pretty damn parky.
Alternatively, I could not let these two poor suckers go until it's fixed, which I calculate at being another three hours, even if everything goes according to plan (and given experience so far, how could that be?). I crack, say we'll go with the resets for the moment and reconvene tomorrow morning, where they guarantee one of them will be there, plus another techie who's cleverer than either of them, plus all of the gear they need to do the job in the first place.
Another observation — maybe they honestly believe they'll do this, but you, I and they (subconsciously) know they're out of here, never to return. This is standard support procedure. And the guy who's going to wind up coming tomorrow is the mark, who's not going to know he's been stiffed until it's too late for him to do anything about it.
The next morning a guy arrives asking for Simon Myers (the password, you remember) and although he can't get a pass because he's not one of the two guys who definitely weren't coming I'd arranged passes for, I've taken the precaution of getting my own tail down there so I can sign him in. (Note in passing this wonderful opportunity for blowing several hours of expensive techie time. You don't know who's coming so you can't arrange for them to be let in, so numerous techies arrive and hang around morosely, barking into their mobiles in reception, and then eventually shuffle off, access denied.)
Engineer three — I'm sure you guessed this as well — doesn't actually have the critical floppy, but don't worry, it's been emailed to him, so all he's got to do is pick up his email, which he hasn't yet. I note he's got a Nokia 7110, but of course... I explain the case to him, show him how to wheedle the fax line out of reception, then head for the office after leaving instructions for him to press the reset button whenever we call him.
As the day goes by it gets weirder. He does actually get the floppy and use it, then discovers the firmware in the replacement LocalDirector is severely older than in the dead one. Simon, who's on his case while I'm in meetings, claims that "they" (which I assume means another engineer may have arrived at some point) came up with the wheeze of opening up both boxes and trying to swap the bits about, but retreated in horror on discovering that in there "it was just like a PC." I'm not entirely convinced this allegation is true, although I believe the bit about it being just like a PC, but eventually, after an afternoon of the site whipping up and down like a flasher on acid, we have an operational unit with up to date firmware and we know the password.
Except that's not the end of the story. Down again goes the all-new kit in the early hours of Thursday morning, we press reset to confirm it's still a LocalDirector problem, and we call support. Support advises us to wait until it goes down again, then leave it down while they send an engineer ASAP. It goes down again at five, by which time the guy who told us that was off-shift and the replacement didn't seem to think we needed an engineer, because we were back down to a priority three, or something. We shout our way back up to a priority one again, and my email resumes filling up with automated messages; I've shouted at Cisco in the US to shout at Cisco in the Netherlands to get us an engineer now, with name and ETA and it's 6.45pm UK time.
But I think I'm going to Aldgate tonight, and when I get back I fear I'll have to write some more of this. I might even be able to post it when I'm through. Upside — we might not be getting our money's worth out of this support contract, but it's sure as hell cost somebody a packet more than the £2,300 annual wedge we pay for it, so there's a certain grim satisfaction attached.
The call from Cisco comes with the engineer's name at 8.20pm, and remarkably his ETA is 12.20am. Within four hours of them actually confirming the engineer's coming? And there was me thinking it was supposed to be from when we reported the problem... It turns out to be engineer number one again, whose mobile phone number I cunningly collected back on Tuesday. So I call him.
To the background of munching noises at his end (I'd already eaten everything in the house that could be manipulated with the 'not on hold to Cisco' hand) we discuss the case. But then the site that's supposed to be staying down until the engineer gets there comes back up. We agree this could change things, so I call Cisco, which can now consider a remote log in to check what happens. But actually they come up with the useful nugget that the site came back up 37 minutes ago, so it looks like the engineer's still coming. The engineer calls, but it's not engineer one after all — it's some mug who's now halfway from Cambridge. I tell him to call me again when he hits Tottenham, and I'll snag him and pilot him in.
He calls again, tells me he's been pulled off the case, and after a screaming u-turn is heading back to Cambridge. I call Cisco, they confirm the engineer ain't coming after all. Up in Yorkshire Simon thankfully heads for the pub before last orders, I demolish the bottle of wine I'd been looking glumly at all night, and go to bed.
We keeled again overnight, apparently, but Simon dug some luckless Level 3 techie out of bed at 6am so he could go in and push the button. Whew, could have been me. He tells me that Cisco has juiced up the logging on the LocalDirector, raving that they've only done so six days into the case. But we must be reasonable — given that we didn't know the password for some of that period, they could surely only have done so 36 hours ago, or thereabouts.
We resume waiting. Simon considers that as all of our kit is working and we're running a completely new LocalDirector, it does kind of look like the problem is somewhere beyond our cabinet. What, for example, if some klutz has duped one of our IP addresses on another piece of kit? Wouldn't that slay us every time it gibbered into action? We're going to have a damned job getting whoever it is to admit it, but if we can get close maybe they'll stop and carry on denying they ever did it in the first place.
Through Friday we don't go down. I do go to Level 3 on the way home to seek out the MIA Compaq server which Compaq may or may not have repaired. My not having been in Level 3 reception when it arrived, it's in the bowels of the building addressed to an indeterminate person and signed in as being from the repair company, not Compaq. Easy-peasy. I prove it's mine by telling them what the label we stuck on the front says, shove it in the rack, plug in and run.
We still don't go down. Is it... over? Shush... If so, this untracked issue has probably cost us about a third of our traffic over the past week, has cost Cisco and/or NCR absolute piles of dough, and whatever it was fixed itself (maybe) without our being able to affect it. Weirdly, we do now seem to have a backup available, although it's not, er, exactly ours. Back on Tuesday engineer one and I pulled the replacement LocalDirector out of its packaging and shoved it in the cabinet before we aborted that day's mission. A plaintive call from engineer three on Wednesday however revealed that Level 3 security had tossed the packaging, which engineer one had left in the aisle. Cisco won't accept it back without documentation, so he shoves it in the cabinet. He calls me again Friday and gives me a phone number for the people at Cisco who can facilitate my returning it. I consider the alternative strategy of charging them for rack space, and security... ®
* In order to keep the helpful suggestions in line a tad, we'd just like to point out that we know having a single point of failure in the shape of the LocalDirector is unwise. But it's stayed up for over a year, and the entry cost of the lot when we bought it was around £12k. We really could not afford two of these, although we're now checking out eBay for bankrupt stock (buy two or three, stuff the support, just shove one in when the first one breaks). Or we're looking at a couple of alternatives, but they have to be next to free, and it's only the load balancing we've got to get back on top of, so we don't want new servers unless you give them to us for nothing, we don't want expensive total rip and replace services, and we're only interested in hosting deals if you savagely undercut Level 3. Feel free to make suggestions anyway, but vendors, you're in it for the glory, not the money, OK?