WAN, bam, thank you... oh @£$%. We've gone dead. Drop the burger. RUUUUUN!
Get back to the data centre, stat
This Damn War
Big, global WAN roll-outs have their hazards. Size is one of them, but you can deal with that through effective project planning and management. Complexity is another, but combining sound design with a team of highly competent engineers can address that one too.
And of course there are potential difficulties when you've deselected your legacy provider in favour of the new one but still need the old vendor to make some in-depth, complex configuration changes. That's what happens when you're doing the change a part at a time and, for a while, you'll be running a hybrid system with both the old company and the new one.
None of the above struck us as a problem a few years ago. Step one of the migration was a single exercise that would move four US sites from the old setup to the new. My colleague and I flew to the east coast of the US, where our primary data centre lived. Another Brit went to the Midwest to the other data centre.
We had two of our local IT guys manning the two US offices. We'd start at 7pm EST, which was midnight back home, and the change was planned to take about nine hours. All the prep was done: the switch and router configs were backed up, the new ones were sitting on the internal flash ready to copy over and go, and the cables were all labelled ready to swap into the new vendor's routers.
Job done... right?
Time for something to eat then. The two of us in the east left the data centre, wandered into town and found a Five Guys burger place: looked promising, so we sat down with a typically small (yeah, right) American burger each.
Then the phone rang: it was the major incident manager from the UK. Had we changed anything yet? No, we insisted: we're sitting having dinner... but there was that nagging doubt even though we were sure we'd not made any change that would affect operation. Although we'd uploaded the new configs we'd absolutely, definitely, not made them live. And we were sure of that. Nonetheless, paranoia found us back in the data centre checking everything... only to find that everything was indeed unchanged.
Problem was that the network back home (our main core of activity where most systems lived) was intermittently accessible to the rest of the world. It'd be there one minute and ignoring pings the next.
A major incident was declared, and the legacy vendor (who provided the home systems as a managed service) set to work trying to figure it out. An hour later we were no wiser as to the problem, so I reluctantly called off the work on the basis that with a flaky network it would be impossible to figure out whether failed post-change tests were down to the change or the underlying problem.
Many, many hours spent on the phone overnight yielded little or no improvement, and so we let the service provider work on the problem while we headed to the airport for our flights home.
We agreed that the problem was something to do with the firewalls at the home location (these drove the WAN as well as the Internet connections); they were fairly old and running a not-too-old-but-not-current-either software version. The provider agreed to ship a new pair – and did so by putting their top firewall guru on a flight and sending him with a cardboard box full of kit. The following day things were working pretty well, and so we relaxed and got back to normal.
Interestingly the vendor ran up the old firewalls in the lab and they behaved impeccably. Didn't miss a beat, and certainly couldn't replicate the issue we'd seen. A month later I headed back to the US and we did the planned works without a hint of a problem. (Except when the new vendor's PM told me jokingly to “f**k off” when I pointed out with tongue in cheek that although 99.9 per cent of the testing had passed there was one trivial case that failed so we might have to invoke the 11-hour rollback process … but I think I deserved that one).
What a lovely pair
Some days passed and we saw the weird behaviour start to appear again. Couldn't have been anything to do with the US network (it was all new, after all) and so we started theorising about what it might be. Our legacy vendor – whose engineers were some of the best I've ever worked with – had been racking their brains and trying more stuff in the lab and they suggested killing off one of the two firewalls in the resilient pair and working on one. Quick as you like, everything returned to normal. Hmmm.
Once we'd made this discovery, the rest was easy. The firewalls weren't in the same data centre: they were an active/standby pair with one device in each data centre and a dark fibre connection between the two DCs. For some reason they'd got into a “split brain” state, where the standby unit lost track of the primary unit and said: “Hey, I'm going to be the master now”. Two active firewalls in an active/standby setup is seldom a good thing, and so it proved. Once we knew what was happening it was easy to keep track and to force the units back into active/standby should they get confused.
We eventually found the root cause: a faulty fibre module on the LAN switch at one end of the dark fibre link between the DCs. It was generally working absolutely fine except for a bazillionth of a per cent of the time, when it would go quiet for just long enough for the firewalls to lose each other.
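For anyone who hasn't met split brain before, here's a rough sketch of the mechanism in Python. It is a toy model, not the vendor's failover logic: the `simulate` function, the `dead_after` threshold and the one-heartbeat-per-tick timing are all assumptions made up for illustration, and it assumes (as we found) that a promoted standby stays active until someone forces it back.

```python
def simulate(link_up, dead_after=3):
    """Toy model of an active/standby firewall pair (hypothetical, for illustration).

    link_up[i] is True when the inter-DC link carries a heartbeat on tick i.
    The standby promotes itself after `dead_after` consecutive missed heartbeats.
    The primary never stopped being active, so from that tick on both are active.
    Returns the list of ticks on which both units were active (split brain).
    """
    primary_active = True      # the primary is healthy throughout
    standby_active = False
    missed = 0                 # consecutive heartbeats the standby has missed
    split_ticks = []

    for tick, up in enumerate(link_up):
        if up:
            missed = 0
        else:
            missed += 1
            if missed >= dead_after:
                # Standby decides the primary is dead and takes over.
                standby_active = True
        # Assumption: no automatic recovery; an operator has to force
        # the standby back, so once promoted it stays active here.
        if primary_active and standby_active:
            split_ticks.append(tick)
    return split_ticks


# A two-tick blip is shrugged off; a three-tick blip triggers split brain.
short_blip = [True] * 5 + [False] * 2 + [True] * 4
long_blip = [True] * 5 + [False] * 3 + [True] * 4
print(simulate(short_blip))
print(simulate(long_blip))
```

The point of the sketch is that the link only has to stay quiet for as long as the heartbeat timeout, and not a moment longer, for the pair to end up with two masters while the primary is perfectly healthy.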
The moral of the story: if there's complexity in connecting a resilient pair of devices, there's risk in doing so. A normal setup would have been to have the units in the same location and a LAN cable directly between the two, which would generally either be seen as up or down. Our setup had three miles of dark fibre, two fibre modules and two LAN switches and the behaviour in the event of an intermittent problem was unfamiliar to engineers who were used to the more conventional approach.
There's another moral, though. It had cost us a lot of money to get the people to the various locations, and abandoning the work was costly. Taking the decision to abandon was obvious but nonetheless very scary. The second moral: don't be afraid to make a hard decision based on the facts and opinions available – it's terribly hard the first time but the confidence you gain from doing so sets you up for the rest of your career. ®