Belt and braces stop the network falling down
Count everything, then double it
Lately, everyone seems to have lined up to join the network failure party. In some cases, lax network security has been to blame.
In others, upgrade issues coupled with fundamental design flaws have done the damage.
An inability to cope with denial-of-service attacks by angry internet mobs has even resulted in disruptions to networks that are arguably among the most important on the planet.
The failure of so many major corporations and government agencies to prevent network mishaps is far from an excuse for regular sysadmins to slack. Quite the opposite: these events highlight the general public's increasing unwillingness to forgive such breakdowns.
Whale of a time
Network outages lead to damning press coverage that can tarnish a brand for years, and even lead some to question the maturity of cloud computing. Network failure is so common that it even has its own whale mascot.
At a minimum, network redundancy requires two of everything: two of every switch, router, network card and cable on the network. Ideally, there should be three of everything: that way you can take one set of equipment offline for scheduled maintenance while both a primary and a backup remain active.
Preventative maintenance touches on something far more important than the choice of hardware. The single most important element of network redundancy is a philosophy.
We in IT don't get to live by the axiom "if it ain't broke don't fix it". It is always broken. Even when it doesn't appear to be broken, there is bound to be a security flaw in the code somewhere that you will eventually have to patch.
Go forth and multiply
This means that while there should be three of everything, there should really be four of almost everything. The fourth set is the sandbox: a research and development environment that needs to be in place to run through configurations before deployment.
The internet is littered with stories of network admins who botched an update by forgetting a single character in a config file. If you are charged with running a mission-critical network, you really don't want that to be you.
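Better to fat-finger the lab than the production core. As a rough sketch of what that rehearsal can look like, the Python below uses the netmiko library to push a candidate config to a hypothetical sandbox switch and run a couple of crude sanity checks; the device type, hostname, credentials and checks are all stand-ins for illustration, not a prescription.

```python
# Rehearse a config change against the lab/sandbox device before production.
# Hostname, credentials and commands here are hypothetical placeholders.
from netmiko import ConnectHandler

LAB_SWITCH = {
    "device_type": "cisco_ios",     # assumption: an IOS-like lab switch
    "host": "lab-switch.example",   # hypothetical sandbox device
    "username": "labadmin",
    "password": "labpassword",
}

def rehearse_change(config_file: str) -> bool:
    """Push a candidate config to the sandbox and run basic sanity checks."""
    with ConnectHandler(**LAB_SWITCH) as conn:
        output = conn.send_config_from_file(config_file)
        print(output)

        # Crude checks: did the device reject anything, and are interfaces
        # still up? Real verification would be specific to the change.
        if "Invalid input" in output:
            return False
        interfaces = conn.send_command("show ip interface brief")
        return "up" in interfaces

if __name__ == "__main__":
    ok = rehearse_change("candidate_change.cfg")
    print("Safe to schedule for production" if ok else "Back to the drawing board")
```

Only once the sandbox swallows the change cleanly does it get scheduled for the real kit.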
This is complicated further if, like most sysadmins these days, you are not always on site when upgrades are taking place. Toying with network gear is uniquely risky because the network is your access point for remote administration.
Having a back door is important. It doesn't have to be a sexy one, just a functional one. If you run a massive hyperscale data centre, then you probably have multiple redundant fibre-optic links providing you with all sorts of ways into the data centre. Smaller organisations don't have this luxury.
Fax of life
The right equipment can help you here, and it is available dirt cheap or even free. Consider a branch office scenario. If the branch office has a fax line, you are in business.
A simple appliance called the Stick can take that fax line you are already paying for and enable you to dial in to an onsite bridgehead system in case of an emergency. Sure, dial-up is largely worthless, but it will pass an RDP session and it will reload that config file you screwed up.
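The catch with a last-resort back door is that nobody notices it has died until the day it is needed. A minimal sketch, assuming a pyserial-reachable modem on /dev/ttyS0 (the port and settings are assumptions), simply confirms the thing still answers a Hayes AT command:

```python
# Periodic sanity check that the out-of-band modem still responds.
# The serial port path and settings are assumptions for illustration.
import serial  # pyserial

def modem_answers(port: str = "/dev/ttyS0") -> bool:
    """Send a basic Hayes 'AT' command and check the modem replies 'OK'."""
    with serial.Serial(port, baudrate=9600, timeout=2) as line:
        line.write(b"AT\r")
        reply = line.read(64).decode(errors="ignore")
        return "OK" in reply

if __name__ == "__main__":
    print("Back door alive" if modem_answers()
          else "Back door is down - fix it before you need it")
```

Run something like that from cron and you find out the back door is broken on a quiet Tuesday, not halfway through an outage.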
It might seem anachronistic to go from talking about four layers worth of network redundancy to a dial-up modem as your last line of defence, but it is all part of the same philosophy.
If you cut this cable, how much trouble are you in?
Create a detailed network map that includes every device and link on your network, then consider, one by one, what happens if each of them fails. If you cut this cable, how much trouble are you in? If you take a gun and shoot that switch, can your network survive?
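One way to make that thought experiment systematic is to model the map as a graph and let it tell you where the single points of failure are. A rough sketch using the networkx library, on an entirely invented topology:

```python
# Find single points of failure in a (hypothetical) network map.
# Articulation points are devices whose loss splits the network;
# bridges are the cables whose loss does the same.
import networkx as nx

net = nx.Graph()
# Made-up topology: two core switches, two access switches, one lonely branch.
net.add_edges_from([
    ("core1", "core2"),
    ("core1", "access1"), ("core2", "access1"),
    ("core1", "access2"), ("core2", "access2"),
    ("access2", "branch-router"),   # only one path to the branch
])

print("Shoot this switch and the network splits:", list(nx.articulation_points(net)))
print("Cut this cable and the network splits:   ", list(nx.bridges(net)))
```

Every articulation point is a switch you cannot afford to shoot, and every bridge is a cable you cannot afford to cut; on a properly redundant network, both lists come back empty.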
The challenge is convincing the people with the money to invest in redundancy. They need to know what a network outage can mean to the bottom line. The hard part is changing a culture of "good enough" into one that understands the false economy of ignoring redundancy.
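A back-of-envelope sum usually makes the point better than philosophy does. The figures below are invented; the arithmetic is the argument:

```python
# Toy break-even calculation: redundant kit vs expected outage cost.
# Every figure here is hypothetical; substitute your own.
outage_cost_per_hour = 20_000        # lost revenue, SLA penalties, staff time
expected_outage_hours_per_year = 6   # without redundancy
redundant_kit_cost_per_year = 40_000

expected_loss = outage_cost_per_hour * expected_outage_hours_per_year
print(f"Expected annual outage cost: {expected_loss}")
print(f"Annual cost of redundancy:   {redundant_kit_cost_per_year}")
print("Redundancy pays for itself" if redundant_kit_cost_per_year < expected_loss
      else "'Good enough' wins this year")
```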
So go ahead and unplug that network cable. There's a backup...right? ®
"Either way, compared to the price of the kit, the price of driving down there, and the price of an hour's worth of outage, the price of a phone line and a modem might be entirely justifyable."
These sorts of things are, of course, much easier to justify to bean-counters immediately *after* a prolonged outage than when you're talking about a hypothetical one...
Good to see consideration of out-of-band management there: interesting use for a fax line. The emphasis on lab verification of changes is useful too. That being said, this sounds more like a wish list by someone who hasn't had to deal with either a) hard business realities or b) very complex network failures. I would hesitate to describe myself as any sort of network "guru", but I do build out and support data centres, offices, and WAN links of various sorts, including metro rings, and the biggest causes of long outages are usually poor design (often too complex), poor software (Brocade gets a special mention here!), and carriers (no comment!): in that order. Doubling and tripling up redundancy can improve failure rates (although I should mention that most vendors won't load-balance links properly with IGPs or port channels unless they can be divided by 2), but that ain't necessarily so. My worst outage involved a stray OSPF default screwing up a multi-homed site without out-of-band management: a single-homed site would have had higher reliability over the same calendar year.
Well, that can be arranged... by modem.
Anyhow, while it's not unreasonable to have to justify expenses, assessing the arguments requires domain knowledge that þe average olde tallyer of beanes just doesn't have. That in itself is a hidden source of misspending and thus cost. I say it would be interesting to find ways to fix that; being ignorance-based, it won't fix itself.