Going strictly hands-off: Managing your data centre from afar
Techniques for saving your sanity, and your job
If your core servers – and hence your core applications – live in a data centre, then by definition they're not on your premises.
In many cases they may be hundreds of miles away – in fact, in a previous life, my employer's most distant data centre was six time zones away in the US Midwest.
This means that you don't have the option of wandering into the server room and power-cycling something; instead you need to work hard to make your systems manageable from afar.
Absolutely core to the remote management of data centres is accurate, complete and rigorously updated documentation. You can't just nip in and have a peek at stuff, and so you need to be able to rely completely on the documentation for information about the system.
Every connection – power, serial, LAN, the lot – needs to be rigorously documented and a regime of capital punishment initiated to deter people from not updating the docs when something changes.
It only takes one undocumented power or LAN change to make your world fall apart when you confidently but inadvertently disable something crucial because the docs differed from reality.
Similarly, document the front and rear panels of all the devices in the cabinets, along with the possible statuses of all the flashing lights and what each means: we'll come to why in a moment.
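One way to make the connection records checkable before you disable anything is to keep them in a machine-readable form. What follows is a minimal sketch only, not a recommendation of any particular tool: the device names, port names and the devices_losing_all_power helper are all invented for illustration, and it is only as good as the records you feed it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    kind: str         # "power", "serial" or "lan"
    device: str       # the device being connected (hypothetical names below)
    device_port: str  # the port or PSU on that device
    far_end: str      # what it plugs into: a PDU, switch, console server...
    far_port: str     # the outlet or port at the far end


# Hypothetical inventory: in practice this would be loaded from your docs.
CONNECTIONS = [
    Connection("power", "db01", "PSU-left", "pdu-a", "outlet-3"),
    Connection("power", "db01", "PSU-right", "pdu-b", "outlet-3"),
    Connection("power", "fw01", "PSU-1", "pdu-a", "outlet-4"),
    Connection("lan", "db01", "eth0", "sw01", "port-12"),
]


def devices_losing_all_power(connections, far_end):
    """Return the devices whose every documented power feed comes from far_end."""
    feeds = {}
    for c in connections:
        if c.kind == "power":
            feeds.setdefault(c.device, set()).add(c.far_end)
    return sorted(dev for dev, sources in feeds.items() if sources == {far_end})


if __name__ == "__main__":
    # db01 has a second feed from pdu-b, so only fw01 is listed.
    print(devices_losing_all_power(CONNECTIONS, "pdu-a"))
```

The point of the check is the one made above: if the records say a device has a second feed and reality disagrees, the script will cheerfully give you the wrong answer.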
The flipside of documentation is labelling everything in the data centre cabinets. Unless you're geographically very close to your data centre, you're likely to call on the data centre provider's staff from time to time to do something for you – install a new LAN connection, or maybe fit a replacement hot-swap power supply when an old one dies.
So give your devices names, and label them on the front and back. Label every cable a few inches from each end (not right at the end – you won't be able to get at the labels to read them).
And this is why I've had you document the front and rear panels: if your server has two hot-swap power supplies and one dies, you need to be absolutely certain to tell the provider's “intelligent hands” person which one to pull out.
And of course because you documented all the LED status options, you can get him or her to double check before pulling: “It's the one on the left, but before you pull please confirm that the light's flashing yellow, as that signifies it's the failed unit.”
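If the LED statuses are written down as data rather than prose, that double-check can be generated instead of remembered. This is a rough sketch under stated assumptions: only "flashing yellow" comes from the example above, the other patterns are made up, and the real ones would come from the vendor's manual for each device. The pull_instruction function and the device name are hypothetical.

```python
# Documented LED patterns per unit type. Only "flashing yellow" is taken from
# the article's example; the other entries are illustrative assumptions.
LED_MEANINGS = {
    ("psu", "solid green"): "healthy",
    ("psu", "flashing yellow"): "failed",
    ("psu", "off"): "no input power",
}


def pull_instruction(device, position, unit="psu", state="failed"):
    """Build the sentence to read out to the remote hands person."""
    matches = [
        led
        for (kind, led), meaning in LED_MEANINGS.items()
        if kind == unit and meaning == state
    ]
    # Refuse to guess if the docs are missing or ambiguous for this state.
    if len(matches) != 1:
        raise ValueError(f"need exactly one LED pattern for {unit!r} in state {state!r}")
    return (
        f"On {device}, it's the {unit.upper()} on the {position}, but before you pull "
        f"please confirm that the light is {matches[0]}."
    )


if __name__ == "__main__":
    print(pull_instruction("db01", "left"))
```

Raising an error when the pattern is undocumented is deliberate: a gap in the docs is something you want to find at your desk, not while someone's hand is on the power supply.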
Next on the list we have another core aspect of stuff being a long way away: you can't just whack another disk into the box if you run out of space.
There are so many monitoring tools on the market – and so many free ones – that there's no excuse for not monitoring your data centre to death, both to check that everything's healthy and to do capacity planning and usage trending for key resources.
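Usage trending need not be sophisticated to be useful. As a sketch of the idea, and assuming growth is roughly linear (which it often isn't), the function below fits a least-squares straight line to disk usage samples and estimates how many days remain before the volume fills. The days_until_full name and the sample figures are invented for illustration; a real monitoring tool will have its own forecasting.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days from the last sample until usage reaches capacity.

    samples is a list of (day_number, used_gb) pairs in date order.
    Returns None if usage is flat or shrinking. Assumes linear growth.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n

    # Ordinary least-squares slope and intercept.
    sxx = sum((day - mean_day) ** 2 for day, _ in samples)
    sxy = sum((day - mean_day) * (used - mean_used) for day, used in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one day")
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_used - slope * mean_day

    full_day = (capacity_gb - intercept) / slope
    last_day = samples[-1][0]
    return max(0.0, full_day - last_day)


if __name__ == "__main__":
    # Hypothetical daily readings: growing by 10GB a day on a 200GB volume.
    readings = [(0, 100), (1, 110), (2, 120), (3, 130)]
    print(days_until_full(readings, capacity_gb=200))
```

With a data centre hundreds of miles away, the number that matters is whether that estimate is comfortably longer than the time it takes to get a new disk ordered, shipped and fitted.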
Run up proper monitoring, preferably in a form that doesn't rely on the data centre being fully functional. How can you send an alert that everything went down if the monitoring server's on one of the boxes that went down? Maybe you could even look to one of the many cloud services that offer system monitoring.
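To make the "outside looking in" idea concrete, here is a bare-bones sketch of a probe you could run from somewhere other than the data centre, such as an office machine or a small cloud instance. It only tests whether a TCP port accepts connections, which is far less than a real monitoring product does, and the hostnames and ports in the example are placeholders, not real systems.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


def unreachable(targets, timeout=3.0):
    """Return the (host, port) pairs that did not accept a connection."""
    return [(host, port) for host, port in targets if not is_reachable(host, port, timeout)]


if __name__ == "__main__":
    # Placeholder targets: substitute the services you actually care about.
    targets = [("www.example.com", 443), ("mail.example.com", 25)]
    for host, port in unreachable(targets):
        print(f"ALERT: {host}:{port} is not answering")
```

Because the probe lives outside the data centre, it can still raise the alarm when the whole site drops off the network, though a failure could equally mean the probe's own connection has gone, so it is worth running it from more than one place.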