Original URL: https://www.theregister.com/2013/11/03/defragging_the_data_centre/

Want a unified data centre? Don't forget to defrag the admins

And make sure you untick the 'wally' box

By Dave Cartwright

Posted in Systems, 3rd November 2013 20:38 GMT

An effective data centre is more than just some racks of servers with a bit of networking and storage attached.

It needs to be versatile, easy and quick to flex and reconfigure, both manually and automatically, and it needs to keep up with the demands of the applications that run there.

Historically, though, many of the components of the data centre have been purchased and installed separately. How can we pull all these components into a coherent whole? How, in other words, do we defragment our data centre?

If it’s broke, fix it

There is no point in trying to build a fence with rotten planks. No matter how well you nail them together, you will be playing hunt-the-tortoise before you have put your toolbox away.

So before you start considering how to make your kit work together, you need to ensure that the standalone elements are up to the job and configured in such a way that they stand a chance of working as hoped.

This is often a very simple matter of checking for obvious problems. Over the years I have seen some interesting setups: one example was a data centre LAN with servers and storage connected via multi-gigabit trunks, whose layer 3 functionality was offloaded to a low-end router that couldn't route at more than a couple of hundred Mbps.

Another was a chassis-based server setup that aggregated hundreds of virtual machines over a puny 4Gbps uplink.

There was also the dual-Gigabit LACP trunk where they had forgotten to enable LACP on one end; and a backup solution that would have gone twice as fast if someone had configured some simple network parameters properly. (We will come back to that last one later.)

So before you do anything take a step back and say: “Have we done something daft?” And if the answer is yes, fix it. Don't take forever over it but at least make sure you have unticked the “wally” box.

Contact details

Once things are looking a bit more sensible, you need to consider the touch points between the components of the infrastructure.

My personal favourite is where VMware ESXi hosts connect to the LAN. With my network manager hat on, I once spent a happy afternoon with the server guy and Mr Google. At the end of it we had quadrupled the speed of some LACP-trunked ESXi hosts just by working methodically through the LAN port config on both ends and reconfiguring it.
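That methodical both-ends comparison is really just a diff of two configurations. As an illustrative sketch only (the setting names and values here are made up, not from any particular switch or hypervisor), it amounts to something like this:

```python
# Illustrative sketch: compare the port settings on both ends of a trunk
# and flag anything that doesn't match. Field names and values are
# hypothetical, not from any specific vendor's CLI.

def trunk_mismatches(switch_end: dict, host_end: dict) -> list[str]:
    """Return the names of settings that differ between the two ends."""
    keys = sorted(set(switch_end) | set(host_end))
    return [k for k in keys if switch_end.get(k) != host_end.get(k)]

switch = {"lacp": "active", "speed": "1000", "duplex": "full", "mtu": 9000}
esxi   = {"lacp": "off",    "speed": "1000", "duplex": "full", "mtu": 1500}

for setting in trunk_mismatches(switch, esxi):
    print(f"mismatch: {setting}: switch={switch[setting]} host={esxi[setting]}")
```

The point is not the code but the discipline: walk every parameter on both ends, side by side, rather than eyeballing each end in isolation.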

This is where you need your subject matter experts to come together and collaborate – perhaps for the first time. The backup instance I mentioned earlier was a classic illustration of the need for collaboration.

You need the configuration to be the same on all the components: the link from the storage to the backup server (if it is iSCSI connected), the backup server operating system, the switch and router ports at every point from the backup server to the device being backed up, the virtual switches (if you are in a virtual environment) and the guest server operating system.

One misconfiguration and everything is going to go at the speed of the lowest common denominator. To get it right you need the storage, network and server guys to get their heads together and set it up together.
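To make the lowest-common-denominator effect concrete: a single hop left at a default frame size caps the whole path, however well everything else is tuned. The hop names and MTU values below are invented for illustration:

```python
# Sketch: the effective frame size over a backup path is the smallest MTU
# configured anywhere along it. Hop names and values are hypothetical.

path_mtus = {
    "storage-iscsi-link": 9000,
    "backup-server-os": 9000,
    "access-switch-port": 1500,   # the one nobody reconfigured
    "router-interface": 9000,
    "guest-os-vnic": 9000,
}

effective = min(path_mtus.values())
print(f"effective MTU for the whole path: {effective}")
for hop, mtu in path_mtus.items():
    if mtu != effective:
        print(f"{hop} is set to {mtu} but will only ever see {effective}")
```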

Extend the thinking to all the core touch points, then, starting with the obvious ones.

Consider also the number of places where your servers and storage touch the network. In a virtual server setup there is really no excuse, for instance, for backing servers up over the production subnet/VLAN – particularly if the backup server is on a different VLAN from the devices being backed up. It just means your router is being hammered.

Instead drop in a dedicated VLAN for backups, plumb it into your vSwitches (no downtime required) and add a dedicated backup NIC to each of the servers you are backing up (again, there is generally no downtime). And if you are being really sensible you will do backups at hypervisor level anyway, instead of agent-based ones on each virtual machine’s guest operating system.

Make it a double

One of the things you find in infrastructure technology is that you can often do the same thing in more than one place.

Take data compression, for example. All proper storage sub-systems have their own built-in compression or deduplication options, but if you work up through the layers you will often find equivalent functionality elsewhere too (particularly in the server virtualisation layer).

You can even turn on data compression right at the top of the stack on a Windows server, but of course that is only for the criminally insane.

Similarly, in network switching your hypervisor is able to pass traffic between some virtual machines (notably those in the same VLAN and on the same physical host) without the physical LAN switches ever seeing it.

This is where you take the next step of collaboration. You have already got your service owners together, so now is the time to bring in the vendors as well.

Where you can do the same thing in two places, get the vendors involved

One particular storage vendor recently told me that the best way to eke performance out of its kit was to get the VMware guys to “thick provision” virtual disks (set aside the entire amount of space instead of letting VMware grow them as required).

Similarly, I asked a virtualisation guy a while ago whether I should be turning on the optimisation options on the storage or the hypervisor. “Both,” he answered very firmly.

So where you can do the same thing in two places, get the vendors involved and make sure you go for the best combination of options. This will require your internal experts to work together, as you will often need to change two systems at once to eliminate needless downtime.

Call the experts

If you are thinking at this point that we have forgotten the application guys, don't worry: we haven't.

While the server, storage and network experts are all working together they need to consider the application team as customers – in part at least. In return, the application people need to take a step back and think, instead of simply clicking Next, Next, Next in their installers.

How many applications do we all have where the app is back-ended by a relational database, and that relational database is in fact a copy of SQL Server Express that was installed by the app installer? (Answer: shedloads).

How many of those databases are properly backed up – or for that matter even known to the servers and storage teams? (Answer: not many, I suspect).

While we are at it, how many application servers are actually provisioned correctly in terms of RAM and CPU?

It is very easy to set up a server based on the system requirements of the apps that will be running on it, but it is rare to see any real scientific analysis done, particularly in a virtual world where you can over-provision CPU on servers and let the hypervisor share out the spare cycles for you.
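The "scientific analysis" needn't be elaborate; even the basic over-commit arithmetic is rarely done. A sketch, with entirely hypothetical host and guest figures (the 3:1 ratio shown is often quoted as tolerable for general workloads, but your mileage will vary):

```python
# Sketch of the over-provisioning arithmetic: total vCPUs allocated to
# guests versus physical cores on the host. All figures are illustrative.

physical_cores = 32
vm_vcpus = {f"vm-{i:02d}": 8 for i in range(12)}  # twelve 8-vCPU guests

allocated = sum(vm_vcpus.values())
ratio = allocated / physical_cores
print(f"{allocated} vCPUs on {physical_cores} cores: {ratio:.1f}:1 over-commit")
```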

So the application teams need to work with the server and storage teams in particular. One especially forward-thinking organisation I worked with had a dedicated SQL Server cluster which serviced the back-end database requirements for the front-end apps that needed it.

It was resilient, it was efficiently provisioned, it saved the company a packet in licensing (many apps were too big to fit into the limits of SQL Server Express and so needed a commercial version) and it was backed up properly.

Group therapy

So the main thing you need to do to defrag your data centre is to defrag the teams that look after it.

At the very least, even if you leave the application team on the periphery and consider them as a service user (or a customer) of the infrastructure, you need to have the infrastructure team under one roof, preferably headed by a single manager.

I have worked with companies where servers, storage and networks were looked after by separate teams. In some it worked, and in others it didn't.

Quite recently I also worked in a company where the key service owners and specialists for servers, storage and networks (the latter being me) all sat within 20 feet of each other, and it was by far the most effective team I have ever worked in.

This is not to say everything was centralised – in fact each of us had technical staff based in other continents – but co-ordination lay with the central team and if there was a major issue we could convene in minutes. We could work through the diagnosis as a group instead of a finger-pointing mess.

In short: if you have service teams in silos, your data centre is likely to run in the same way. Give the setup a once-over and bring the people together to make it better.

Most importantly, don't be frightened to bring in the vendors as part of the extended team. Your particular combination of systems will almost certainly have unique interactions that the vendors’ specialists can help you with.

You may well surprise yourself by how quickly you have achieved a coherent, unified infrastructure in your data centre. ®