What's wrong with network monitoring tools? Where do I start...
That red screen? It's just embarrassment
Opinion For as long as I can remember I've worked in an environment where there's a screen on the wall showing the status of the company's systems. Or actually, in one case, showing the status of the company's systems unless there was a test match on.
From time to time that information's been useful. Unfortunately, most of the time we've known that there's a problem because half a dozen users have called to raise tickets – the screens haven't necessarily updated in time, and when they have, I've been left to work out in my head what it actually means for the business that port 12 on switch 3 has gone down.
I've seen dozens of monitoring packages, and they've all been hideously inadequate. Some have been hideously expensive alongside their hideous inadequacy. So why is this? Why does nobody write monitoring packages that actually monitor stuff and tell you what you need to know when you need to know it?
To be fair to monitoring software vendors, they're off to a bad start because the tools available to them are simply appalling.
SNMP (the Simple Network Management Protocol – though frankly there's nothing simple about it) is unwieldy and clunky to use, but we're stuck with it because its longevity has made it ubiquitous. Let's face it, nobody with any sense is about to try to produce an alternative because the barriers to entry into the market are insurmountable.
WMI (Windows Management Instrumentation) is actually very good, but of course it's a Microsoft-only concept so you're stuck with using it only on your Windows estate. Finally you have Syslog... well, you can give a simple priority to each type of alert but the content is largely unstructured and so the usefulness is limited.
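To make the Syslog point concrete, here is a minimal Python sketch of how little structure a message gives you. The leading `<PRI>` number encodes the facility multiplied by eight plus the severity; everything after it is free text. The function name `decode_pri` and the sample message are illustrative, not taken from any particular monitoring product.

```python
import re

# Severity names in numeric order 0..7, as defined by the syslog protocol.
SEVERITIES = ["emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug"]


def decode_pri(line):
    """Split a raw syslog line into (facility number, severity name, rest).

    Hypothetical helper for illustration: only the <PRI> prefix is parsed,
    because that is the only part with a guaranteed structure.
    """
    match = re.match(r"<(\d{1,3})>(.*)", line, re.DOTALL)
    if match is None:
        raise ValueError("no <PRI> header found")
    pri = int(match.group(1))
    if pri > 191:
        raise ValueError("PRI out of range")
    # PRI = facility * 8 + severity, so divmod recovers both parts.
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity], match.group(2)


if __name__ == "__main__":
    # Made-up sample message; the text after the prefix is unstructured.
    print(decode_pri("<189>switch3: Interface Gi0/12 changed state to down"))
```

Beyond that one number, the monitoring tool is left pattern-matching on whatever wording the device vendor happened to choose.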
The next problem is that many monitoring engines are written by people who understand the protocols but have never really had to monitor anything in real life. So it's all oriented around comparing CPU usage with thresholds, alerting when a switch interface has gone down, and so on.
I've yet to use a monitoring tool that looks like the first step in its development was to send a bunch of analysts to interview network managers and say: “OK, what do you want to be able to do?”
Or if they have, they've gone back to the developers who've said: “Sorry guys, SNMP can't do that, we'll just have to make the dashboard prettier and hope people won't notice it's the same as before.”
So what would the analysts find?
Let's imagine, then, that I'm an infrastructure manager and one of the aforementioned analysts descends on me for a couple of hours. What would I be saying I want? Well, here are my top 10.
1. Wildlife camera feature
The camera crews that follow Sir David Attenborough around are these days blessed with cameras that are constantly recording – the last few seconds/minutes of footage are retained and overwritten in a loop. When something interesting happens they hit the “Record” button and the last few seconds/minutes are committed to storage. This means they don't have to have the trigger finger of John Wayne on speed. I want that for my core network ports: when I have a problem, the traffic I care about is what has flowed for the past five, 10, 15 minutes so I want to retain it for a sensible amount of time.
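The looping-camera idea maps onto a time-bounded buffer. The following is a rough Python sketch, assuming packets arrive as (timestamp, payload) pairs from some capture source that isn't shown; the class name `PreTriggerBuffer` and its methods are hypothetical.

```python
from collections import deque


class PreTriggerBuffer:
    """Keep only the most recent `window_seconds` of captured packets."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._packets = deque()

    def add(self, timestamp, packet):
        """Store a packet and discard anything older than the window."""
        self._packets.append((timestamp, packet))
        cutoff = timestamp - self.window_seconds
        # Assumes timestamps arrive in non-decreasing order.
        while self._packets and self._packets[0][0] < cutoff:
            self._packets.popleft()

    def snapshot(self):
        """The 'Record' button: return a copy of what is currently retained."""
        return list(self._packets)


if __name__ == "__main__":
    buf = PreTriggerBuffer(window_seconds=900)  # 15 minutes
    buf.add(0, b"old traffic")
    buf.add(600, b"recent traffic")
    buf.add(1000, b"traffic at the moment of the fault")
    print(buf.snapshot())  # the packet stamped 0 has aged out
```

A real implementation on a core port would also need to bound memory by bytes, not just by time, but the principle is the same: record constantly, commit only when something interesting happens.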
2. Filter by device
If a switch lights up red on the monitoring screen, I want to click on it and pop up the alerts and Syslog entries that relate to it. If a port lights up I want to see that data filtered for that port.
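None of this is hard once alerts and Syslog entries are tagged with the device and port they came from. As a minimal sketch in Python, assuming a hypothetical `Event` record with those fields already populated:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical record for an alert or Syslog entry."""
    device: str
    port: Optional[int]  # None for device-wide events
    message: str


def events_for(events, device, port=None):
    """Return events for one device, optionally narrowed to one port."""
    return [
        e for e in events
        if e.device == device and (port is None or e.port == port)
    ]


if __name__ == "__main__":
    log = [
        Event("switch3", 12, "link down"),
        Event("switch3", None, "fan speed warning"),
        Event("switch7", 12, "link flapping"),
    ]
    print(events_for(log, "switch3"))      # click on the switch
    print(events_for(log, "switch3", 12))  # click on the port
```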
3. Muppet detector
I want the network monitoring package to tell me that the end-to-end connection between a virtual server and the backup server is inefficient because one of the eight or 10 LAN ports the traffic is traversing hasn't got Jumbo Frames turned on.
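The check itself is trivial once the tool knows which ports a flow traverses and what MTU each one is configured with. A Python sketch under exactly that assumption (the per-port MTU values are taken as already collected, and the port names are made up):

```python
def find_mtu_bottlenecks(hops):
    """Return the ports whose MTU is lower than the largest MTU on the path.

    `hops` is a list of (port_name, mtu) tuples in path order. How the path
    and MTUs are discovered is outside this sketch.
    """
    if not hops:
        return []
    best = max(mtu for _, mtu in hops)
    return [(name, mtu) for name, mtu in hops if mtu < best]


if __name__ == "__main__":
    # Hypothetical path between a virtual server and the backup server.
    path = [
        ("esx1:vmnic2", 9000),
        ("switch1:Gi0/1", 9000),
        ("switch1:Gi0/24", 9000),
        ("switch3:Gi0/12", 1500),  # someone forgot Jumbo Frames here
        ("backup1:eth0", 9000),
    ]
    for name, mtu in find_mtu_bottlenecks(path):
        print(f"{name} is limiting the path: MTU {mtu}")
```

The hard part, and the part monitoring tools tend not to do, is knowing the path in the first place.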
4. Which way?
I want to see (visually and legibly) the path used by traffic between two endpoints. That means understanding what the load balancer is doing, figuring out which of the physical nodes in a Virtual Router Redundancy Protocol group is carrying the traffic, and so on. And when you've done it, show me the step-by-step operation of the application traffic so I can see where the delays are (and do it at application level, please, so that I can see that, say, the network is fast but the app is being killed by DNS timeouts).
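As a sketch of the application-level view, assume the tool has already timed each stage of a transaction in milliseconds; picking out the culprit is then a one-liner. The stage names and timings below are invented for illustration.

```python
def slowest_stage(stages):
    """Return (stage_name, share_of_total_time) for the slowest stage.

    `stages` is a list of (name, milliseconds) tuples for one transaction.
    """
    total = sum(ms for _, ms in stages)
    if total <= 0:
        raise ValueError("no elapsed time recorded")
    name, ms = max(stages, key=lambda stage: stage[1])
    return name, ms / total


if __name__ == "__main__":
    # Hypothetical timings: the network is fast, DNS is not.
    transaction = [
        ("dns_lookup", 2000.0),
        ("tcp_connect", 3.0),
        ("tls_handshake", 12.0),
        ("server_response", 85.0),
    ]
    name, share = slowest_stage(transaction)
    print(f"{name} accounts for {share:.0%} of the delay")
```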