What's wrong with network monitoring tools? Where do I start...

That red screen? It's just embarassment

Next gen security for virtualised datacentres

Opinion For as long as I can remember I've worked in an environment where there's a screen on the wall showing the status of the company's systems. Or actually, in one case, showing the status of the company's systems unless there was a test match on.

From time to time that information's been useful. Unfortunately, most of the time we've known that there's a problem because half a dozen users have called to raise tickets – the screens haven't necessarily updated in time, and when they have I've had to correlate in my head the impact of the fact that I've just been told that port 12 on switch 3 has gone down.

I've seen dozens of monitoring packages, and they've all been hideously inadequate. Some have been hideously expensive alongside their hideous inadequacy. So why is this? Why does nobody write monitoring packages that actually monitor stuff and tell you what you need to know when you need to know it?

Dodgy protocols

To be fair to monitoring software vendors, they're off to a bad start because the tools available to them are simply appalling.

SNMP (the Simple Network Management Protocol – though frankly there's nothing simple about it) is unwieldy and clunky to use, but we're stuck with it because its longevity has made it ubiquitous. Let's face it, nobody with any sense is about to try to produce an alternative because the barriers to entry into the market are insurmountable.

WMI (Windows Management Instrumentation) is actually very good, but of course it's a Microsoft-only concept so you're stuck with using it only on your Windows estate. Finally you have Syslog... well, you can give a simple priority to each type of alert but the content is largely unstructured and so the usefulness is limited.

Protocol-driven software

The next problem is that many monitoring engines are written by people who understand the protocols but have never really had to monitor anything in real life. So it's all oriented around comparing CPU usage with thresholds, alerting when a switch interface has gone down, and so on.

I've yet to use a monitoring tool that looks like the first step in its development was to send a bunch of analysts to interview network managers and say: “OK, what do you want to be able to do?”

Or if they have, they've gone back to the developers who've said: “Sorry guys, SNMP can't do that, we'll just have to make the dashboard prettier and hope people won't notice it's the same as before.”

So what would the analysts find? Let's imagine, then, that I'm an infrastructure manager and one of the aforementioned analysts descends on me for a couple of hours. What would I be saying I want? Well, here are my top 10.

1. Wildlife camera feature

The camera crews that follow Sir David Attenborough around are these days blessed with cameras that are constantly recording – the last few seconds/minutes of footage are retained and overwritten in a loop. When something interesting happens they hit the “Record” button and the last few seconds/minutes are committed to storage. This means they don't have to have the trigger finger of John Wayne on speed. I want that for my core network ports: when I have a problem, the traffic I care about is what has flowed for the past five, 10, 15 minutes so I want to retain it for a sensible amount of time.

2. Filter by device

If a switch lights up red on the monitoring screen, I want to click on it and pop up the alerts and Syslog entries that relate to it. If a port lights up I want to see that data filtered for that port.

3. Muppet detector

I want the network monitoring package to tell me that the end-to-end connection between a virtual server and the backup server is inefficient because one of the eight or 10 LAN ports the traffic is traversing hasn't got Jumbo Frames turned on.

4. Which way?

I want to see (visually and legibly) the path used by traffic between two endpoints. That means understanding what the load balancer is doing, figuring out which of the physical nodes in a Virtual Router Redundancy Protocol group is carrying the traffic, and so on. And when you've done it, show me the step-by-step operation of the application traffic so I can see where the delays are (and do it at application level, please, so that I can see that, say, the network is fast but the app is being killed by DNS timeouts).

The essential guide to IT transformation

More from The Register

next story
The Return of BSOD: Does ANYONE trust Microsoft patches?
Sysadmins, you're either fighting fires or seen as incompetents now
Microsoft: Azure isn't ready for biz-critical apps … yet
Microsoft will move its own IT to the cloud to avoid $200m server bill
Oracle reveals 32-core, 10 BEEELLION-transistor SPARC M7
New chip scales to 1024 cores, 8192 threads 64 TB RAM, at speeds over 3.6GHz
US regulators OK sale of IBM's x86 server biz to Lenovo
Now all that remains is for gov't offices to ban the boxes
Object storage bods Exablox: RAID is dead, baby. RAID is dead
Bring your own disks to its object appliances
VMware vaporises vCHS hybrid cloud service
AnD yEt mOre cRazy cAps to dEal wIth
prev story


Implementing global e-invoicing with guaranteed legal certainty
Explaining the role local tax compliance plays in successful supply chain management and e-business and how leading global brands are addressing this.
7 Elements of Radically Simple OS Migration
Avoid the typical headaches of OS migration during your next project by learning about 7 elements of radically simple OS migration.
BYOD's dark side: Data protection
An endpoint data protection solution that adds value to the user and the organization so it can protect itself from data loss as well as leverage corporate data.
Consolidation: The Foundation for IT Business Transformation
In this whitepaper learn how effective consolidation of IT and business resources can enable multiple, meaningful business benefits.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?