What's wrong with network monitoring tools? Where do I start...

Original URL: https://www.theregister.com/2013/11/13/what_is_wrong_with_network_monitoring_software/

That red screen? It's just embarrassment

Posted in Networks, 13th November 2013 08:31 GMT

Opinion For as long as I can remember I've worked in an environment where there's a screen on the wall showing the status of the company's systems. Or actually, in one case, showing the status of the company's systems unless there was a test match on.

From time to time that information's been useful. Unfortunately, most of the time we've known that there's a problem because half a dozen users have called to raise tickets – the screens haven't necessarily updated in time, and when they have, I've had to work out in my head what it actually means that port 12 on switch 3 has gone down.

I've seen dozens of monitoring packages, and they've all been hideously inadequate. Some have been hideously expensive alongside their hideous inadequacy. So why is this? Why does nobody write monitoring packages that actually monitor stuff and tell you what you need to know when you need to know it?

Dodgy protocols

To be fair to monitoring software vendors, they're off to a bad start because the tools available to them are simply appalling.

SNMP (the Simple Network Management Protocol – though frankly there's nothing simple about it) is unwieldy and clunky to use, but we're stuck with it because its longevity has made it ubiquitous. Let's face it, nobody with any sense is about to try to produce an alternative because the barriers to entry into the market are insurmountable.

WMI (Windows Management Instrumentation) is actually very good, but of course it's a Microsoft-only concept so you're stuck with using it only on your Windows estate. Finally you have Syslog... well, you can give a simple priority to each type of alert but the content is largely unstructured and so the usefulness is limited.
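To make the Syslog point concrete: the priority is a single number at the front of each message that packs a facility and a severity together (facility times eight, plus severity, as laid out in RFC 3164 and RFC 5424), and everything after it is free text. A minimal Python sketch of pulling that apart follows; the function name and the sample message are invented for illustration.

```python
import re

# Severity names in numeric order, 0 (worst) to 7, as defined for Syslog.
SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]

PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)

def split_priority(line):
    """Return (facility, severity_name, free_text) for one raw Syslog line."""
    match = PRI_PATTERN.match(line)
    if match is None:
        raise ValueError("no <PRI> prefix found")
    pri = int(match.group(1))
    if pri > 191:  # 23 facilities * 8 + 7 is the largest valid value
        raise ValueError("priority out of range")
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity], match.group(2)

# Hypothetical message: the structured part ends at the closing '>'.
print(split_priority("<187>Nov 13 08:31:00 switch3 port 12 changed state to down"))
```

The facility and severity come out cleanly; the part a network manager actually cares about (which port, which switch, what happened) is still a string that needs per-vendor pattern matching.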

Protocol-driven software

The next problem is that many monitoring engines are written by people who understand the protocols but have never really had to monitor anything in real life. So it's all oriented around comparing CPU usage with thresholds, alerting when a switch interface has gone down, and so on.

I've yet to use a monitoring tool that looks like the first step in its development was to send a bunch of analysts to interview network managers and say: “OK, what do you want to be able to do?”

Or if they have, they've gone back to the developers who've said: “Sorry guys, SNMP can't do that, we'll just have to make the dashboard prettier and hope people won't notice it's the same as before.”

So what would the analysts find? Let's imagine, then, that I'm an infrastructure manager and one of the aforementioned analysts descends on me for a couple of hours. What would I be saying I want? Well, here are my top 10.

1. Wildlife camera feature

The camera crews that follow Sir David Attenborough around are these days blessed with cameras that are constantly recording – the last few seconds/minutes of footage are retained and overwritten in a loop. When something interesting happens they hit the “Record” button and the last few seconds/minutes are committed to storage. This means they don't have to have the trigger finger of John Wayne on speed. I want that for my core network ports: when I have a problem, the traffic I care about is what has flowed for the past five, 10, 15 minutes so I want to retain it for a sensible amount of time.
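The looping-camera idea maps onto a time-bounded buffer. Here is a rough Python sketch, assuming packets arrive as (timestamp, bytes) pairs from some capture source that is not shown; the class and method names are made up for the example.

```python
from collections import deque

class RollingCapture:
    """Keep only the packets seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._buffer = deque()  # (timestamp, packet_bytes), oldest first

    def add(self, timestamp, packet):
        """Store a packet and drop anything that has aged out of the window."""
        self._buffer.append((timestamp, packet))
        self._evict(timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self._buffer and self._buffer[0][0] < cutoff:
            self._buffer.popleft()

    def snapshot(self, now):
        """The 'Record' button: return everything still inside the window."""
        self._evict(now)
        return list(self._buffer)

# A 15-minute window on a hypothetical core port.
capture = RollingCapture(window_seconds=900)
capture.add(0, b"old")
capture.add(600, b"recent")
capture.add(1000, b"latest")
print(capture.snapshot(now=1000))
```

On a busy core port the real constraint is memory and disk throughput rather than the data structure, so a production version would more likely bound the buffer by bytes as well as by time.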

2. Filter by device

If a switch lights up red on the monitoring screen, I want to click on it and pop up the alerts and Syslog entries that relate to it. If a port lights up I want to see that data filtered for that port.

3. Muppet detector

I want the network monitoring package to tell me that the end-to-end connection between a virtual server and the backup server is inefficient because one of the eight or 10 LAN ports the traffic is traversing hasn't got Jumbo Frames turned on.
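Once the tool knows the ordered list of ports on a path and their configured MTUs, the check itself is tiny. A sketch in Python, assuming that path data has already been collected somehow; the port names and the 9000-byte jumbo size are illustrative assumptions, not a standard every estate uses.

```python
def check_jumbo_path(ports, jumbo_mtu=9000):
    """Return (effective_mtu, offending_ports) for an ordered list of ports.

    `ports` is a non-empty list of (port_name, configured_mtu) tuples.
    The 9000-byte default is an assumption; substitute your own standard.
    """
    effective_mtu = min(mtu for _, mtu in ports)
    offenders = [name for name, mtu in ports if mtu < jumbo_mtu]
    return effective_mtu, offenders

# Hypothetical path from a virtual server to the backup server.
path = [
    ("vmhost1:vmnic0", 9000),
    ("sw1:Gi1/0/1", 9000),
    ("sw1:Te1/1/1", 9000),
    ("sw2:Te1/1/1", 1500),  # the muppet
    ("sw2:Gi1/0/7", 9000),
    ("backup:eth0", 9000),
]
print(check_jumbo_path(path))
```

The hard part is not this comparison but building `path` in the first place, which is exactly the end-to-end topology knowledge most packages lack.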

4. Which way?

I want to see (visually and legibly) the path used by traffic between two endpoints. That means understanding what the load balancer is doing, figuring out which of the physical nodes in a Virtual Router Redundancy Protocol group is carrying the traffic, and so on. And when you've done it, show me the step-by-step operation of the application traffic so I can see where the delays are (and do it at application level, please, so that I can see that, say, the network is fast but the app is being killed by DNS timeouts).

5. Virtually comprehensible

I need my management package to understand the hypervisor layer so that I can do a packet capture on the virtual NIC of a virtual machine and the physical port to which it is eventually plumbed through the virtual infrastructure. Actually one of the switch vendors (Enterasys) has shown me this type of intelligence, but that's the exception rather than the norm.

6. Sensible discovery

Why are so many auto-discovery functions so bloody awful? I'd love a monitoring package that does some sensible discovery and presents you with something other than a single page with 950 overlapping icons. RiverSoft was one of the better offerings I've seen, but that was properly expensive, and actually Cabletron Spectrum had some clever bits in it, but the ones I've used lately are pretty awful.

7. Spanning Tree

Draw me a picture (a comprehensible one, please) of my Spanning Tree topology, highlight the root bridge(s), and report when changes take place. And don't tell me you can't do it because if the switches in the network know by talking to each other what their STP world looks like, you can use that traffic to figure it out too. Listen to the BPDUs and draw a picture.
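The root bridge part, at least, is simple arithmetic: in classic Spanning Tree the bridge with the lowest bridge ID (priority first, MAC address as the tiebreak) wins. A minimal Python sketch, assuming the BPDUs have already been captured and decoded into dictionaries; the field names and sample bridges are invented.

```python
def bridge_id(priority, mac):
    """Comparable bridge ID: priority first, then MAC address as tiebreak."""
    return (priority, int(mac.replace(":", ""), 16))

def elect_root(bridges):
    """Pick the root from bridges seen in BPDUs: the lowest bridge ID wins."""
    return min(bridges, key=lambda b: bridge_id(b["priority"], b["mac"]))

def root_changed(previous, current):
    """True when two snapshots of observed bridges disagree on the root."""
    return elect_root(previous)["mac"] != elect_root(current)["mac"]

# Hypothetical snapshot built from decoded BPDUs.
before = [
    {"name": "core1", "priority": 4096, "mac": "00:1a:2b:00:00:10"},
    {"name": "core2", "priority": 8192, "mac": "00:1a:2b:00:00:05"},
    {"name": "access1", "priority": 32768, "mac": "00:1a:2b:00:00:01"},
]
after = before[1:]  # core1 has dropped off the network

print(elect_root(before)["name"], "->", elect_root(after)["name"])
print("root changed:", root_changed(before, after))
```

Drawing the full tree also needs the root path cost and port roles from each BPDU, and per-VLAN or multiple-instance variants complicate matters, but the raw information is on the wire.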

8. Idiotproof GUI

I'll be slightly kind to the monitoring software vendors and point out that they're certainly not alone in this one: I've been reviewing software packages since 1994 and have come across probably half a dozen GUIs – across all product types – that I've considered brilliant.

But why are so many monitoring screens so unintuitive? Remember that in many cases the people using these screens are junior, inexperienced, level-1 support staff: this means it needs to be simple and understandable. I should be able to group devices together by dragging them around, or by saying “Everything in subnet XXX”, or “Everything whose name matches this pattern”... and yet so many packages don't let you do this. And I want to be able to click on anything and go to a screen showing me the data I think I want to see.
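Neither kind of grouping is hard to implement, which makes its absence all the more irritating. A short Python sketch using only the standard library; the device inventory and function names are hypothetical.

```python
import fnmatch
import ipaddress

def in_subnet(devices, cidr):
    """Names of devices whose address falls inside the given subnet."""
    network = ipaddress.ip_network(cidr)
    return [d["name"] for d in devices
            if ipaddress.ip_address(d["ip"]) in network]

def named_like(devices, pattern):
    """Names of devices matching a shell-style wildcard pattern."""
    return [d["name"] for d in devices
            if fnmatch.fnmatchcase(d["name"], pattern)]

# Hypothetical inventory.
devices = [
    {"name": "lon-sw-01", "ip": "10.1.20.11"},
    {"name": "lon-sw-02", "ip": "10.1.20.12"},
    {"name": "lon-fw-01", "ip": "10.1.30.1"},
    {"name": "nyc-sw-01", "ip": "10.2.20.11"},
]

print(in_subnet(devices, "10.1.20.0/24"))
print(named_like(devices, "*-sw-*"))
```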

9. Storage should understand storage

Show me IOPS counts on my iSCSI subnet, and picture my Fibre Channel switch zoning. And, for that matter, tell me I've buggered up the resilience in the zoning. With iSCSI increasingly high-performing (hardware-based iSCSI can be screamingly fast) it's becoming more popular, but monitoring screens are seldom over-burdened with storage information.

10. Application-centricity

The monitoring package must understand the applications. As the infrastructure guy I'm not actually responsible for keeping the infrastructure running; I'm responsible for providing the resources that the apps need in order to run. The sales director probably doesn't care that one of my WAN links is down, for instance, because I have resilient links and the failover was seamless when someone put a digger through the fibre. So by all means let me know that something's down, but report at multiple levels with different views for application owners, business managers and the like. So when a fibre breaks, turn my screen bright red but turn the sales manager's a tasteful shade of pastel amber, so he knows something's wrong but his service is performing normally.
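The red-versus-amber rule can be written down in a few lines. This is only a sketch of the idea in Python, with invented audience names and a deliberately crude two-flag model of what has gone wrong.

```python
def screen_colour(audience, component_down, service_impacted):
    """Choose a status colour for one audience's view of one service.

    `audience` is "infrastructure" or anything else (application owners,
    business managers); both flags are assumed to come from elsewhere.
    """
    if service_impacted:
        return "red"  # everyone needs to know the service is suffering
    if component_down:
        # Resilience has absorbed the failure: urgent for the infrastructure
        # team, merely noteworthy for the people who own the service.
        return "red" if audience == "infrastructure" else "amber"
    return "green"

# A digger goes through one of two resilient WAN links.
print(screen_colour("infrastructure", component_down=True, service_impacted=False))
print(screen_colour("sales", component_down=True, service_impacted=False))
```

The logic is trivial; what the package needs is the mapping from components to the services that depend on them, so that `service_impacted` can be worked out rather than guessed.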

How to do it

Achieving all of the above is non-trivial, but most of the tasks are do-able. For instance:

As I've said, the Spanning Tree traffic on a network provides enough information to build a picture. So do it, please.
You can improve your GUI by talking to people who use it and employing usability analysts.
The APIs exist to allow you to figure out the physical path from A to B, even in a virtual environment. Enterasys's own management suite does it very well, for instance, so why don't we see it in mainstream monitoring apps?
Some management packages are able to download configs from switches and routers and highlight changes, but they don't take the next step of actually parsing the config and understanding it. By definition these are structured files which provide an accurate representation of the config (the router wouldn't work otherwise!), so use them to understand the configuration. Add in some ISDP/CDP/LLDP information and you're starting to build the end-to-end picture.
Time-limited packet capture should be trivial, particularly with the ability to do virtual span ports and the like in your VM infrastructure.
If you're monitoring packets you can figure out (say) that a DNS query didn't get a response, or that it took 10 seconds. Ever seen a Bounce Diagram in Compuware's application analysis tools? Shouldn't be too hard to come close for a network equivalent.
Yes, Syslog data is pretty unstructured but if you take (say) the top five vendors in each market you should be able to find enough patterns in the messages to make meaningful decisions. At least one vendor out there is doing natural language processing on Syslog messages, interestingly – I'm quite looking forward to seeing their package in the next week or two.
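
The DNS point in the list above can be sketched in a few lines of Python. This assumes the packets have already been captured and decoded into (timestamp, kind, transaction ID) tuples; the function name, the one-second threshold and the sample events are all illustrative.

```python
def dns_problems(events, slow_after=1.0):
    """Pair DNS queries with responses by transaction ID.

    `events` is an iterable of (timestamp, kind, txn_id) tuples in time
    order, where kind is "query" or "response". Returns a tuple of
    (unanswered_ids, slow), where `slow` maps txn_id to seconds taken.
    """
    pending = {}  # txn_id -> time the query was seen
    slow = {}
    for timestamp, kind, txn_id in events:
        if kind == "query":
            pending[txn_id] = timestamp
        elif kind == "response" and txn_id in pending:
            elapsed = timestamp - pending.pop(txn_id)
            if elapsed > slow_after:
                slow[txn_id] = elapsed
    return sorted(pending), slow

# Hypothetical decoded capture.
events = [
    (0.0, "query", 101),
    (0.02, "response", 101),
    (1.0, "query", 102),
    (11.0, "response", 102),  # the 10-second answer
    (12.0, "query", 103),     # never answered
]
print(dns_problems(events))
```

A real implementation would also match on client and server addresses and ports, since transaction IDs alone are not unique across hosts, but the principle is no more complicated than this.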

Conclusion

Monitoring companies need to take a step back and consider whether their tools are really giving customers what they need. My feeling is: no, they're not – instead they're confining themselves to what they know and giving the infrastructure manager just enough to keep things running and react just about quickly enough.

But as I see it there's huge scope to add a number of user-oriented features that would place the first company to do it head and shoulders above the rest. ®