Cumulus Networks adds validation with NetQ
Agents that don't wait for polling
Cumulus Networks reckons netadmins need more than ping and traceroute to understand large-scale data centre networks, and is hoping its NetQ offering will fill that role.
NetQ was originally designed for the company's own Cumulus Linux distribution, but Cumulus has decided its telemetry, diagnostics, and validation software deserves broader distribution.
That way, CEO Josh Leslie told The Register, deployment of the NetQ agents can better mirror what's going on in modern data centres.
In bygone days, IP-based routing ended at routers; these days, software such as Free Range Routing has pushed it a lot further, onto the hosts themselves.
“Networking is increasingly bleeding into the hosts – you typically have one host with lots of routable containers or virtual machines,” he said, adding that any technology trying to track the state of the network benefits from a footprint that reaches into the hosts.
As the company explains in a blog post, the basis of NetQ is that it's a “telemetry-based fabric validation system”, designed specifically to replace polling-based agents.
Why is that necessary? Leslie told The Register's networking desk that the disciplines outfits like Google and Facebook have brought to data centre design do very well at making their bit-barns highly available, but that resilience introduces an unexpected side-effect.
“The biggest, most resilient networks are very redundant – we have customers that go in, nine months later, you look at the network, and 25 percent of the links are down”.
“The only time anyone looks is when there's a problem … I want to know at 1:45 on Monday when there was an outage, what was the state of all links, was there any BGP change, was there a link down?”
With telemetry streamed from every possible device into a distributed Redis database, Cumulus hopes to expose the occult failures before they start taking out applications.
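To make the "what was the state of all links at 1:45 on Monday" question concrete, here is a minimal sketch of a point-in-time lookup over streamed link events. It is not NetQ's schema or API: the `LinkHistory` class, the link names, and the integer timestamps are all hypothetical, and an in-memory structure stands in for the Redis store the article mentions.

```python
from bisect import bisect_right
from collections import defaultdict


class LinkHistory:
    """Hypothetical in-memory store of streamed link-state events."""

    def __init__(self):
        # link name -> parallel lists of timestamps and reported states
        self._times = defaultdict(list)
        self._states = defaultdict(list)

    def record(self, link, timestamp, state):
        """Append an event; events per link are assumed to arrive in time order."""
        self._times[link].append(timestamp)
        self._states[link].append(state)

    def state_at(self, link, timestamp):
        """Return the last state reported at or before timestamp, or None."""
        times = self._times.get(link, [])
        idx = bisect_right(times, timestamp)
        return self._states[link][idx - 1] if idx else None

    def down_links_at(self, timestamp):
        """List every link whose last known state at timestamp was 'down'."""
        return sorted(
            link for link in self._times
            if self.state_at(link, timestamp) == "down"
        )


history = LinkHistory()
history.record("leaf01:swp1", 100, "up")
history.record("leaf01:swp1", 250, "down")
history.record("leaf02:swp3", 120, "up")
print(history.down_links_at(300))
```

Because agents push each state change as it happens, the history can answer questions about a moment nobody was watching, which a poller sampling only on demand cannot do.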
That's NetQ's diagnostic mode. Leslie said that during development, the company found its customers were just as keen to use it as a validation system, one that lets users check their configuration during a production rollout and roll the network back if something unexpected happens.
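The validate-then-roll-back idea can be sketched in a few lines. This is an illustration only, not how NetQ implements it: the `rollout` function, the check names, and the dict-shaped configuration are all assumptions made for the example.

```python
def rollout(current, candidate, checks):
    """Return (active_config, failed_check_names).

    The candidate becomes active only if every check passes;
    otherwise the current config is kept, which stands in for a rollback.
    """
    failed = [name for name, check in checks.items() if not check(candidate)]
    return (current if failed else candidate), failed


# Hypothetical checks over a config represented as a plain dict.
checks = {
    "all_links_up": lambda cfg: all(s == "up" for s in cfg["links"].values()),
    "bgp_peers_present": lambda cfg: set(cfg["expected_peers"]) <= set(cfg["bgp_peers"]),
}

known_good = {
    "links": {"swp1": "up"},
    "expected_peers": ["spine01"],
    "bgp_peers": ["spine01", "spine02"],
}
broken = {
    "links": {"swp1": "down"},
    "expected_peers": ["spine01"],
    "bgp_peers": [],
}

active, failed = rollout(known_good, broken, checks)
print(active is known_good, failed)
```

Returning the names of the failed checks alongside the active configuration tells the operator why a rollout was rejected, not just that it was.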
That validation works down to Layer 1, as well as network topology, protocols and host environments, the company says. ®