Network monitoring is hard... If only there was some kind of machine that could learn to do it

*AI bursts through wall* 'OHHH YEAHHH!'


Comment It's difficult not to wish for the "good old days" when workloads stayed put, packets behaved, and firewalls just did basic port-level filtering. Admins knew where they stood.

The cloud has changed that. Workloads and the resources they run on have become malleable, virtual machines can be shunted from place to place, traffic volume has ballooned while traffic types have multiplied, and users can increasingly provision resources for themselves.

Exacerbating the situation, new applications demand high volume and low latency – high-definition video and gaming deliver lots of data quickly and are sensitive to latency – while IoT confounds traditional network monitoring, with devices connecting and disconnecting frequently.

Shervin Shirmohammadi, professor with the school of electrical engineering and computer science at the University of Ottawa and a member of the IEEE MMTC Multimedia Cloud Computing Interest Group, sums up the situation. "Some of these applications are highly interactive, so in those cases network monitors need to catch and fix issues causing poor service quite fast, in the order of 100ms, which is very difficult with old methods and tools."

No wonder analysts think the market for network performance monitoring and diagnostics is set to explode. Gartner thinks it is worth $2.2bn and is growing at a compound annual rate of 15.9 per cent. Market research rival Network Analytics is even more bullish, envisioning a $3.1bn market by 2022.

Driving this is the sheer scale of cloud – the proliferation of devices, greater data, more users and increasing uncertainty. Traditional network monitoring is backwards-looking, and manual dashboards will fail to scale as the number of devices, applications and packet routes rises.

If only there were some system that could learn from the present, respond accordingly and anticipate the challenges of the future – to head them off before they become challenges.

What is machine learning?

In network management, AI and machine learning have potential. Both thrive on data – they learn from it and grow – and there are few domains in IT that generate as much data as network management.

What they learn can be turned to our needs, and vendors in the field are beginning to add AI and ML capabilities to their products. These are being used to automate functions – detecting performance and availability problems in real time, identifying their root cause and speeding up recovery times. It's early days, but you can expect more to come.

AI, meet software-defined networking

The advent of software-defined networking (SDN) should mean a greater role for this data-driven, AI-based approach to network management. SDN separates the control layer from the physical switches underpinning the network, which carry on forwarding the data, enabling administrators to configure network operations from a central console.

Abstracting network control into a separate software-based layer makes it easier to collect large amounts of data from the infrastructure and to configure the network quickly. This will provide more fuel for data-slurping AI algorithms. It could also make it easier for them to reconfigure the network directly via software APIs.

This puts the automation of network functions on the map. Shirmohammadi expects the mitigation of problems and optimisation of traffic flow to become more automatic over time as AI algorithms become more adept at analysing historical data and human administrators begin trusting them more.

"AI network analytics can pinpoint a problem that is causing poor service, or allocate resources more efficiently as nodes join and leave at a large scale, both faster than what is possible today," Shirmohammadi told us.

"In the case of a network problem, in addition to finding it, it can also suggest a solution to the network operator, or, in advanced systems, even go ahead and apply the solution without human intervention."

As AI network management gets to this point, the dream is that it should free up fleshbags to concentrate on better architecture design and network strategies.

ClAIrvoyant systems

Machine learning applied to network management isn't just supposed to be good at spotting present problems. The idea is it should be good at predicting events, too. How? By having algorithms process large amounts of existing data and case history, alongside current traffic, to extrapolate patterns. This could happen in several ways in a network environment.

One scenario sees AI better understand how network parameters change over time, leading to more accurate capacity planning and enabling procurement teams to kick off provisioning processes at the right juncture.
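
As a rough illustration of the capacity-planning idea, here is a minimal sketch that fits a straight-line trend to daily link utilisation and estimates how many days remain before a threshold is crossed. The function names, the sample figures and the 80 per cent threshold are all assumptions made for the example, and a plain least-squares line stands in for whatever model a real product would use.

```python
def fit_trend(samples):
    """Least-squares line through samples taken at days 0, 1, 2, ..."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until(samples, threshold):
    """Days from the last sample until the trend line reaches threshold.

    Returns None when utilisation is flat or falling (no crossing ahead).
    """
    slope, intercept = fit_trend(samples)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (len(samples) - 1))


# Hypothetical daily utilisation of one link, in per cent.
utilisation = [40, 42, 44, 46, 48]
print(days_until(utilisation, threshold=80))
```

Real traffic is seasonal and bursty, so a straight line is only a starting point, but the shape of the question is the same: extrapolate from history and raise the flag early enough for procurement to act.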

Another could predict adverse events and anticipate their effects in advance, in a Minority Report-style precog scenario.

Experts think this may happen because predictive machine-learning algorithms thrive on all that data, which, in this case, is the historical stuff. The more data they have from the past, the better they become at suggesting what may happen in the future. Computer networks are among the most data-rich resources.

They constantly generate information about the devices and applications connecting to them and the information they are communicating. By collating that data and running it through a machine-learning model, you can potentially build up an accurate statistical model of how the network will react under certain conditions.
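
To make the "statistical model" idea concrete, here is a deliberately simple sketch: it treats the mean and standard deviation of past latency readings as the model, then flags new readings that stray too far from it. The function name, the sample latencies and the three-sigma limit are assumptions for the example; production systems would use much richer models.

```python
from statistics import mean, stdev


def find_anomalies(history, recent, z_limit=3.0):
    """Return the recent readings that deviate from the historical baseline.

    A reading is flagged when it lies more than z_limit standard
    deviations from the mean of the historical readings.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: anything different is unusual.
        return [value for value in recent if value != mu]
    return [value for value in recent if abs(value - mu) / sigma > z_limit]


# Hypothetical round-trip latencies in milliseconds.
history = [20, 21, 19, 20, 22, 18, 20, 21, 19, 20]
recent = [21, 35, 19]
print(find_anomalies(history, recent))
```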

Training day, and tomorrow, and the day after

This all sounds great, but there are challenges – and one of the biggest is training.

AI and ML have an insatiable demand for data. There's no such thing as universal AI or ML, though: you need to pick a model or framework for specific cases. Moreover, the more you want to drill down into how different applications affect the network, the more specific the data for your chosen model or framework will need to be.

"Different applications such as OTT video, IoT, gaming and conferencing need to have different models," Shirmohammadi said. "A training model that works for OTT video will likely not work well enough for IoT. So, an AI network analytics system needs to consider each application differently while also considering cross traffic."

Algorithms may also need to be tweaked – you can't run and forget them. Data scientists have to repeatedly alter and test algorithms to make sure they work properly and are generating the correct outputs, which means near-continuous training. AI training algorithms can, however, use incremental training techniques that avoid retraining from scratch, which can save time and effort.

"This would significantly speed up incremental additions to the model," Shirmohammadi added.

This tweaking may take more work given the many different data types that a competent AI-based network monitoring model would consume. For each application, an AI model can gather traffic traces from many layers, ranging from the application layer through to the transport, networking and physical layer, not to mention the node's own hardware and operating system.

Many metrics can be collected from each of these layers. Just at the transport layer you can collect more than 100 different types of metrics, such as the number of packets, window size, number of bytes in the payload and timestamp.

Developers and AI architects building on such systems must, therefore, be selective about which features they feed their models to overcome this challenge.
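
One simple way to be selective, sketched below, is to rank candidate metrics by how strongly each correlates with the thing you care about and keep only the top few. The metric names, sample values and choice of Pearson correlation are assumptions for the example; correlation only captures linear relationships, and real feature selection usually goes further.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_features(metrics, target, k):
    """Names of the k metrics with the largest absolute correlation to target."""
    scores = {name: abs(pearson(values, target)) for name, values in metrics.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical transport-layer metrics sampled over five intervals.
metrics = {
    "retransmits": [1, 2, 3, 4, 5],
    "window_size": [5, 5, 5, 5, 5],
    "payload_bytes": [3, 1, 4, 1, 5],
}
latency_ms = [10, 20, 30, 40, 50]
print(top_features(metrics, latency_ms, k=2))
```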

AI for network monitoring has promise but – you won't be surprised to learn – isn't a slam dunk. You'll need to dust off the marketing glitter and consider how much time and effort you want to invest in building, training and integrating these algorithms into large, complex cloud-based networks.

That said, AI does promise to deliver a healthy amount of insight and automation into what's happening and what could be about to happen in the increasingly big and varied networks of the cloud. ®

Biting the hand that feeds IT © 1998–2018