It's 2019. Network and application performance management ought to be married. But how? Hear us out – what about... AI?

Pros and cons of using machine learning to end the era of: 'Not our network's fault. Must be your code, mate'

Sponsored When it comes to IT management, applications are from Mars and network components are from Venus. The two are rarely monitored and managed together. Would you ask your network engineer to analyse some dodgy application logic in your ERP software while investigating a latency problem? Probably not.

Yet application performance management (APM) and network performance management (NPM) really do belong together. Industry analyst and consulting firm Enterprise Management Associates (EMA) asked 300 respondents [PDF] what interdependencies they viewed as critical for IT analytics. The largest number – 49 per cent – cited infrastructure and applications as the two elements that relied on each other the most.

This makes sense because network performance impacts application performance. So why aren't we doing them both together, in a way that lets us see the data from both? And could artificial intelligence help?

These monitoring tools produce data of different types, from different layers of the stack, making them different beasts. The network inhabits the lower layers of the traditional TCP/IP stack. The application protocols reside at the upper layers.

It's true that the two overlap. Network management tools like Paessler's PRTG Network Monitor consume application-layer protocols like SNMP, for example. But despite the overlap, APM offers different deliverables. It deals with issues like end-user experience monitoring (EUM), user-defined transaction profiling, and the analysis of transaction performance in different application contexts. It offers more analysis of the application logic (ie, what's being done on the server).

All this sits above the traditional monitoring functions you see in NPM. NPM products monitor the passage of network packets and analyse router and switch ports to understand how well they are handling that traffic. You may also find these tools measuring hardware components like CPUs and memory storage for a view of end-to-end network performance.

Another reason for the schism between application and network performance management is organisational. Companies may just not be structured to think about them together. If different teams look after the network and the applications, and have different incentives, then they may even be adversarial. "Performance problem? That's not our network. Must be your software, mate."

As APM and NPM evolve, the challenges for IT operations teams increase. Growing data volumes make it hard to keep up with what's happening on the network or with applications separately, let alone together. Gartner says [PDF] data from IT infrastructure and applications is increasing up to threefold each year. Meanwhile OpsRamp, in a survey of 200 IT pros [PDF], found that sorting the signal from the data noise was the biggest challenge for most (65 per cent) of respondents. How can companies handle that firehose of information while bringing application and network management under a single umbrella?

An automated AI future?

According to Gartner, the answer lies in artificial intelligence. It argues that the technology will not only unite previously separate parts of IT operations like APM and NPM, but will usher in a new era of automation.

Imagine software with an omniscient view of the network that can spot patterns in the data that humans may miss. Instead of waiting for a user to report problems accessing data, say, it could raise a trouble ticket itself before anyone realises the problem is emerging.

An even smarter, more sophisticated version of this system might reduce the load on human operators by taking action itself. This would involve using machine learning to identify the most successful mitigation and applying it.

It's all very futuristic, but is it realistic? Gartner certainly seems to think so. It came up with a name for this concept: Artificial Intelligence for IT Operations (AIOps).

AIOps uses machine learning to find patterns in data from the IT infrastructure. Machine learning is a data-hungry process, requiring vast amounts of data to train on, so that it can recognise what's normal and what isn't. These tools capitalise on the rising data volumes from IT infrastructure, slurping the metrics from monitoring tools including NPM and APM.

What data are we talking about? Gartner says everything from wire data to log events. In a statement that had us doing full-body eye-rolls, its 2018 Market Guide for AIOps Platforms [PDF] even imagines gathering social media-derived user sentiment data and folding that into the mix as these tools evolve.

A year earlier, in 2017, Gartner proclaimed that 25 per cent of global enterprises will use AIOps to support two or more major IT functions. It went further in that year's Market Guide for AIOps Platforms, predicting that by 2022, 40 per cent of large enterprises will combine big data and machine learning to support and partially replace monitoring. These companies will also replace elements of IT service desks, and it believes that they will also use AI to automate processes and tasks.

The appetite for AI-led APM and NPM integration is certainly there. The EMA survey found that companies were overwhelmingly interested in integrating AIOps across multiple IT operations domains. 65 per cent of them cited this as their main interest, with "big data stores for data search" running a distant second at 12 per cent.

How will this robot BOFH work, exactly?

Does this mean that companies will throw out those APM and NPM tools that they have spent years painfully building in isolation? Not likely. They're a sunk investment for many, and there may be more than one of each in the average large enterprise.

Instead, Gartner believes that AIOps will aggregate data from these systems, working in concert with them rather than replacing them. It will consume the results from specialist tools and use them for cross-domain analysis. So, that's a whole new category of tool for vendors to charge you for. Yay.

How will these AIOps tools work and what will they offer? As aggregation platforms, they will gather data from multiple sources including application and network monitoring tools. This puts the onus on IT teams to create an inventory of their monitoring tools, some of which may stretch back for decades. This is no mean feat; a CA survey of 100 IT pros found that 72 of them relied on up to nine different monitoring tools to support modern applications.

These tools will train machine learning algorithms on this data to arrive at baselines representing normal performance. They will then run the counterpart to AI training, known as inference, to analyse the combined APM and NPM data in two contexts: historical and real-time.
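
To make the train-then-infer idea concrete, here is a minimal sketch in Python. It is not how any particular AIOps product works: it assumes the "baseline" is nothing fancier than a mean and standard deviation learned from past readings, and that "inference" means checking how far a new reading strays from that baseline. The function names, the sample latency figures and the threshold of three standard deviations are all our own illustrative assumptions.

```python
from statistics import mean, stdev

def fit_baseline(samples):
    """'Training': learn a simple baseline (mean, standard deviation) from history."""
    return mean(samples), stdev(samples)

def is_anomalous(value, baseline, threshold=3.0):
    """'Inference': flag a value more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat history: anything different counts as unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical historical network latency readings, in milliseconds.
history_ms = [20, 22, 19, 21, 20, 23, 18, 21, 22, 20]
baseline = fit_baseline(history_ms)

# New readings arriving in real time; only the last one is far off the baseline.
new_readings_ms = [21, 24, 95]
flags = [is_anomalous(v, baseline) for v in new_readings_ms]
print(flags)
```

A real platform would use far richer models and would have to cope with seasonality, trends and missing data, but the shape is the same: learn what normal looks like, then score what comes in against it.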

Historical analysis will enable these tools to detect emerging patterns of concern over time in application and network performance data. Real-time analysis can highlight emerging issues that someone must deal with right now.

These tools can detect performance anomalies and correlate seemingly unconnected events that could develop into systemic problems if left unchecked. But Gartner argues that the real benefit will come in the use of AI to automate IT service management (ITSM) tasks. By marrying the high-volume data generated by IT operations with machine learning, Gartner imagines that companies will be able to generate trouble tickets automatically, and perhaps even resolve some of them without human intervention.
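
As a rough illustration of what correlating network and application data could look like, the sketch below computes a Pearson correlation between a network metric and an application metric sampled over the same intervals. This is a deliberately simple stand-in for whatever correlation techniques a commercial tool might use, and the metric names and figures are invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        # A constant series carries no information about co-movement.
        return 0.0
    return cov / (spread_x * spread_y)

# Hypothetical samples taken over the same six intervals.
packet_loss_pct = [0.1, 0.2, 0.1, 1.5, 2.0, 0.2]   # from an NPM tool
response_time_ms = [120, 125, 118, 480, 610, 130]  # from an APM tool

r = pearson(packet_loss_pct, response_time_ms)
print(f"correlation: {r:.2f}")
```

A value close to 1 suggests the two metrics rise and fall together, which is a hint worth a human's attention rather than proof that the network is to blame.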

Explaining how this concept differs from regular data analytics, the analyst firm describes a workflow in which the AIOps layer processes aggregate data from specialist monitoring tools like APM and NPM, creating sets of prescriptive rules to mitigate problems. It then talks to ITSM tools to kick off trouble tickets relating to the problem. Then, in the most advanced scenarios, it integrates with IT automation tools to kick off mitigation workflows that follow its own suggestions.
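
The workflow described above can be sketched in a few lines of Python. Everything here is hypothetical: the `Finding` and `AIOpsPipeline` names, the one-to-five severity scale, and the rule that only low-severity findings are remediated automatically are our assumptions, not anything Gartner or a vendor specifies. Real ITSM and automation integrations would go through those tools' own APIs rather than in-memory lists.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Finding:
    source: str                          # which specialist tool reported it, e.g. "NPM" or "APM"
    summary: str
    severity: int                        # 1 (low) to 5 (critical); the scale is an assumption
    suggested_fix: Optional[str] = None  # the prescriptive rule, if the analysis produced one

@dataclass
class AIOpsPipeline:
    auto_fix_max_severity: int = 2
    tickets: List[Tuple[int, str]] = field(default_factory=list)  # stand-in for an ITSM tool
    actions: List[Tuple[int, str]] = field(default_factory=list)  # stand-in for an automation tool

    def handle(self, finding: Finding) -> int:
        # Step 1: every finding becomes a trouble ticket.
        ticket_id = len(self.tickets) + 1
        self.tickets.append((ticket_id, f"[{finding.source}] {finding.summary}"))
        # Step 2: only low-severity findings with a suggested fix are acted on automatically.
        if finding.suggested_fix is not None and finding.severity <= self.auto_fix_max_severity:
            self.actions.append((ticket_id, finding.suggested_fix))
        return ticket_id

pipeline = AIOpsPipeline()
pipeline.handle(Finding("APM", "Checkout transactions slowing", 2, "recycle app worker pool"))
pipeline.handle(Finding("NPM", "Core switch port dropping packets", 5, "fail over to standby link"))
print(pipeline.tickets)
print(pipeline.actions)
```

Both findings raise tickets, but only the low-severity one triggers an automated action. Capping automation by severity is one cautious way to keep a human in the loop for anything that could take the network down.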

This is a nascent use case right now: Gartner says just six per cent of its AIOps-related customer interactions concerned IT service management use cases, but the company clearly considers this a growth area.

AI-based IT operations in the real world

Would this be better than the manual scenarios that we have today? If tools vendors can pull it off, then it would help to eliminate manual bottlenecks and spot problems that human operators might miss. A successful program would reduce event noise by triaging inconsequential infrastructure and application incidents, leaving only the important stuff to worry about.
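
Noise reduction of this kind does not have to start with machine learning. As a minimal sketch, assuming events arrive as dictionaries with hypothetical `source`, `message` and `severity` fields, the Python below drops low-severity events and collapses repeats of the same event into a single line with a count. The field names, the severity scale and the cut-off are all illustrative.

```python
from collections import Counter

def triage(events, min_severity=3):
    """Drop events below `min_severity` and collapse duplicates into one entry with a count."""
    counts = Counter(
        (event["source"], event["message"])
        for event in events
        if event["severity"] >= min_severity
    )
    return [
        {"source": source, "message": message, "count": count}
        for (source, message), count in counts.items()
    ]

# Hypothetical event stream from network and application monitoring tools.
events = [
    {"source": "NPM", "message": "port flap on sw-01", "severity": 4},
    {"source": "NPM", "message": "port flap on sw-01", "severity": 4},
    {"source": "APM", "message": "debug log rotated", "severity": 1},
    {"source": "APM", "message": "checkout latency high", "severity": 3},
    {"source": "NPM", "message": "port flap on sw-01", "severity": 4},
]

important = triage(events)
for item in important:
    print(item)
```

Five raw events become two lines for the operator to read. The hard part, and the bit vendors are pitching AI for, is deciding what counts as inconsequential without someone hand-tuning the threshold.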

Gartner also sees potential gains in predictive event alerting, where the system sounds the smoke alarm before the fire has taken hold, giving admins the chance to put it out (or perhaps putting it out for them). Other people's numbers bear this out; CA's survey found that predictive analytics was the most popular projected benefit of AIOps, with 65 per cent of respondents hoping for a digital crystal ball.

The real question is whether these advanced use cases are possible. There are some challenges to overcome along the way. Two such challenges are data quality and skill.

Training machine learning systems isn't magic; it's just hard work and statistical manipulation. It needs clean data, meaning that someone has to scrub and validate the information from the NPM and APM systems underpinning it.

Then, someone has to keep training the machine learning algorithm, verifying the results, tweaking the data and the algorithm, and then retraining it to get the best possible results. That isn't something your network manager will do, and it's unlikely that a canned AIOps solution will do this for you out of the box, no assembly required.

Finally, let's be realistic about what we trust AI to do. Filtering the most likely possible event correlations and anomalies and surfacing them to a human operator for further analysis is one thing, and we see some AI-based tools doing this already in the cybersecurity realm. But trusting an AI bot to conduct root cause analysis and then correctly fix the problem on its own is a scary prospect. Get something wrong, as brainless AI engines often do, and your network could end up looking like the workshop in Goethe's Sorcerer's Apprentice.

AIOps-based automation may be an appealing idea in theory, but its rollout will be incremental, and you shouldn't expect to replace your BOFH with HAL just yet. But just as with many AI use cases, it holds possibilities. It'll just take some nurturing and gentle experimentation to see how far an IT team can take the concept.

Sponsored by Paessler.

Biting the hand that feeds IT © 1998–2019