You say, 'AI'. I say, 'Machine learning'. They say, 'Cybersecurity'... What does it all mean?
For one thing: an automated hacker-repelling security geek it is not
Sponsored “Machine learning is eating the world,” writes Clarence Chio in Machine Learning and Security, co-authored with David Freeman, who heads a team of ML engineers charged with detecting and preventing fraud and abuse across LinkedIn. Chio goes on to write that “Cybersecurity is also eating the world,” and he has a point.
The UK’s National Cyber Security Centre (NCSC) claimed, on its second anniversary in October 2018, it had stopped more than 10 attacks per week, primarily from hostile nation states.
Perhaps unsurprisingly, the rise in threats has led to a boom in sales of cybersecurity software: it will be a $248bn (£194bn) industry by 2023, according to MarketsandMarkets research.
Within this, the future for machine learning is bright. A PricewaterhouseCoopers 2018 AI survey predicted 27 per cent of executives would invest in cybersecurity safeguards that employ ML and AI. Yet, there’s confusion.
A Narrative Science 2015 survey reportedly found people at large tend to think of AI in rather simplistic terms – 31 per cent see it as technology that “thinks and acts like humans” – while a range of “similar sounding” terms were being employed interchangeably when talking about AI despite the differences that exist between them. Meanwhile, nearly two fifths of those in B2B marketing – those who’d be responsible for branding and packaging companies’ enterprise technology products – are unclear about the differences between AI, machine learning, and predictive modeling.
The common misperception of AI in cybersecurity is of an automated security geek that can instinctively rebuff any nonsense approaching the network. In reality, this will not, and should not, happen.
AI is a broad concept that looks to mimic humans – at least the thinking skills. ML is a branch of AI where a machine focuses on specific tasks, using and adapting algorithms depending on the data received. In cybersecurity terms, this translates as a machine that can predict threats and identify anomalies.
ML is already providing value in simple security tasks and elevating suspicious events for human analysis. As Chio suggests in his book: “Spam fighting has been one of the oldest problems in computer security and one that has been successfully attacked with machine learning.” Interestingly, Google claimed last year that it has a 99 per cent accuracy rate in blocking spam. Machine learning software can also flag up suspected phishing domains.
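To make the spam example concrete, here is a minimal sketch of one classic technique, a naive Bayes text classifier. It is an illustration only, not the method Google or the book's authors use; the training messages and the names train, classify and MODEL are invented for the example.

```python
import math
from collections import Counter

def train(messages):
    """Count words per class. messages is a list of (text, label) pairs,
    where label is 'spam' or 'ham' (hypothetical toy data)."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label with the higher naive Bayes log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Start from the log prior: how common the class is in training.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Assumption: a tiny, made-up training set purely for illustration.
TRAINING = [
    ("win a free prize now", "spam"),
    ("claim your free money", "spam"),
    ("free prize waiting claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
    ("please review the project agenda", "ham"),
]
MODEL = train(TRAINING)

print(classify("free prize", MODEL))
print(classify("monday meeting agenda", MODEL))
```

A production filter would train on millions of messages and use far richer signals than word counts, but the principle of learning from labelled examples is the same.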
As well as spam prevention and phishing blocking, ML is also touted as a method for detecting malware, speeding up the scanning process and using behavioural analysis to spot malicious software that has evaded signature detection.
But here we seem to be hitting some limitations. Behavioural analysis has been deployed in banking to help fight fraud, where the system captures, analyses, and processes vast quantities of data to “learn” which transactions are legitimate, and therefore which ones are bad. A similar approach can be applied to malware.
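The behavioural idea described above can be sketched in a few lines: build a baseline from past activity, then flag anything that strays too far from it. The example below uses a simple z-score test on transaction amounts. The history values, the threshold of three standard deviations and the function names are assumptions chosen for illustration, not how any particular bank does it.

```python
from statistics import mean, stdev

def fit_baseline(amounts):
    """Learn what 'normal' looks like: the mean and standard deviation
    of a customer's past transaction amounts."""
    return mean(amounts), stdev(amounts)

def is_anomalous(amount, baseline, threshold=3.0):
    """Flag a transaction whose amount lies more than `threshold`
    standard deviations from the learned mean."""
    mu, sigma = baseline
    if sigma == 0:
        # All past amounts were identical, so any difference is unusual.
        return amount != mu
    return abs(amount - mu) / sigma > threshold

# Assumption: a made-up spending history for one account.
HISTORY = [42.0, 38.5, 45.0, 40.0, 39.5, 44.0, 41.0, 43.5]
BASELINE = fit_baseline(HISTORY)

print(is_anomalous(43.0, BASELINE))   # a typical purchase
print(is_anomalous(950.0, BASELINE))  # far outside the usual range
```

Real systems model many more behaviours than a single amount, which is part of why the data volumes involved become so large.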
Banking fraud, however, is small fry compared to the world of malware, at least in terms of the number of incidents to track and analyze. As Datanami suggests, there are upwards of one million pieces of malware released into the wild every day. So a machine-learning system would need to scale out to meet this demand, while perhaps employing a complex deep-learning model that adapts and grows as its malware prey evolves.
Welcome back, hardware
Another issue in the machine-learning story is hardware. Your ML system will feed on vast datasets, creating a challenge for those rolling out machine learning in terms of processing and storage. This has raised the profile of GPUs and similar accelerators that can process in parallel the large volumes of matrix math machine-learning code throws at them. Crucially, these are much more efficient at this job than general-purpose CPUs.
Then there’s storage. Flash, which promises low latency and high throughput compared to that enterprise staple of disk, is being promoted by vendors as the best answer to ML storage – with the proviso that performance will depend upon the implementation of your storage subsystem. Flash is about one thousand times faster than disk, which has a latency of tens of milliseconds.
But flash remains relatively expensive, which for many rules out all-flash systems dedicated to machine-learning-based cybersecurity. There is the cloud, and flash is offered by some providers there.
Peter Firstbrook, Gartner research vice president, has pointed to the general shift to the cloud as an opportunity for organisations to exploit machine learning in solving “multiple” security issues, such as adaptive authentication, insider threats, malware, and advanced attackers.
He believes this trend will strengthen, to the extent that by 2025 machine learning will be a normal part of security systems, helping to offset ever-increasing skills and staffing shortages.
But the cost of flash remains a factor – even for service providers. That’s seen them rely on disk optimisation using extra processing and memory management, with flash as an added option. Providers like Amazon, Azure and Google are, meanwhile, offering GPU-powered virtual machines.
Machine-learning arms race
With all this in mind and with so many prepared to invest in ML for cybersecurity, according to PwC, it’s worth considering a measured approach, and deploying machine learning as part of a multi-layered defense.
We know ML works against spam and phishing attacks, but it is a bit too early for machine learning to be accurate enough and fully trusted in malware detection and prevention. In this setting, newer ideas like behavioural analysis and sandboxing powered by ML should be employed in combination with tried-and-trusted techniques, such as firewalls, intrusion detection and prevention, and web and email gateways.
Nor will ML replace humans when it comes to cyber security – so, sorry, no automated geek. Humans remain – for now, at least – far better than machines at absorbing context and thinking creatively, meaning ML can bolster the activities of humans in IT sec.
Another reason not to depend on ML springs from the belief that algorithms and models could actually make things worse for cybersecurity pros. One way is through data poisoning, where attackers pollute the training data that’s publicly available and that is consumed by ML models during training. This would mean hackers can subvert systems destined to be responsible for recognising what’s “good” or “bad.”
This risk is compounded when we remove humans from the ML training equation – we lose the ability to confirm the machine model is doing the right thing.
It’s important, therefore, to ensure machine learning output is human readable, so that systems’ decisions can be scrutinised. Too much ML is black box stuff, producing results without explaining the “how” or the “why”, which are critical in helping people confirm accuracy and improve results.
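The data poisoning risk described above is easy to demonstrate on a toy model. The sketch below trains a deliberately simple classifier, a threshold set halfway between the average “good” and “bad” scores, first on clean data and then on data an attacker has seeded with mislabelled samples. The scores, labels and function names are all invented for illustration.

```python
from statistics import mean

def train_threshold(samples):
    """Fit a one-dimensional classifier: the decision threshold is the
    midpoint between the mean 'good' score and the mean 'bad' score."""
    good = [score for score, label in samples if label == "good"]
    bad = [score for score, label in samples if label == "bad"]
    return (mean(good) + mean(bad)) / 2

def predict(score, threshold):
    """Scores above the threshold are treated as bad."""
    return "bad" if score > threshold else "good"

# Assumption: hypothetical suspicion scores with trustworthy labels.
CLEAN = [(1.0, "good"), (2.0, "good"), (3.0, "good"),
         (7.0, "bad"), (8.0, "bad"), (9.0, "bad")]

# The attacker slips high-scoring samples into the training set,
# falsely labelled as good.
POISON = [(8.0, "good"), (9.0, "good"), (9.0, "good")]

CLEAN_THRESHOLD = train_threshold(CLEAN)
POISONED_THRESHOLD = train_threshold(CLEAN + POISON)

# The same borderline sample is judged differently by the two models.
print(predict(6.0, CLEAN_THRESHOLD))
print(predict(6.0, POISONED_THRESHOLD))
```

Three bad labels are enough to drag the threshold upwards so that a sample the clean model would have blocked now passes. Without a human checking what the model has learned, nobody notices the shift.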
Filling the gaps
In its report Exploring the Long Tail of (Malicious) Software Downloads, Trend Micro surveyed historical software download data and found it was able to identify less than 17 per cent of 1,791,803 fetched software files as either benign or malware laden. Translated: the contents of more than 83 per cent of downloads were still completely unknown two years after they’d first been observed.
To tackle this, Trend Micro built a rules-based machine-learning classification system to mine downloads and produce human-readable file classifications so humans can determine whether downloads were benign or malicious. According to Trend Micro, its system not only increased the number of samples labelled by a factor of 2.3 but, by being human readable, let analysts interpret and verify results.
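What “human readable” means in practice can be shown with a small sketch. This is not Trend Micro's system: the rules, the feature names (signed, domain_malware_history, signer_malware_history) and the function classify_download are hypothetical, chosen only to show a classifier that returns its reasoning alongside its verdict.

```python
# Each rule is (plain-English description, label, test on file features).
# Assumption: both rules and feature names are invented for illustration.
RULES = [
    ("unsigned file from a domain previously seen serving malware",
     "malicious",
     lambda f: not f["signed"] and f["domain_malware_history"]),
    ("signed by a publisher with no history of signing malware",
     "benign",
     lambda f: f["signed"] and not f["signer_malware_history"]),
]

def classify_download(features):
    """Return (label, reason) for the first rule that matches, so an
    analyst can see why the file was labelled the way it was."""
    for description, label, predicate in RULES:
        if predicate(features):
            return label, description
    return "unknown", "no rule matched"

sample = {
    "signed": False,
    "domain_malware_history": True,
    "signer_malware_history": False,
}
label, reason = classify_download(sample)
print(f"{label}: {reason}")
```

Because each verdict comes with the rule that produced it, an analyst can challenge a bad rule directly rather than guessing at what a black box has learned.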
Beyond data poisoning, there exists the fear that attackers will simply turn the power and scale of ML against us. We have, for example, already seen malware attempt to figure out whether it was running in a virtual or a test environment, or on a real endpoint, before activating the attackers’ desired action.
Chio warns that there is nothing stopping the cyber criminals turning the tables and using ML to automatically tailor phishing messages to fit interests gleaned from social media.
When talking AI in cyber security, then, it’s important to step back from the hype and see things for what they presently are: it’s machine learning doing the heavy lifting, and ML should be employed as part of a layered security system – one that draws on a range of systems and protections and that retains human input.
When building machine learning into threat prevention, we must also consider the possibility that our adversaries are employing ML too. Failure to do so could see malware writers get the upper hand in this battle.
Sponsored by: Trend Micro.