Original URL: https://www.theregister.com/2006/06/10/machines_analyse_malware/

Researchers eye machines to tackle malware

Automation eliminates human error

By Robert Lemos, SecurityFocus

Posted in Security, 10th June 2006 07:02 GMT

Reverse engineer Thomas Dullien - better known amongst security researchers by his nom de plume, Halvar Flake - created an automated system for classifying software into groups, a task for which he believes machines are much better suited than humans.

Research using the system has underscored the sometimes-arbitrary decisions humans make in classifying malicious programs, he said. Among other anomalies, he found that Sasser.D has only a 69 per cent correlation to previous members of the Sasser family, while two separately named examples of bot software, Gobot and Ghostbot, are more similar to each other than that.

"It's like putting donkeys and bunnies in the same class because they both have long ears," Dullien, the founder and CEO of reverse-engineering tool maker Sabre Security, said in a recent interview.

The current problems with classifying and naming viruses are among the reasons that automated classification technology has once again become a focus of research. The plethora of names for specific malicious programs has caused confusion amongst consumers, despite a project that seeks to provide guidance, if not to consumers, then at least to software analysts and incident responders.

In January, when a new computer virus appeared on the internet, anti-virus companies rushed to issue alerts and inundated consumers with a confusing array of names: Blackmal, Nyxem, MyWife, KamaSutra, Blackworm, Tearec and Worm_Grew all describe the same mass-mailing computer virus.

Several research projects hope to improve upon that record.

Last month, at the annual conference of the European Institute for Computer Anti-Virus Research (EICAR), Microsoft released early results of its development of a system to automate classification of malicious software based on the actions performed by the code at runtime.

"A significant challenge we have today is the large number of active malware samples, totaling on the order of tens of thousands, and increasing rapidly," Microsoft researcher Tony Lee said in a recent blog posting following the conference. "It has become apparent to us that the traditional manual analysis process is not adequate in dealing with malware of this order of magnitude, and that we should seek automation technologies to aid human analysts."

The researchers modeled a piece of malicious software as the series of actions that the software takes at the operating system level. Referred to as "events" in a paper written by Lee and anti-malware program team manager Jigar Mody, the actions can include data copying, changing registry keys and opening network connections.

The researchers then trained a recognition engine using an adaptive clustering algorithm - similar to self-organising maps - and classified a previously unseen subset of malware using the trained system. Using more events typically resulted in better classification. When the software samples were classified based on 100 events, accuracy fell below 80 per cent, while classification based on 500 and 1,000 events typically had accuracy rates above 90 per cent.
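The idea of treating a program as a sequence of operating-system events, cut off at a fixed event limit, can be sketched in a few lines of Python. This is an illustration only and not Microsoft's system: it assumes that sequences are compared by edit distance and that a sample is assigned to the nearest known family exemplar, and the family names and event labels are invented for the example.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two event sequences."""
    prev = list(range(len(b) + 1))
    for i, event_a in enumerate(a, start=1):
        curr = [i]
        for j, event_b in enumerate(b, start=1):
            cost = 0 if event_a == event_b else 1
            curr.append(min(prev[j] + 1,          # drop event_a
                            curr[j - 1] + 1,      # insert event_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def distances(sample: Sequence[str],
              exemplars: Dict[str, List[str]],
              event_limit: int) -> Dict[str, int]:
    """Distance from the sample to each family, using only the first
    `event_limit` events of every sequence."""
    head = sample[:event_limit]
    return {family: edit_distance(head, events[:event_limit])
            for family, events in exemplars.items()}


def classify(sample: Sequence[str],
             exemplars: Dict[str, List[str]],
             event_limit: int) -> str:
    """Assign the sample to the family with the closest exemplar."""
    scores = distances(sample, exemplars, event_limit)
    return min(scores, key=scores.get)


# Hypothetical families and event labels, invented for illustration.
EXEMPLARS = {
    "family_a": ["copy_file", "set_registry_key", "open_connection", "send_mail"],
    "family_b": ["copy_file", "set_registry_key", "open_connection", "scan_ports"],
}
UNKNOWN = ["copy_file", "set_registry_key", "open_connection",
           "scan_ports", "scan_ports"]

print(distances(UNKNOWN, EXEMPLARS, 3))  # both families tie at this limit
print(distances(UNKNOWN, EXEMPLARS, 5))  # the later events separate them
print(classify(UNKNOWN, EXEMPLARS, 5))
```

With only three events the unknown sample is equally close to both families, because the behaviour that distinguishes them has not yet been observed. Raising the limit brings the distinguishing events into view, which is consistent with the accuracy trend the researchers reported.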

Reverse engineer Dullien takes a different approach. Working with other researchers at Sabre Security, he used automated tools to deconstruct the actual code of virus and bot software, removing any common libraries that the code might use and then comparing the relationships between functions to characterise the software.
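A rough sketch of that comparison, in Python, might look as follows. It is not Sabre Security's method: it assumes each program has already been reduced to a set of caller-callee pairs, uses Jaccard similarity as a stand-in measure, and relies on invented function names. Real binaries rarely carry names, so the labels here stand in for whatever fingerprint identifies a function.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def strip_library_edges(edges: Set[Edge], library: Set[str]) -> Set[Edge]:
    """Drop every call edge that touches a known library function."""
    return {(caller, callee) for caller, callee in edges
            if caller not in library and callee not in library}


def jaccard(a: Set[Edge], b: Set[Edge]) -> float:
    """Shared edges divided by all edges; 1.0 means identical graphs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical library and call graphs, invented for illustration.
LIBRARY = {"printf", "vfprintf"}
BOT_A = {("main", "spread"), ("main", "connect_irc"),
         ("spread", "printf"), ("printf", "vfprintf")}
BOT_A_VARIANT = {("main", "spread"), ("main", "connect_irc"),
                 ("main", "keylog"), ("keylog", "printf"),
                 ("printf", "vfprintf")}
UNRELATED = {("main", "encrypt_files"), ("encrypt_files", "printf"),
             ("printf", "vfprintf")}

raw_unrelated = jaccard(BOT_A, UNRELATED)
clean_unrelated = jaccard(strip_library_edges(BOT_A, LIBRARY),
                          strip_library_edges(UNRELATED, LIBRARY))
clean_variant = jaccard(strip_library_edges(BOT_A, LIBRARY),
                        strip_library_edges(BOT_A_VARIANT, LIBRARY))

print(raw_unrelated, clean_unrelated, clean_variant)
```

In this toy data the unrelated program looks slightly similar to the bot only because both link the same library code. Once the library edges are removed, that spurious overlap disappears while the genuine variant still shares most of its structure.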

In a test case using a database of 200 samples of bot software, the automated process produced two major families of code, three smaller groups, and several pairs and singletons. The system also identified variants of bot software not recognised by a signature-based anti-virus system.
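One simple way such families, pairs and singletons could fall out of pairwise similarity scores is to link any two samples whose score clears a threshold and read off the connected groups. The Python sketch below assumes exactly that; the article does not say how Sabre Security's tools form their groups, and the sample names, scores and threshold are invented.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def group_by_similarity(names: List[str],
                        scores: Dict[FrozenSet[str], float],
                        threshold: float) -> List[List[str]]:
    """Link samples whose similarity meets the threshold, then return
    the connected groups, largest first."""
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Walk up to the group's representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        # Pairs missing from `scores` are treated as unrelated.
        if scores.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical samples and scores, invented for illustration.
NAMES = ["s1", "s2", "s3", "s4", "s5", "s6"]
SCORES = {
    frozenset(("s1", "s2")): 0.90,
    frozenset(("s2", "s3")): 0.80,
    frozenset(("s4", "s5")): 0.75,
}

GROUPS = group_by_similarity(NAMES, SCORES, threshold=0.7)
print(GROUPS)
```

Here s1 and s3 end up in the same family even though they were never directly matched, because each resembles s2. That chaining is a property of this linking rule and may or may not reflect how a production system behaves.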

Dullien believes that static analysis is a better approach to malware classification than Microsoft's runtime analysis. Actions that a malicious program does not perform right away - known as time-delayed triggers - can foil runtime analysis, he said. And virus and attack-tool writers could add a few lines of code to a program to confuse runtime analysis, he added.

"The approach presented in the paper can be trivially foiled with very minor high-level-language modifications in the source of the program," he stated in a blog entry analysing Microsoft's system.

Microsoft declined to make its researchers available for interviews. However, in the paper, the authors argued that a combination of both static analysis and runtime analysis would likely perform best. For example, static analysis appears to deliver results more quickly; Microsoft's behavioral classification requires three hours to cluster 400 files at the 1,000 event limit, according to the paper.

In some ways, software classification resembles the state of biological classification back in the time of Carl Linnaeus. The 18th-century botanist pushed the scientific community of his day into accepting a hierarchical classification system for plants and animals. However, early classifications relied on external similarities, much in the way that many of today's classifications rely on external attributes of programs rather than their internal processes.

At least one other project hopes to help human analysts do a better job of classification.

OffensiveComputing.net, a project founded by researchers Val Smith and Danny Quist, aims to create a database of malware that records a number of basic attributes of the code, including checksums, anti-virus scanner results, and what type of packer the malware uses to compress itself. The project started in response to three trends, OffensiveComputing's Smith said: increased code sharing amongst virus and attack-tool writers, faster development of exploits, and faster incorporation of those exploits into existing malicious software.

"The biggest benefit is more rapid response to complex threats. As the synergy between viruses, Trojans, worms, rootkits and exploits grows, waiting for a solution becomes more dangerous."

OffensiveComputing's database gives incident response workers and analysts access to meaningful data about malicious software, which is especially necessary until automated analysis programs, such as Microsoft's and Dullien's classification systems, mature. The project strives to be adaptable, involve the community, have measurable results, and remain open, Smith said.
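A record of the kind such a database might hold can be sketched with Python's standard library. The field names, scanner name and detection label below are assumptions made for illustration and do not describe OffensiveComputing's actual schema; only the checksum calculations use real, well-known algorithms.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SampleRecord:
    """Basic attributes of one sample (hypothetical field names)."""
    md5: str
    sha256: str
    size_bytes: int
    scanner_results: Dict[str, str] = field(default_factory=dict)
    packer: Optional[str] = None  # None until a packer is identified


def describe_sample(data: bytes) -> SampleRecord:
    """Compute the checksums that identify a sample regardless of the
    name any vendor gives it."""
    return SampleRecord(
        md5=hashlib.md5(data).hexdigest(),
        sha256=hashlib.sha256(data).hexdigest(),
        size_bytes=len(data),
    )


# Placeholder bytes stand in for a real sample.
record = describe_sample(b"placeholder bytes standing in for a sample")
record.scanner_results["ExampleScanner"] = "Hypothetical.Bot.A"  # invented
record.packer = "UPX"  # illustrative value only
print(record)
```

Because the checksums depend only on the file's bytes, two analysts holding the same sample arrive at the same key even when their anti-virus products report different names for it.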

"There is an arms race going on between analysts and malware authors, so any solution will have to keep pace with advances on both sides."

This article originally appeared in SecurityFocus.

Copyright © 2006, SecurityFocus