Data-mining technique outs authors of anonymous email
Unmasking trolls, one 'write-print' at a time
Engineers and computer scientists say they have devised a novel method for identifying authors of anonymous emails that's reliable enough to be used in courts of law.
In a series of papers published over the past few years, the researchers from Concordia University in Montreal have described what they say is the first-ever data-mining algorithm for identifying the most plausible author of an anonymous email. It works by establishing a “write-print” of each suspected author by quantifying unique patterns in each individual's email writings. It can be used to unmask authors of emails used in spam, phishing, cyberbullying and other types of offenses.
“Our insight is that the write-print of an individual is the combinations of features that occur frequently in his/her written emails,” the researchers wrote in a paper (PDF) first published in the journal Digital Investigation. “The commonly used features are lexical, syntactical, structural and content-specific attributes. By matching the write-print with the malicious email, the true author can be identified.”
Characteristics include word usage, word sequence, common spelling and grammatical mistakes, vocabulary richness, hyphenation and punctuation.
The new approach differs from previous methods by filtering out characteristics found in two or more of the suspects' writing styles. So-called decision-tree methods, by contrast, often attempt to use the same set of features to deduce the write-print of different suspects. By excluding the styles that multiple suspects share, the technique attempts to generate a unique signature for each potential author under investigation.
At the heart of the method is an algorithm known as AuthorMiner. It mathematically extracts frequent patterns found in suspects' emails and then filters out those that are common to other suspects. It then compares the anonymous email with each of the generated write-prints to identify the closest match.
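The three steps described above (extract frequent patterns, drop the ones suspects share, match the anonymous email) can be illustrated with a toy Python sketch. This is not the researchers' AuthorMiner code. It makes several simplifying assumptions: the only features are lowercase words and a handful of punctuation marks, a "pattern" is a single feature rather than a combination of features, and the match score is a plain overlap count. All function names, the `min_support` parameter and the sample emails are hypothetical.

```python
import re
from collections import Counter


def features(email):
    """Toy stylometric features: lowercase words plus a few punctuation marks."""
    words = re.findall(r"[a-z']+", email.lower())
    puncts = re.findall(r"[!?;:,\-]", email)
    return {f"w:{w}" for w in words} | {f"p:{p}" for p in puncts}


def frequent_features(emails, min_support):
    """Features appearing in at least min_support (a fraction) of one suspect's emails."""
    counts = Counter()
    for email in emails:
        counts.update(features(email))  # each feature counted once per email
    threshold = min_support * len(emails)
    return {feat for feat, n in counts.items() if n >= threshold}


def write_prints(corpus, min_support=0.5):
    """Per-suspect frequent features, minus any that another suspect also has."""
    frequent = {a: frequent_features(es, min_support) for a, es in corpus.items()}
    prints = {}
    for author, feats in frequent.items():
        shared = set().union(*(f for a, f in frequent.items() if a != author))
        prints[author] = feats - shared
    return prints


def most_plausible_author(anonymous_email, prints):
    """Pick the suspect whose write-print overlaps most with the anonymous email."""
    observed = features(anonymous_email)
    return max(prints, key=lambda author: len(prints[author] & observed))


# Hypothetical sample data, two emails per suspect.
corpus = {
    "alice": ["Hello!! See you at the office!!", "Hello!! Thanks for the help!!"],
    "bob": ["Regards; the report is attached.", "Regards; the meeting is moved."],
}
prints = write_prints(corpus, min_support=1.0)
guess = most_plausible_author("Hello!! Is the server down?", prints)
```

In this sample both suspects use the word "the" in every email, so it is removed from both write-prints and carries no weight when the anonymous email is matched. A fuller version would mine combinations of features and weigh matches by how strongly each pattern is supported, which this sketch does not attempt.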
To test the method, they used it on a set of more than 200,000 emails written by 158 employees of Enron before the energy company was exposed for financial fraud. When finely tuned, the technique identified the author about 80 percent of the time.
Additional papers from the researchers – who include Farkhund Iqbal, Rachid Hadjidj, Benjamin Fung, and Mourad Debbabi – are available here. ®
Correct 80% when finely tuned.
So, wrong 20% when finely tuned and even more wrong when not in perfect lab conditions.
So, hanging at least 1 in 5 innocent men is OK then....... FAIL as this should *never* be accepted as evidence in court!
...... you underestimate its accuracy. Apparently they tested it on the message boards of the Daily Mail, and it correctly identified that 87.4% of the postings had been written by the Twat-O-Tron.
So, let's see...
"When finely tuned, the technique identified the author about 80 percent of the time."
In other words, they think a 20% failure rate is "reliable enough to be used in courts of law"?
Well, in combination with other evidence it might be, I suppose. But given the "believe anything the computer says" attitude of some people I doubt it.