Original URL: http://www.theregister.co.uk/2006/08/30/online_anonymity/

Guidelines needed to protect anonymity

It's an information free for all

By Mark Rasch

Posted in Security, 30th August 2006 09:24 GMT

In early August, officials at America Online released information about searches conducted by AOL members and users of the AOL search tool. This historical data was released onto the internet by several AOL officials to demonstrate how useful such data could be for tracking the patterns, uses and interests of AOL members.

The data was anonymised, with members being assigned random ID numbers instead of user IDs or names, and was only online for a few days.

The New York Times demonstrated, however, how easy it was to take that anonymised data and, with a few keystrokes, determine the identity of a searcher and their personal interests, likes and dislikes – indeed, to create a profile of users from this anonymised data.
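
As a rough illustration of why swapping names for random ID numbers offers weak protection, the sketch below groups a made-up search log by its pseudonymous ID. Every ID and query is invented for illustration; none of it comes from the AOL release, and `build_profiles` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical pseudonymised search log: (random_id, query).
# All values are invented for illustration only.
SEARCH_LOG = [
    (1001, "plumber in exampleville"),
    (2002, "cheap flights to denver"),
    (1001, "knee pain when running"),
    (1001, "jane doe exampleville high school reunion"),
    (2002, "denver hotel deals"),
]


def build_profiles(log):
    """Collect every query issued under the same pseudonymous ID."""
    profiles = defaultdict(list)
    for user_id, query in log:
        profiles[user_id].append(query)
    return dict(profiles)


profiles = build_profiles(SEARCH_LOG)
for user_id, queries in profiles.items():
    # Each ID now carries a town, a health concern and possibly a name.
    print(user_id, queries)
```

The ID itself says nothing, but the queries filed under it can add up to a location, a health concern and even a name, which is the kind of profile the Times was able to assemble.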

The persons responsible for the "data breach" at AOL were fired – more for a public relations problem than anything else. The case demonstrates how any database, once collected, can be misused, and the significant lack of legal protection for similar information.

Personally identifiable

Privacy laws, both in the United States and abroad, generally govern the collection, dissemination and use of "personally identifiable information" of various types and classes. This includes, for example, identifiable banking or financial information, personal health information, credit card or payment card information, and personal communications (for example, the contents of emails).

Aggregated information on the other hand is not generally afforded the same level of protection. Thus, information about trends, overall internet use, health care utilisation, overall buying patterns, and the like is generally treated as the property of the institution that creates, collects, stores or collates this information.

If it is easy to convert the aggregate information into identifiable information, it may be afforded some level of protection, or may still be treated as identifiable information.

For many companies, there is a blurring of the lines between personal information (that is, information about ME) and aggregate information. So, for example, Google collects information about every single thing I look for – every search request, the contents of everything delivered, what I click on, where I go from there.

It keeps both the aggregate information (how many people buy stuff off those ads on the side) and the personal information (tell me everything YOU have looked at this month). The aggregated information is analysed, processed, sold, and used by Google to increase advertising revenue, do load balancing – all kinds of things.

The same is true of ISPs and ecommerce sites. They collect and analyse massive amounts of information about even the most intimate details of your life – who you chat with, who you email, what you read, what you post, and potentially even the source, destination and length of your VoIP calls.

Unless they have agreed not to in a Terms of Service agreement, there is virtually nothing preventing them from using this data, in an aggregated and "anonymous" fashion, and very little preventing them from using it otherwise.

Governments – particularly the US government – have taken advantage of this fact to attempt to obtain massive amounts of information. For example, during the course of litigation involving the government's efforts to prohibit materials that are "harmful to minors" the US government subpoenaed from the largest search companies (Yahoo!, MSN, and Google) massive amounts of such aggregate information.

When the government obtained the cooperation of various telephone companies in turning over massive amounts of telephone calling records (non-content information), it apparently argued that such aggregated information (in that case not anonymised) was not entitled to legal protection.

The problem is that, as The New York Times demonstrated, it is relatively easy to trace this anonymised information back to its source.

More damaging to privacy, once you know the source, you end up learning a tremendous amount of information about that source. Medical publications, for example, routinely publish information about patients, but attempt to "anonymise" this information by stating things like "patient X, a 24-year-old Caucasian from Mobile, Alabama presented to the Emergency Department with a history of..."

Given enough material with which to cross reference the data, you can probably figure out who the patient was, and from this, his entire medical history. Massive databases make this job easier and therefore threaten privacy.
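
The cross-referencing described above can be sketched as a simple join on shared attributes. This is a minimal sketch under stated assumptions: both data sets, the field names and the `link` helper are hypothetical, and real linkage would use far more attributes and far larger databases.

```python
# Hypothetical "anonymised" case reports: no names, but age and city remain.
CASE_REPORTS = [
    {"age": 24, "city": "Mobile", "diagnosis": "condition A"},
    {"age": 61, "city": "Mobile", "diagnosis": "condition B"},
]

# Hypothetical second data set that does carry names (invented entries).
PUBLIC_RECORDS = [
    {"name": "Person One", "age": 24, "city": "Mobile"},
    {"name": "Person Two", "age": 61, "city": "Mobile"},
    {"name": "Person Three", "age": 61, "city": "Mobile"},
]

# Attributes present in both data sets and used to cross-reference them.
QUASI_IDENTIFIERS = ("age", "city")


def link(reports, records, keys=QUASI_IDENTIFIERS):
    """Return, for each report, the names whose attributes match it."""
    matches = []
    for report in reports:
        candidates = [
            rec["name"]
            for rec in records
            if all(rec[k] == report[k] for k in keys)
        ]
        matches.append((report["diagnosis"], candidates))
    return matches


linked = link(CASE_REPORTS, PUBLIC_RECORDS)
for diagnosis, candidates in linked:
    # A single candidate means the report points at one named person.
    print(diagnosis, candidates)
```

In this toy data the first report matches exactly one name, so that patient is effectively identified, while the second still matches two. Each extra attribute or extra database tends to shrink the candidate list further.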

This problem is illustrated in a recent patent application filed by Amazon.com. On 10 August, Amazon filed a patent for something called a "gift cluster" that would look at a person's past purchasing history (including things they looked at but didn't buy), what they ordered, what they had delivered and where (zip code and other demographics), together with other information they either collect or purchase from others (linking other massive databases) to profile their users.

OK, well, they already do this. Indeed, according to their Terms of Service, they use this information to share with their partners and affiliates, and to suggest items for me to buy next time I log in. That's why Amazon keeps trying to sell me these silly lawyer books by Scott Turow.

The patent application also says that Amazon will essentially be able to profile me in terms of my religion, income, purchasing habits, and even sexual orientation.

I am reminded of an episode of The Mind of The Married Man, a now defunct HBO series, in which the lead character’s Tivo recorder decided that he was gay, and suggested programming like Liza Minnelli specials for him to watch. Even this fact in the Amazon patent application is not so startling, and one assumes that they are already doing this.

What the Amazon patent allows Amazon to do is to suggest gifts for me based on the information they have collected and the ubiquitous "other information". It is not clear in the application whether this will be done with or without my knowledge or consent. Thus, the HAL 9000 computer at Jeff Bezos’ office may decide that I like golf, and suggest that my cousin buy me new clubs for my birthday (please don't), or that I need a new Treo 750 (please do).

It is possible that, under this patent, a person could ask Amazon to suggest a gift for me, and in that way essentially "mine" the database and learn my likes, dislikes, preferences, etc. This is one of the ways that anonymous databases (in this case, an anonymous database about a known person – me) can become exposed, but there are many more ways to compromise the database. Cross-referenced databases, "phishing"-type attacks, and social engineering are all ways to compromise the privacy of the database.

This is all well and good if the only thing that happens as a result is that I get offered more appropriate advertising (guess I don't need that Rochester Big and Tall ad after all), or that the gifts I get are more to my liking (hmmm.. that tie from Indonesia vs. a new video iPod?) It's also fine if I can control the content and use of the information.

The problem is that the information can also be used substantially to my detriment. I can be denied employment or insurance because of a perception about my preferences. "Don't ask, don't tell" doesn't seem to apply to my purchasing habits. For example, many medical professionals may refuse to take on patients who either have a history of filing malpractice actions, or who have, based on their profile, a perceived propensity toward litigation. Finally, you never learn of the offers you didn't get because your profile led someone to think you wouldn't want them.

While laws such as the EU Data Privacy Directives and their equivalents in Asia generally give the data subject the right to access and correct personal information collected, this right may not extend to aggregated information – which ultimately is nothing more than lots of personal information.

The laws need to be tightened. We need to redefine personal information as any information from which the identity of a person can reasonably (and sometimes unreasonably) be determined. This is actually the general standard for laws like HIPAA, but it is generally not well enforced.

More importantly, we need to have some guidelines on what general information can be collected, collated, analysed, and processed, both by governments and the private sector. Until then, it's generally a free for all. And oh, my birthday is coming up...

This article originally appeared in Security Focus.

Copyright © 2006, SecurityFocus