The trails left in Web server logs – and who's seeing them
Fear of a million Big Brothers
NEW YORK--The privacy advocates and civil libertarians at the 13th annual Computers, Freedom and Privacy conference sometimes seem dwarfed by the enormity of the projects they oppose -- larger-than-life enterprises worthy of a James Bond villain.
John Poindexter's Total Information Awareness project, if successful, would combine every government and private sector database into a massive data mining system capable of picking out aberrant behavior in the actions of seemingly ordinary citizens. The Department of Homeland Security's CAPPS II program aims to run automatic background checks on every airline passenger in the U.S.
But the day before CFP 2003 began, a smaller invitation-only group of technologists and policy wonks met at the conference site to discuss a matter that some say is just as important to Internet privacy as any of the monolithic omniscient supercomputers being hatched in Washington: the humble Web server log.
Or more to the point, the countless thousands of logs routinely kept by servers throughout the Internet, each marking every visit to a given website, identifying what pages were viewed, what transactions were made, and the IP address of the visitor. Recent laws have made it easier for government agencies to get their hands on server log entries, and civil litigators are increasingly finding logs a valuable target for subpoenas. At the same time, the art of wringing every ounce of useful information out of such logs is advancing, as is the ease of tracking down a user's identity from their IP address by correlating data from different sources.
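To make concrete what one such entry holds, here is a minimal Python sketch that splits a single line in the Common Log Format, the layout Apache can write by default, into its fields. The sample entry, the pattern name CLF_PATTERN and the function name parse_clf_line are made up for illustration and are not drawn from any real log or from the working group.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf_line(line):
    """Return the fields of one Common Log Format line as a dict, or None."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical entry; 192.0.2.44 comes from a range reserved for documentation.
sample = ('192.0.2.44 - - [01/Apr/2003:10:15:32 -0500] '
          '"GET /health/depression.html HTTP/1.0" 200 5120')
entry = parse_clf_line(sample)

# Each visit ties an address and a timestamp to the exact page requested.
print(entry["host"], entry["time"], entry["request"])
```

Even this stripped-down format records who asked for what, and when. Servers configured with richer formats may also keep the referring page and browser details.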
Last month, scientists at Carnegie Mellon University's Laboratory for International Data Privacy even published a formal algorithm for "re-identifying" a Web surfer from pieces of information left like breadcrumbs on different sites. "The methodology involves constructing trails across locations from small amounts of seemingly anonymous or innocuous evidence the person has been there," the paper reads.
That's a troubling prospect to privacy advocates, at a time when activists and human rights workers in repressive countries are using the Internet to communicate, while ordinary netizens are turning to the Web for things like medical information or personal finance. "It's our sense that certain companies have entire staffs dedicated to handling subpoenas and court orders, and quite often those subpoenas and court orders involve usage logs," says Will Doherty of the Electronic Frontier Foundation.
Smaller companies may be keeping logs without thinking about the potential for misuse, and a careful Google search can turn up random server and proxy logs sitting unprotected on the Web. "Most people don't give it any thought; their default is to just log anything in Apache or IIS," says Richard Smith, a technology and privacy consultant. "At most, they have to worry about how much disk space it's taking up."
It's with that vision of a million tiny surveillance logs growing like weeds that the informal "User Log Data Management Working Group" held its first day-long meeting Tuesday. "We got as far as discovering the extent of the problem, and some sense of who had an interest in it," says Jeff Ubois, the workshop's organizer. The 18-odd attendees included Doherty and Smith, along with Internet archivist Brewster Kahle, FTC consumer-protection attorney Laura Mazzarella, and John Young, curator of the controversial full-disclosure cryptography and intelligence site Cryptome.org. Young, who himself has received at least one broad subpoena for usage log information, takes pride in deleting his logs on a daily basis.
Nobody expects Yahoo or MSNBC.com to delete their logs every day. But attendees say the workshop concluded that companies of all sizes need to become more familiar with the privacy risks of their routine logging. The group plans to launch an education campaign to dispel the notion that Internet surfing is anonymous by default. "If it becomes widely believed that IP addresses are personally identifiable, that has implications for businesses that are logging them," says Ubois.
The group is also working on specifications for a free open-source tool that would allow administrators to easily trim unwanted information from their logs. Smith, who occasionally moonlights as a forensic crime fighter, admits that Web server logs can serve a valuable purpose in tracking down bad guys. But he says webmasters should know the significance of the data they routinely collect. "Most of this is about educating people that this could leave them in the legal line of fire," he says.
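The group's tool had not been specified at the time of the meeting, so the following Python sketch is only one guess at what "trimming" could look like: it zeroes the last octet of an IPv4 address at the start of each log line and leaves everything else alone. The function name mask_client_address and the sample lines are hypothetical, and the sketch assumes the client address is the first field, as in the Common Log Format.

```python
import re

# Matches an IPv4 address at the very start of a line, capturing the
# first three octets so the fourth can be replaced.
LEADING_IPV4 = re.compile(r'^(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}(?=\s)')

def mask_client_address(line):
    """Zero the last octet of a leading IPv4 address; leave other lines as-is."""
    return LEADING_IPV4.sub(r'\1.0', line, count=1)

# Hypothetical entries; 192.0.2.x is a range reserved for documentation.
raw = '192.0.2.44 - - [01/Apr/2003:10:15:32 -0500] "GET /index.html HTTP/1.0" 200 5120'
named = 'host.example.org - - [01/Apr/2003:10:15:40 -0500] "GET / HTTP/1.0" 200 812'

print(mask_client_address(raw))    # address now ends in .0
print(mask_client_address(named))  # unchanged: no leading IPv4 address
```

Masking one octet only narrows a visitor down to a block of 256 addresses; it reduces the precision of what is stored without making the entries anonymous, and a webmaster who needs stronger protection would have to drop or delete more.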