Inside Internet Archive: 10PB+ of storage in a church... oh, and a little fight to preserve truth
Stopping the powerful changing history, Orwell style
At the Internet Archive's headquarters in San Francisco, California, on Wednesday, technologists, educators, archivists, and other fact-oriented folks gathered to discuss how they and the like-minded can save news from the memory hole – a conceit conjured by George Orwell to describe a political mechanism for altering the truth.
The event, Dodging the Memory Hole 2017, was the fifth such gathering since 2014, sponsored by the Donald W. Reynolds Journalism Institute and a grant from the Institute of Museum and Library Services. It comes at a time when news publishers in the US face heightened hostility from the Trump administration, not to mention ongoing revenue pressure.
The Internet Archive is a non-profit digital library, explained founder Brewster Kahle during his keynote presentation. The Archive's goal, he said, is to provide universal access to all knowledge. In that it echoes Google's self-avowed aspiration, but without the ads, data harvesting or commercial chicanery. And with a handy little copyright exemption.
The organization is based in an old and rather grand Christian Science church in the Richmond district of San Francisco, and it keeps online copies of books, audio and video recordings, texts, software, and more, like you'd expect from a digital library. It is best known, perhaps, for the Wayback Machine: a backup cache of 308 billion webpages scraped automatically from the public internet. The data is stored on servers in California with a total capacity of 35PB – 10PB of which we saw sitting at the back of the church.
To underscore the Internet Archive's civic purpose, Kahle recounted how on May 1, 2003, the White House issued a statement about the Iraq war: "President Bush Announces Combat Operations in Iraq Have Ended." That declaration was subsequently modified without notice to read: "President Bush Announces Major Combat Operations in Iraq Have Ended."
Later, Bush's statement was removed from the web, but remained preserved in the Internet Archive. It would be December 2011 before combat operations in Iraq actually ended, at least from the perspective of the Obama Administration.
"We want to make it so you can't just take things off the net and put them down the memory hole," said Kahle.
Kahle and others made it clear that today's political climate has added a sense of urgency to digital preservation efforts. Following the 2016 election, the Internet Archive and its community of concerned archivists worked to capture 100TB of information from government websites and databases out of concern it might vanish. It's a job with no end in sight.
"Things are very dangerous right now for internet content," said Art Pasquinelli, LOCKSS partnership manager at Stanford University.
Information on the internet is being filtered and fractured through social networks, Pasquinelli suggested. It's often presented without useful context. Data sets may become inaccessible.
If there's any good news, it's that the Internet Archive itself hasn't been attacked directly, at least in any major way, in an effort to stop its work. "We don't see people trying to modify the records that we've stored," Kahle told The Register. "We haven't felt like we've been attacked. We've been used mostly for the purpose that we've been designed for."
The Internet Archive isn't so much concerned with preventing the spread of misinformation as with making sure information of all sorts remains accessible.
"We're not a good judgement organization, but we can build collections and make them permanent," Kahle said.
Kahle would like to see social networks do more to make data available.
"I found it curious that Facebook didn't have the ads that they used to run," he said, in reference to the social network's role in distributing and then losing divisive Russian-backed political ads and posts during the US presidential election. "And we probably don't have those either, because we don't archive Facebook very well."
Among those presenting at the event, the focus was on tools for keeping information alive, such as the digital archiving software LOCKSS, and Robust Links, a proposal for adding more information to links to prevent reference rot online.
Reference rot encompasses both link rot, when links on pages no longer work, and content drift, when later edits to a page undermine past citations of it.
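For the code-minded, the two flavors of rot can be told apart roughly as follows. This is a minimal sketch and not any tool shown at the event: the function names and the idea of storing a SHA-256 fingerprint at citation time are assumptions made for illustration, and an exact hash is a crude test, since any change at all to the page counts as drift.

```python
import hashlib
from typing import Optional


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the page body (hypothetical helper)."""
    return hashlib.sha256(body).hexdigest()


def classify_reference(
    status: Optional[int],
    cited_fingerprint: str,
    current_body: Optional[bytes],
) -> str:
    """Classify a cited link as 'link rot', 'drifted', or 'intact'.

    status            -- HTTP status from fetching the link now, or None if
                         the request failed outright
    cited_fingerprint -- fingerprint recorded when the citation was made
    current_body      -- page body fetched now, or None if unavailable
    """
    # The link no longer resolves to a page at all: link rot.
    if status is None or status >= 400 or current_body is None:
        return "link rot"
    # The page loads, but it no longer says what it said when cited.
    if fingerprint(current_body) != cited_fingerprint:
        return "drifted"
    return "intact"
```

By this measure, the quietly edited White House headline Kahle described would count as drifted, not rotted: the link still worked, but the words behind it had changed.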
The Internet Archive has been fighting reference rot with bots. Mark Graham, director of the Wayback Machine at the Internet Archive, said that the Archive worked with Wikipedia over the past year to find and correct some 3.8 million broken links.
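What such a bot does at its core can be sketched in a few lines. This is not the Archive's or Wikipedia's actual bot; it is an illustration that assumes the Wayback Machine's public availability endpoint at archive.org/wayback/available and a JSON reply carrying an `archived_snapshots.closest` entry. The function names are made up for the example.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Wayback Machine's public availability lookup.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: str = "") -> str:
    """Build the lookup URL; timestamp is an optional YYYYMMDD-style hint."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload: dict) -> Optional[str]:
    """Pull the archived copy's URL out of a reply, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_replacement(url: str, timestamp: str = "") -> Optional[str]:
    """Ask the endpoint for an archived copy of a broken link (needs network)."""
    with urlopen(availability_query(url, timestamp), timeout=10) as response:
        return closest_snapshot(json.load(response))
```

A bot would call something like `find_replacement()` for each dead link it finds and, when a snapshot comes back, swap the archived URL into the page in place of the broken one.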
"One of the problems we've been working on is trying to help make the web more reliable," said Graham.
The Internet Archive has a couple of dreams, said Kahle. One is getting fresh copies of its data out of the US, because it's good to have an offsite backup. Mirrors are held in the Bibliotheca Alexandrina, in Egypt, and a location in Europe. Another is the decentralized web. And then there's the one about footnotes.
"We'd like to turn all footnotes blue," said Kahle. "Wouldn't it be great if PDF viewers – Preview on the Mac or the one that's bundled into Firefox – were to go look for footnotes and turn them all into hypertext links?" ®