British Library tracks rise and fall of file formats

Analysis of 2.5 billion online files suggests software obsolescence slowing

File formats and the software capable of reading them are living longer than previously thought, according to a British Library and UK Web Archive study.

Formats over Time: Exploring UK Web History considers 2.5 billion files that author Andrew N Jackson retrieved with the help of the Internet Archive and the Joint Information Systems Committee (JISC). All the files come from “the UK web domain” and date from between 1996 and 2010.

Jackson used Apache Tika and PRONOM's DROID tool to inspect the files and determine the format they use. Central to the research was Jeff Rothenberg's 1997 prediction that “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” Jackson is also keen on a rebuttal from David Rosenthal, whom he quotes as saying: “When challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.”
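For readers wondering what "determining the format" involves, the idea can be sketched in a few lines of Python. This is not how Tika or DROID work internally, and `sniff_format` is a hypothetical helper invented here for illustration. It only checks two well-known markers, the `%PDF-x.y` header at the start of a PDF and the public identifier in an old-style HTML DOCTYPE, and gives up on everything else.

```python
import re

# Hypothetical, illustration-only patterns; real tools such as Tika and DROID
# rely on much larger signature databases than these two markers.
PDF_HEADER = re.compile(rb"%PDF-(\d\.\d)")
HTML_DOCTYPE = re.compile(
    rb'<!DOCTYPE\s+HTML\s+PUBLIC\s+"-//(?:W3C|IETF)//DTD HTML (\d+\.\d+)',
    re.IGNORECASE,
)


def sniff_format(data: bytes) -> str:
    """Guess a format and version from the first kilobyte of a file."""
    head = data[:1024]

    # A PDF announces its version in the very first bytes of the file.
    match = PDF_HEADER.match(head)
    if match:
        return "PDF " + match.group(1).decode("ascii")

    # Older HTML versions name themselves in the DOCTYPE declaration.
    match = HTML_DOCTYPE.search(head)
    if match:
        return "HTML " + match.group(1).decode("ascii")

    # No recognised marker: the file stays unidentified.
    return "unknown"
```

A page that omits its DOCTYPE comes back as "unknown" here, which hints at why any identification based on markers in the file is bound to miss some files.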

Jackson's take is that file formats seem to last rather longer than five years even if they don't survive forever.

“While there were just two active versions of HTML in 1996 (2.0 and 3.2), all six were still active in 2010,” he writes. “Similarly, there were three active versions of PDF in 1996 (1.0-1.2) and eleven different versions in 2010 (1.0-1.7, 1.7 Extension Level 3, A-1a and A-1b, with 1.2-1.6 dominant). In general, it appears that format versions, like formats, are quick to arise but slow to fade away.”

HTML versions found online in the UK between 1996 and 2010

Jackson attributes formats' longevity to the Network Effect, but also writes that he is uncomfortable drawing firm conclusions about software obsolescence given the sample is UK-centric and the tools used to analyse data identify files imperfectly.

He nonetheless concludes:

Our initial analysis supports Rosenthal's position; that most formats last much longer than five years, that network effects appear to stabilise formats, and that new formats appear at a modest, manageable rate.

But he also warns that there are “a number of formats and versions that are fading from use, and these should be studied closely in order to understand the process of obsolescence.” ®
