The Register® — Biting the hand that feeds IT

British Library wants taxpayer to gobble the web

Cost? We don't know

The British Library wants to archive the UK web, creating an invaluable national treasure trove of porn, celebrity trivia, gossip and Daily Mail comments. But it admits it can't put a figure on the project - which looks like becoming a huge, open-ended commitment for the taxpayer.

Today the Library stepped up the pressure for the law to be changed, allowing copyright libraries to create copies of other copyright holders' web material for research purposes. Five statutory libraries already have permission to make printed material available. Now the British Library says it wants the Web too.

"It's not a request for additional funding," a BL spokesperson said, but they couldn't say how much the creeping mission would end up costing us. At first, the BL won't archive every Tweet, but will do an annual crawl, with some sites such as No 10 Downing Street archived more often. That would amount to 220TB of data, which it reckons would cost about £4,000 in storage.

But that would barely make a dimple in a replica of UK web output, now that so many non-web chat areas have migrated to a home between angle brackets. The BL acknowledges there are eight million sites.

What, we wondered, was the point of archiving every single "Ashlee Cole iz a slag" typed into a browser?

"It may be that somebody wants to look back and research celebrity and this could be important to their research," we were told.

No doubt. But every Tweet and comment?

It was cheaper, the spokesman assured us, than employing a curator to choose between the best Ashley/Cheryl comments (for example).

Ah, right. So the mechanics dictate the curation policy.

But it was also fairer, he added, because the neutral, objective web bot couldn't be accused of bias. Even in such momentous national conversations as the Cole divorce.

There are plenty of comments flying around this morning wondering why public money should be required to archive more than a handful of websites. Especially with Brewster Kahle's Archive.Org, which is privately funded.

At first the library told us the public was unaware that websites disappear without some part of the British state keeping a copy - an interesting claim. I've never met anyone who thinks all websites are preserved by some silent, omniscient backup programme.

Then the Library told us that the private sector couldn't be trusted to do the job, because future funding couldn't be assured. But with the British state in the red to the tune of £180bn this year, a deficit larger than Greece's in GDP terms (12.8 per cent), and frontline services such as nurses facing the chop, it's questionable whether anyone would prefer to keep a copy of those Mail comments instead. ®

But...

The current law on books is that every book or periodical that gets published commercially in the UK must be supplied to five libraries that hold copies in perpetuity. There is no judgement on suitability. If it's published, it's in. They are just trying to maintain the status quo, and I think that's a good thing. I have seen many websites vanish with only a partial mirror at archive.org. Among the legions of dross at Geocities, there were several gems that were lost, including one of the two best internet libraries of Scottish Gaelic song lyrics.

Then there's the idea of corpus research. Having access to all these tweets and comments would allow language researchers to examine questions like how the internet is changing literacy, and that is a genuinely interesting and important topic.

Anonymous Coward

It'd be another completely pointless use of taxpayers' money

http://www.archive.org/


Finally, an answer!

I sent a request to ask the BL whether they could archive some of my online work several years ago, for copyright purposes. I suppose this is an answer of sorts.

I know they were having extended discussions about how to archive the data, since digital degrades horribly -- is there any word on that?

It's pretty neanderthal for people to be worrying about the trivial cost of this. I use the BL quite a lot and am thankful that it has archived stuff that a previous commentard would think "irrelevant" from the 16th Century, at far greater expense I might add.
