Original URL: http://www.theregister.co.uk/2013/12/11/feature_geeks_guide_uk_national_archive/

How the UK's national memory lives in a ROBOT in Kew

El Reg visits the National Archives

By Joe Fay

Posted in Geek's Guide, 11th December 2013 11:21 GMT

Geek's Guide to Britain The UK’s National Archives in Kew have enough gems and hidden secrets to keep Indiana Jones or Robert Langdon in sequels for the next couple of centuries, with everything from the Domesday book to the UK’s official UFO records socked away safely in its sanctum.

But for the swashbuckling archaeological hero of the future, whips, guns, and medieval symbolism will not be enough to unlock all of Kew’s secrets. They will also need the ability to manipulate a tape library and emulate long-dead file formats.

That’s because over the next couple of decades, the expansion of the UK’s official national memory will no longer consist of handwritten vellum-based documents and lovingly typed Cabinet meeting minutes in buff-coloured folders. Rather, with government business increasingly being conducted electronically, the National Archives will also expand virtually, one cartridge at a time.

As the national memory of Britain, the National Archives is a comparatively recent invention. For centuries, official documents, crucially the statutes passed by Parliament and the judicial decisions that defined the UK’s common law, were dispersed across bundles and trunks at sites including the Tower of London and the Palace of Westminster. Other key documents were in private hands.

A little history of a lot of history

Things became more systematic with the creation of the Public Records Office in the early 19th century. This was long headquartered in a soaring Victorian Gothic building on Chancery Lane, though procedures were occasionally subject to the whim of individual “keepers”.

The Grigg report of the 1950s laid down a new mission and methodology for choosing which documents should be preserved, including the 30-year rule*. This new modern approach to document-keeping clearly required a new modern building for the documents.

The Ministry of Works, in its wisdom, chose a site in Kew: a WWI era hospital, which already housed the government’s National Savings and Investments operation. The deeds of the historic few won out over the chances of riches for the modern hoi polloi, and ERNIE (Electronic Random Number Indicator Equipment - the machine that generates the winning Bond numbers each month) and the National Savings Institute staff were shunted off to Blackpool. The new PRO building was opened in 1977.

Kew gained The National Archives moniker in 2003 after the PRO was merged with other record-keeping organisations including the Historical Manuscripts Commission. The current site comprises the original PRO building, a fine example of '70s municipal brutalism designed by the Ministry of Public Works, and a newer, lighter annexe.

The original building is saved from looking like a 1980s dole office or job centre by being skinned in light-coloured concrete, rather than the shit-coloured pebble-dash of so many of its contemporary government buildings. It is linked to the new building by a light-filled atrium/reception. With the enormous water feature out front and the leafy riverside location, it really is a nice place to visit. You’re as likely to be run over by a bunch of over-excited schoolkids as by hordes of tweed-encased researchers.

National Archive Entrance

To reach the dark archive, first cross the water

Despite the name change, the operative word is still “public”. Anyone can walk into the lobby, veer left and hit the small museum covering some of the key items in the collection, including the Domesday book, or head upstairs to the public reading rooms, where you can consult records online or on microfiche. Register for a “readers' ticket” and you can request original documents.

Beyond the museum is the inevitable bookshop and a canteen, also public. When we visited there was chocolate bread-and-butter pudding on the menu, and the BBC’s current favourite historian Lucy Worsley signing books in the bookshop. You might not be so lucky when you visit. Sometimes it’s rhubarb crumble.

After tea in the canteen with CIO David Thomas, we were taken into the newer wing, the domain of the archivists, web and database developers, historians and restorers who maintain the collection, and ensure that it is available to the people who pay for it - ie, you.

The new wing has a long light-filled atrium, which is lined with framed photographs. On closer inspection these prove to be the staff’s own work: holiday landscapes, wildlife, still lifes. Dotted amongst these are also lots of posters for yoga and meditation classes. They must work. The people we met there seemed engaged in their work, but not unduly stressed. The archive has been a thousand years in the making, so perhaps that gives you a slightly different perspective on time.

It possibly also helps that the site - or at least the repositories where the documents are kept - is literally bombproof. Fire safety is also paramount, with suppressant gas on tap rather than water. And unsurprisingly, the site is built to withstand floods, reassuring given its Thames-side location.

As you approach the repositories themselves - the actual document storage areas - you are gently reminded that these are pencil-only areas. Got that? No pens, no food, no drink. No smoking too? Need you ask.

Keep stuffing the repositories

As for the rooms themselves, don’t be thinking Raiders of the Lost Ark. The repositories are large, though not cavernous, rooms: the ceilings need only be high enough to accommodate top shelves within arm's reach. Warning posters detail exactly how to protect both the nation’s history and your back when handling the bigger box files. The repositories in the older part of the complex have narrow windows to afford the bookworms access to daylight, while ledges and skirting boards are brightly painted, all part of the old Ministry of Works strategy to keep the workforce happy. Temperature and humidity are carefully managed across the repositories, to keep mould and other threats at bay.

Old scrolls at the National Archive

Scroll on a roll

We saw a pile of blackened rolled-up manuscripts on a trolley, like props from a Harry Potter film, en route to or from some historical researcher. Picking a box at random off a shelf in one of the storage rooms, we found field survey books for a proposed early 20th century land tax. It never happened. WWI intervened, but the neat handwritten notes, still piled up down in Kew, give a copperplate, ground-level snapshot of England just before that particular catastrophe.

Some plan drawers yielded a hand-drawn view of the 18th century harbour at Guadeloupe. This was in French: how did this French map come into British hands? Other drawers revealed sheet after sheet of tobacco and whisky advertising posters from the 19th century, a legacy of the Worshipful Company of Stationers’ erstwhile role in registering copyright.

Other shelves also hold regimental records, civil service papers, cabinet minutes, various PMs’ correspondence, railway plans. Anything that is needed to track the progress and development of a country and much more besides.

Rolling shelves at the National Archive

History, boxed and filed

Of course, you need some way of navigating this pile of history. Until the mid-1990s, this was via the vast paper-based catalogue. Even with this, tracking a soldier’s career via his paybooks and regimental records, or tracking the policy turns that drew the UK into WWI, meant a researcher needed to know, almost feel, their way around the collection - even if only PRO employees were allowed to enter the repositories to retrieve the documents.

If only there was a machine that could pinpoint exactly the document you need and tell you where it is. Or even deliver you a copy instantly. Even faster than the microfiche machines in the reading rooms...

When the PRO opened, back in the 1970s, it had one computer: a DEC-based docket ordering system used to manage requests to retrieve an item from the archive.

As David Thomas, CIO and a 40-year veteran of the PRO, explains, in the mid-1980s a few PCs started to appear on the desks of key members of staff. However, these were not networked until some time in the mid-1990s. Around the same time, the PRO took its first steps onto the web - though there is no evidence for that, as that first site was not actually archived.

If mid-'90s surfers found this system less than interactive, things were not much better for the PRO staff administering the site. Thomas says the process for updating the site involved sending a floppy disk to the government’s Central Computer and Telecommunications Agency in Norwich, once a month.

The web operation was brought in-house in 1996 - though it has fluctuated between in-house and contracted out ever since.

The catalogue itself began to move online in 1998, the first national archive in the world to do so. Someone presumably had to key it in - a gargantuan task. The catalogue currently stands at 21 million plus entries.

Who do you think we was?

The closest thing to a big bang for the archive was the release of the 1901 census online in 2002.

The names on this census were people within living memory of 21st century Britons with access to computers and the net. The stage was set for a massive explosion in interest in genealogy - or a complete disaster. In the event it was both, with Thomas confirming the system - built by sometime defence contractor Qinetiq - was completely “overwhelmed” initially as contemporary Britons opened their shiny laptops, fired up their new internet connections, and found out just how grindingly poor their great, great grandparents were.

It got over it though. Last year, the website clocked 13 million visits from 230 countries. The UK accounted for 63.6 per cent. Interestingly, “wills” was consistently the top search term last year, except for one month when it was ousted by UFO.

Surprisingly perhaps, the website, and in fact all the Archive’s IT, is handled on-site. There are 210 servers, all Xeon-based HPs. A dozen of these are used to host a further 150 virtual servers. The site currently has 316TB of storage on tap.

The catalogue too is an in-house development. The “new” catalogue, Discovery, was launched in 2011. It is based on MongoDB, and subject to regular updates. At just over 100TB when first deployed in 2011, it is expected to run into PBs by 2014.

The archive itself is about to be subjected to a tsunami of data for two reasons.

Firstly, you might not have noticed, but the UK government has shifted from the Grigg report’s 30-year rule for shifting documents from Whitehall closets to public archives, to a 20-year rule. This is why documents covering the Falklands War began going public last year. Serving politicians and civil servants’ early career screw-ups are now likely to come into public view mid-career.

Digital birth pains

For the more technically minded, the early 1980s also corresponds with the time when computing, electronic comms and electronic record keeping started to make their way into government. The assumption is that by 2025 pretty much everything that finds its way into the archive will be “digital born” - though the Archive expects some departments will still be sending paper down to Kew “for many years after that date”.

This might sound like a recipe for a pretty seamless archiving process. You produce the definitive electronic version of a document, it is circulated, then in time finds its way to Kew and is preserved for posterity.

On the other hand, consider this. What word processor were you using 20 or 30 years ago? Where are the files you created with it? Have you still got the floppy disks it came on? Don’t tell me you have files you created in Microsoft Works?

Scale that up across the sprawl that is central government, with its shifting departmental structures, erratic and silo’d procurement strategies, and at times piecemeal upgrade programmes. Throw in the debate on politicians’ and advisers’ use of private PCs.

Suddenly you can see the prospect of a first world country that can no longer access, much less understand, its own historical documents.

So, to head off this nightmare, the archive has developed its own file format ID tool, Pronom. As Alex Green of the Archive’s Digital Records Infrastructure team explains, they “point it at a collection and it IDs the file formats”. Except when it doesn’t. When the archive took delivery of the records from LOCOG after the Olympic torch was snuffed out, it threw up lots of formats that weren’t recognised. Had the long-feared cyberattack on the London Olympics come to pass, albeit after the closing ceremony? The answer was a lot more prosaic. It turns out that LOCOG had tended to work on Macs, which as Green gently puts it, "is not usual in government".
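For the curious, the basic idea behind signature-based format identification can be sketched in a few lines of Python. This is a toy illustration only, not the Archive's tool: the function names and the four-entry signature table are our own assumptions, chosen from well-known "magic bytes", and a real format registry holds far more signatures and more elaborate matching rules.

```python
from typing import Optional

# Illustrative signature table: (leading "magic" bytes, format label).
# A hand-picked assumption for this sketch, not a copy of any real registry.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"PK\x03\x04", "ZIP container (also used by OOXML and ODF)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy MS Office)"),
]


def identify_bytes(head: bytes) -> Optional[str]:
    """Return a format label if the leading bytes match a known signature."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None  # unrecognised: the "except when it doesn't" case


def identify_file(path: str) -> Optional[str]:
    """Read only the first few bytes of a file and try to identify it."""
    with open(path, "rb") as fh:
        return identify_bytes(fh.read(16))
```

Anything that comes back as `None` is the sort of file that would need a human to work out what it is, and ideally a new signature added to the table.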

At the same time, the team has collaborated with Tessella to develop a tool called Safety Deposit Box, which it describes as "a risk-based system to identify formats in danger of becoming obsolescent".

Green says, “We make sure everything we get is on a hard drive. It’s backed up in the dark archive [more on that later]...you can’t put anything in there where we don’t know what it is. It’s very controlled.” Incoming files are integrity-checked.
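The article doesn't say how the Archive's integrity checks work, but a common approach is to compare a cryptographic checksum of each incoming file against one supplied by the depositor. A minimal sketch, assuming SHA-256 and hypothetical function names of our own:

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large transfers need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(actual_hex: str, expected_hex: str) -> bool:
    """Compare checksums, ignoring case and surrounding whitespace."""
    return actual_hex.strip().lower() == expected_hex.strip().lower()
```

If `matches_expected(sha256_of_file(path), checksum_from_depositor)` comes back false, the file changed somewhere between the department and Kew and should not go anywhere near the dark archive.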

That takes care of the country’s internal documents - the minutes, the policy papers, the grand plans, and the grubby excuses that come in their wake.

But this is just the internal information from the government. And the Armed Forces. And assorted state agencies.

All this, and the web too

What about the stuff that business might call the customer or citizen-facing content? Yep, the National Archives has to look after that too, and it is now also the home of the UK Government Web Archive.

Bizarre as it may seem now, when UK.gov first dipped its toe into the internet 20 years ago, it didn’t occur to many people that websites should be archived.

Some of the earliest UK government ventures on to the web are sadly lost to history. The earliest finds, dating back to 1997, actually came from the Internet Archive Project. However, the Web archive’s Suzy Espley and her team are particularly taken with this early Treasury Page.

Now the archive strives to preserve the UK government’s web presence for posterity: U-turns, right-turns and all. It uses a crawler to trawl the UK government’s web estate, aiming to hit sites every six months. With the government looking to shutter many obscure or unloved sites, the pressure is on. The web archive currently stands at around 80TB, with the crawler pulling in 1.6TB a month. At time of writing, there are 3 billion URLs in the archive, with 1 billion captured last year alone.

But does anyone really care? Seems like they do. Espley said the archive gets around 15 to 20 million page views a month. This often maps to current events - the assumption being that visitors are often cross checking current government positions/statements against previous positions. When we dropped in, "badgers" was a top search term - this was the same month the badger cull had kicked off.

As the NHS’s care.data programme grinds on, old NHS pages detailing the NPfIT will no doubt race up the rankings.

And as government continues to sprawl, no matter who is in power, so does the Web Archive’s purview. Thus it will be extending its archiving activities to cover social media in a couple of months.

Between the web archive and the increasing amount of "born digital" internal government documents, it feels inevitable that the amount of data in the electronic archive will swiftly outstrip that from pre-digital days, if it hasn’t already: no one has calculated exactly how much data a fully digitised version of everything in the Archives would require.

Still, storage companies in particular are regularly telling us more data is now created every couple of days than was created between the dawn of humanity and 2003. If that data fits the mission of the National Archives, then it has to be preserved... somewhere.

That somewhere is well above the waterline, just off one of the repositories, and is known as both the Dark Archive and the Robot Room. The Robot in question is a Sun StorageTek SL3000 tape library. One half runs LTO6, says Thomas, the other half is Sun’s own tape format. The tapes themselves are standard 6.25TB cassettes. It’s clearly an archive. The "Dark" bit of "Dark Archive" refers to the fact that when no one is in the room, the lights are off, rather than any suggestion that this is the “real” archive of the illuminati who really run the UK.

Our history, on tape

As we said earlier, newer collections generally arrive in a digital format and go straight into the Dark Archive. If you’re wondering what key “historical” events we’re talking about, Thomas quickly reels off the Hillsborough Inquiry, the Leveson Inquiry, the Olympic Games and the records for the latest Census.

The tape library has a theoretical maximum of 13PB, and the Dark Archive is expected to hit 6PB by 2020. By then around 0.63PB of data will be added to the archive every year.
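Those figures invite a bit of back-of-the-envelope arithmetic. The sketch below assumes, purely for illustration, that growth stays flat at 0.63PB a year after 2020, which the Archive has not claimed; the function name is our own.

```python
MAX_CAPACITY_PB = 13.0      # theoretical maximum of the tape library
EXPECTED_2020_PB = 6.0      # projected size of the Dark Archive in 2020
GROWTH_PB_PER_YEAR = 0.63   # projected annual intake by 2020


def years_until_full(capacity_pb: float, used_pb: float, growth_pb: float) -> float:
    """Years of headroom left, assuming a constant annual growth rate."""
    if growth_pb <= 0:
        raise ValueError("growth must be positive")
    return (capacity_pb - used_pb) / growth_pb


years = years_until_full(MAX_CAPACITY_PB, EXPECTED_2020_PB, GROWTH_PB_PER_YEAR)
print(f"Headroom after 2020: about {years:.1f} years")  # about 11.1 years
```

On that flat-growth assumption the library would fill up in the early 2030s; if intake keeps accelerating, rather sooner.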

The tape robot at the National Archive

David Thomas walks with a robot

Tape is not perhaps the sexiest of storage media - no helium, little in the way of nanometer scale technology beyond the media itself. But, as Thomas points out, it is a known technology, with us since the 1930s, and cheap - both in cost and environmentally. The tape vendors say it has a 30-year life cycle. While it remains to be seen whether that pans out, the team at Kew tests its existing tapes regularly, we’re assured.

"This is the future of web archiving," says Thomas, picking up a tape, adding, "at the moment".

As an aside, the 1,000-year-old Domesday book and 800-year-old Magna Carta are both written on sheepskin and are still readable to anyone with a passing familiarity with Latin and the urge to pop down to the Museum at Kew. HP and Sun have plenty to prove.

So, what of those shelves of files, books, parchment and the rest? Shouldn’t the National Archives simply digitise the lot, then leave the originals mouldering in a vast annexe to a small room full of tapes?

Unlikely. Before a collection can be digitised, and therefore served up to the website and committed to tape, it has to be prepared for imaging. This is a conservation job in itself. Capturing the image is just part of the process - you then have to produce the appropriate files, transcribe them and prepare them for publication.

National treasures

Just imaging a small collection - say 5,000 images - takes about four weeks. Larger collections take months, or even years. It’s a job for the patient.

So, if the rest of the building felt generally focused and calm, the National Archives’ conservation department takes this to another level. It’s on the ground floor, and when you walk in it’s like stepping back into the school woodwork room - a large space filled with daylight, with rows of large work tables, with aromas of oil and glue. Some of the tools – enormous cast iron presses, and vicious-looking guillotines – look almost as old as the documents.

It’s easy to visualise committed conservators painstakingly repairing precious manuscripts, or rebinding ancient books using artisanal linen and animal-based glue.

But the head of conservation, Juergen Vervoorst, talks as much about the importance of overall “environment” and the “performance of the building”. Which makes sense. Most of the collection is unlikely to be touched from one end of the year to another. Once you head off the environmental threats - heat, light, moisture, critters - the biggest threat is going to be handling.

At least that’s the case with the older materials, whose characteristics are reasonably well known.

“The British government always paid attention to high quality paper. We don’t really have a massive problem,” says Vervoorst.

However, the newer materials - all those laminates, printer papers, etc, that crept into offices from the '80s onwards - have “completely different ageing characteristics... their life expectancy is probably much shorter.” Similarly, as photographic materials became more prevalent, “there [are] more problems to be expected.”

Asked what particular items send a shiver up his spine, Vervoorst reels off the “treasures” housed at Kew. Domesday, not one but two copies of Magna Carta, Shakespeare’s portfolios, Jane Austen’s diary, Henry VIII’s divorce papers from Anne Boleyn.

But for sheer spine-tingling magic, he lights on Kew’s copy of The Treaty of Versailles, which officially tied up WWI, and set the stage for WWII.

Not because of what it was, but for what it could have been: “History could have been so different with a different treaty in 1919.”

It’s easy to imagine Vervoorst, indeed any historian, pondering the signatures of the victors and vanquished on the peace treaty between Germany and the Allied Powers, contemplating how they got there and imagining different futures, different outcomes if they’d signed a different treaty.

Can you imagine a future historian, waxing so poetically as he holds a tape cassette in his hand? You’ll have to wait a few hundred years to find out. ®

Boxes of tapes at the National Archive

This is what history will look like in future

Address

The National Archives in Kew, Richmond, Surrey TW9 4DU

Getting There

Take the overground train or tube to Kew Gardens. The R68 bus terminates outside. By car, set your satnav for TW9 4AD. Parking space is limited and allocated on a first come, first served basis.

Opening Times

The National Archives is open from 09:00 to 17:00 on Wednesdays, Fridays and Saturdays. On Tuesdays and Thursdays it's open until 19:00. It is closed Sunday and Monday.

Website

http://www.nationalarchives.gov.uk/visit/where.htm

Around and about:

Kew is a residential area, but the clue is in the station name - Kew Gardens. This leafy paradise is just minutes from the archive. Recommended locals include The Botanist and The Greyhound.