
Two centuries of Hansard to move online

Web no longer 'somewhere data goes to die'

Parliament hopes to place all Hansard reports - from 1804 to 2004 - online by the end of this year.

Its information management department is using optical character recognition (OCR) technology to turn three million printed pages of the record of Parliamentary proceedings into digitised text. Some is already online, although the project has not yet been officially approved as a version of Hansard.

Edward Wood, Parliament's director of information management, said the department has sliced up original bound copies of Hansard to obtain the pages for scanning – adding that such books are commonly available, as many libraries are selling them.

"For me, it symbolised opening up the data," he told Kable's Electronic Document and Records Management conference.

Wood said the main aim was to avoid expensive conservation work on the printed versions of Hansard used by Parliament's members and staff, but the project should also allow better searching and reduce storage costs.

The process compares the results of three OCR scans, and a contractor then proofreads 100 per cent of the output. Parliament also proofreads one per cent itself to check the quality of the work. Wood said although the likes of Google and Microsoft have digitised some of Hansard as part of other projects, their work "is not particularly good, on the whole – there's very little metadata".
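The article does not say how the three OCR scans are compared. One plausible approach is a word-by-word majority vote, with any disagreement flagged for the proofreader. The Python sketch below illustrates that idea only: the function name `vote_on_tokens` and the sample sentences are hypothetical, and it assumes all three scans split into the same number of words, which a real pipeline would have to ensure with an alignment step.

```python
from collections import Counter


def vote_on_tokens(scans):
    """Merge several OCR readings of one line by per-position majority vote.

    Hypothetical illustration, not Parliament's actual process.
    Assumes every scan splits into the same number of whitespace-separated
    tokens; real OCR output would need aligning first.
    Returns the merged text and the positions where the scans disagreed.
    """
    token_lists = [scan.split() for scan in scans]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("scans must contain the same number of tokens")

    merged = []
    flagged = []  # token positions a human proofreader should check
    for position, candidates in enumerate(zip(*token_lists)):
        # most_common(1) returns the token with the highest count;
        # on a tie it returns the one seen first (the earliest scan).
        token, votes = Counter(candidates).most_common(1)[0]
        merged.append(token)
        if votes < len(candidates):
            flagged.append(position)
    return " ".join(merged), flagged


if __name__ == "__main__":
    # Invented sample readings with typical OCR confusions (u/n, m/rn).
    readings = [
        "The House met at noon",
        "The Honse met at noon",
        "The House rnet at noon",
    ]
    text, to_check = vote_on_tokens(readings)
    print(text)
    print(to_check)
```

Voting resolves the cases where only one scan is wrong, while the flagged positions show where a human check is most needed. That is consistent with, though not confirmed by, the article's description of full proofreading by a contractor.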

Robert Brook, a developer working on the project, said the system aims to provide excellent metadata, with material linked by bill, MP, constituency and even monarch. "Previously, we've treated the web as somewhere data goes to die," he said, but the aim of this project is to open it to numerous uses.

This has been evident in the eclecticism of searches made by users so far, Brook added. "I expected them to look for Tony Blair and Iraq," he said, but instead popular searches have included Telic, the code name for Britain's operations in Iraq, asbestos use in playgrounds, and Corsham's military communications centre.

Around 95 per cent of searches come through Google. "No one uses our search engine, which is really galling," said Brook. But he added that this means people are finding the nascent system as part of general search, rather than specifically looking for Hansard.

Brook said the system, which relies entirely on open source software and uses open data standards to allow reuse and mash-ups on other websites, will add another decade's worth of material in the next month. If it wins approval, it will eventually "get a portcullis on top", he said, and be adopted as an official archive of Hansard.

This article was originally published at Kablenet.