Oi, drag this creaking, 217-year-old UK census into the data-driven age
Challenge accepted, ONS bod Becky Tinsley tells El Reg
The UK's Office for National Statistics is under pressure. Every decade since 1801, it has carried out one of the world's most comprehensive statistical undertakings, the census. Now, it has until 2021 to prove it can do so without the massive surveys it still relies upon.
The population survey, which was sent out to 25 million households at its last outing in 2011, is seen as one of the UK's most valuable assets.
But back in 2010, under the banner of efficiency, the government decided things needed to change, pushing for the ONS to ditch the decennial mega survey for something more hi-tech (and low cost).
The ONS narrowed down the options on the table to either conducting the census online, or using administrative data – the information collected by government every time people interact with it, like claiming benefits or paying tax – and then backing this up with smaller, annual surveys.
Ostensibly the aim was to improve the outputs by making better use of new technologies for gathering and analysing data, while using data that would keep the statistics more up to date.
But upping the use of government data sets – which the powers that be will see as free since they already exist – and scrapping the large-scale and costly household survey for an online poll would also save a buck or two.
The planned changes were not without controversy. During the consultation period, social scientists were up in arms about the plans, arguing the new methods wouldn't be able to provide the same level of detail as the traditional census, especially at the small-area scale.
In the end, the ONS recommended a hybrid model with a full online census from 2021 onwards, combined with increased use of government data.
Although the government accepted this in the short term, it restated its ambition to ditch the census after that, setting the ONS a challenge: invest in administrative data research and in 2021 compare those outputs with the results of the full census, to prove to users that such data can be trusted on its own.
And so, since 2014, the ONS's administrative data census team has been on a mission to prove they can gather the disparate bits of information the government collects and piece it together into something as rich as the traditional census.
Changing the census
"It is a really complicated area and it's a big, complex challenge for us," says Becky Tinsley, the person tasked with leading the ONS's efforts to push the census into the data-driven age.
But, she says, the benefits should make the work worthwhile – for instance, it means residents won't have to fill in the survey, and it keeps the stats more up to date than the decennial census would. Using administrative data also creates new measures that the census doesn't cover.
"We don't currently ask a question about income on the census,"* says Tinsley. "But we've gained access to data from the Department for Work and Pensions and HMRC about PAYE, people's earnings, tax and benefits records, so we've started to produce admin-based income estimates down to quite small areas."
Tinsley says working with departments like this, to make them "understand why it's important to share that data", has been a major part of the team's initial focus. That's unsurprising, since the success of the project lives and dies on the ONS's ability to get hold of the data that's spread across government's infamous data silos.
The Digital Economy Act, which received royal assent last year, should make data sharing within government easier – it also includes particular provisions for the delivery of national statistics – but it isn't fully operational yet.
Anyway, says Tinsley, even though "it's good to have legislation, you still need to be able to have those conversations and build those relationships with different data providers".
As well as DWP and HMRC, the Department for Education has dipped its toe in the data-sharing water, while the Driver and Vehicle Licensing Agency and Higher Education Statistics Agency are two other obvious targets.
Tinsley adds that data from the Department for Business, Energy and Industrial Strategy could allow the ONS to develop new stats on fuel poverty, while Home Office information might be able to help it improve the UK's much maligned migration statistics.
But it's not just government departments that the ONS is courting. Back in November, it worked with network operator Vodafone to snaffle up data from London commuters' mobile phones and map their travel.
Location and timestamp data were collected, with home locations identified from where the phone was during the night or when it was switched on in the morning, while workplaces were based on where it was during standard working hours, as long as the user had been there more than twice a week.
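As a rough illustration of the rule described above, here is a minimal Python sketch. The hour ranges, the `(timestamp, cell)` input format and the function names are all assumptions made for the example, not details of the ONS or Vodafone method, and it reads "twice a week" as days on which the phone was seen in the cell.

```python
from collections import Counter, defaultdict
from datetime import datetime

NIGHT_HOURS = range(0, 6)   # assumption: "night" means 00:00 to 05:59
WORK_HOURS = range(9, 17)   # assumption: "working hours" means 09:00 to 16:59
MIN_WORK_DAYS = 3           # "more than twice a week"


def infer_home(pings):
    """Return the cell seen most often during night hours, or None."""
    night = Counter(cell for ts, cell in pings if ts.hour in NIGHT_HOURS)
    return night.most_common(1)[0][0] if night else None


def infer_work(pings):
    """Return a cell seen in working hours on 3+ weekdays of one week, or None."""
    days_seen = defaultdict(set)  # (iso year, iso week, cell) -> set of dates
    for ts, cell in pings:
        if ts.weekday() < 5 and ts.hour in WORK_HOURS:
            year, week, _ = ts.isocalendar()
            days_seen[(year, week, cell)].add(ts.date())

    qualifying = Counter()
    for (_, _, cell), days in days_seen.items():
        if len(days) >= MIN_WORK_DAYS:
            qualifying[cell] += len(days)
    return qualifying.most_common(1)[0][0] if qualifying else None


# Hypothetical pings for one phone in the week starting Monday 6 November 2017
pings = [(datetime(2017, 11, d, 2, 0), "A") for d in (6, 7, 8, 9)]    # nights
pings += [(datetime(2017, 11, d, 10, 0), "B") for d in (6, 7, 8)]     # Mon to Wed
pings += [(datetime(2017, 11, 9, 10, 0), "C")]                        # Thu only
home, work = infer_home(pings), infer_work(pings)
```

A rule of this shape only sees where a phone spends its weekdays, so it has no way of telling an office from a lecture theatre.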
As this was a pilot project, the team ran into some technical problems that still need to be ironed out. The method was thought to be wrongly identifying many people, such as students, as workers.
But the mobile phone research also sparked a low-level public outrage over government snooping.
Tinsley is quick to point out, though, that all the data the ONS received was aggregated and anonymised by the telco. "We were interested in looking at the patterns of groups of mobile phones – we weren't really interested in individuals' movements," she said.
The same is broadly true for government data, although it is anonymised by a separate team at the ONS, rather than arriving anonymised and in aggregate.
"When we receive the data, we have a completely separate team who process [and anonymise] that data, one source at a time. We never bring together different sources of data until it's been anonymised," Tinsley says.
"Once it goes through the anonymisation process, at that point our researchers have access to it, and are able to link the data together to produce the aggregate insights into the population."
Tinsley says that linking the anonymised data at the individual level is crucial for users of ONS data, who might be local or national authorities using it for policymaking and service delivery, or academics for research.
"One of the strong views from our users is that they don't really just want to see something about people's ethnicity, for example, they want to understand how people's ethnicity might impact maybe on health conditions, so that they can deliver the better services for the population," Tinsley says.
"That's why it's important we bring together that data at the record level, with the range of different sources – but it's only ever using anonymised data."
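The article doesn't say how the ONS anonymises or links its records, so the following Python sketch shows only one common approach, as an assumption: each source has its identifier replaced with a keyed hash (HMAC-SHA256) on its own, and only the tokenised records are then joined. The field names, the key and the function names are hypothetical. Strictly speaking this is pseudonymisation, and a real system would need far more care over key handling and re-identification risk.

```python
import hashlib
import hmac


def pseudonymise(records, id_field, key):
    """Replace the identifier in each record with a keyed hash token."""
    out = []
    for rec in records:
        token = hmac.new(key, rec[id_field].encode("utf-8"), hashlib.sha256).hexdigest()
        cleaned = {k: v for k, v in rec.items() if k != id_field}
        cleaned["token"] = token
        out.append(cleaned)
    return out


def link(source_a, source_b):
    """Join two already-tokenised sources on their shared token."""
    by_token = {rec["token"]: rec for rec in source_b}
    return [
        {**rec, **by_token[rec["token"]]}
        for rec in source_a
        if rec["token"] in by_token
    ]


# Hypothetical data: the key is held by the processing team, not the researchers
key = b"example-key-held-by-processing-team"
tax = [{"person_id": "P1", "earnings": 30000}, {"person_id": "P2", "earnings": 20000}]
health = [{"person_id": "P1", "condition": "asthma"}]

# Each source is tokenised separately; only then are the two brought together
linked = link(pseudonymise(tax, "person_id", key), pseudonymise(health, "person_id", key))
```

Because the same key produces the same token for the same person in every source, records can be matched at the individual level without the researcher ever seeing the original identifier.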
However, one of the major tests of the project will not be whether it can produce exciting new stats, but whether those users are satisfied that the administrative data census is as good as the traditional one. And that might involve some diplomacy, especially when some of the definitions the ONS has used for decades might have to change.
"One of the challenges with administrative data is that, because it's not collected for statistical purposes, some of the definitions are different to the questions we ask through the surveys," says Tinsley.
One such example is the concept of a household. In surveys, people are asked to think about who they share facilities with, but administrative data will simply cough up an address, which the ONS will have to use to put together a "household".
This might cause an issue, says Tinsley, because "there might be several families who live in one address together – and may define themselves as separate households on a survey – but we would capture them as just one group of people living together".
The ONS is working with users to assess the impact such changes might have – Tinsley estimates that in this situation it would affect about 0.5 per cent of people from the 2011 census. But she also notes that many other countries that have moved to this sort of census now use definitions based on addresses.
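To make the definitional shift concrete, here is a small Python sketch of an address-based household, using made-up records and a hypothetical function name. It is an illustration of the idea only, not the ONS's method.

```python
from collections import defaultdict


def households_by_address(people):
    """Group people into 'households' purely by the address on record."""
    groups = defaultdict(list)
    for person in people:
        groups[person["address"]].append(person["name"])
    return dict(groups)


# Hypothetical records: two families who would call themselves separate households
people = [
    {"name": "Ann Smith", "address": "1 High St"},
    {"name": "Bob Smith", "address": "1 High St"},
    {"name": "Cy Jones", "address": "1 High St"},
    {"name": "Di Jones", "address": "1 High St"},
    {"name": "Ed Brown", "address": "2 High St"},
]
households = households_by_address(people)
```

On a survey the Smiths and Joneses might report two households; grouped by address they come out as one group of four.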
Adding smaller surveys
Meanwhile, a greater – and perhaps more straightforward – success has been in population size, which the ONS has already mapped onto what it calls the output area level: very small areas of, on average, 300 people.
But even this most advanced area is not without challenges. "One in particular is that people aren't required to de-register from services – if you move you're not required to tell your GP, for example," Tinsley says. "So we see quite a lot of over-coverage in some of those data sources. That tends to be around working age men, where we think they've moved abroad but we still find them in the data."
This is also the opposite of the situation in the existing census, which tends towards under-coverage. The way the ONS currently deals with these sorts of anomalies is to figure out which areas are less likely to respond (the "hard to count" index) and target more surveys at those areas than others.
For the administrative data census, it might work in a similar way, says Tinsley. The ONS is already planning to supplement the admin data with surveys, so "we would perhaps put more of [the] surveys in those areas" where errors might occur.
Missing data sources
At the other end of the spectrum are population characteristics like ethnicity and religion, which Tinsley says are not very well captured in much of the administrative data the ONS has access to.
For instance, the DfE collects detailed data on the ethnicity of schoolchildren within its school census, but this is "obviously only a small portion of the population" – the rest might have to be assessed through surveys.
Religion, meanwhile, could possibly be captured by proxy. "We're having conversations to find out if there are other sources of information that could at least correlate quite closely with it – so it might represent religion even if it doesn't capture religion."
The fallback option would be to use surveys, but Tinsley says this "is a bit limiting, in that – for the size of the survey we'd be looking at – we could probably produce good quality estimates down to local authority level" only.
Again, she says this is why it's crucial to have constant conversations with the users about what might be acceptable.
But time is of the essence. Although the major milestone – where the team's work will effectively be marked against the mostly online 2021 census – seems far away, if the government wants to switch to the alternative system after that, the smaller population surveys need to be operational from 2020.
And with initial results from the first batch of field tests not due to be published until the end of the year, the pressure doesn't look like it'll be easing off. ®
* In 2007, the census team looked into having an income question but found that it would have had too detrimental an effect on the number of people who returned the survey. Although returning the census is mandatory, it's worth noting that – just like paying taxes – plenty of people decide not to do it.