Public genome databases can leak identity

Anonymity only goes so far

Public genome data poses a significant privacy risk to individuals, according to research led by Yaniv Erlich, a geneticist at the Whitehead Institute for Biomedical Research.

The team that Erlich led was able to de-anonymise genome data using only public information and careful Internet searches. A little chillingly, individuals could be associated with patrilineal genetic characteristics even if they weren’t in the databases themselves. A family member’s presence in a database can be enough, if they’re related in the male line and carry the same surname.

Working with data published in two public genomic databases, Ysearch and SMGF, Erlich demonstrated the privacy risk by matching chromosome data with 50 individuals, in a paper published in Science (abstract here, full paper available free with registration).

Among the genome data recorded in the databases are genetic markers called “short tandem repeats” (for which genetic science hasn’t yet identified a specific purpose), which are passed down the male line.

As the paper notes, it had been assumed that listing surnames in the databases didn’t place individual identity at risk, since surnames “could match thousands of individuals”. However, the genome data has become a genealogy tool as well, in databases such as YBase.

DNA sequencing pioneer Dr Craig Venter volunteered as a test subject in the research. With only the relevant DNA sequence, Dr Venter’s age, and the US state where he lives, Erlich was able to retrieve just two possible records – one of which was Dr Venter.

With a known surname, the searches become even more accurate: “Combining the recovered surname with additional demographic data can narrow down the identity of the sample originator to just a few individuals,” Erlich states in the paper.
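To make the two-step idea concrete, here is a minimal Python sketch under heavy assumptions. It is not the method used in the paper: the marker values, names, records, the `infer_surname` and `narrow_candidates` helpers, and the simple "count the matching markers" rule are all invented for illustration. It only shows how a recovered surname, combined with age and state, can shrink a pool of candidates.

```python
# Toy illustration only: every marker value, name, and record below is invented.

# Hypothetical genealogy database entries: (surname, {marker: repeat count}).
GENEALOGY_DB = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 22, "DYS393": 12}),
    ("Brown", {"DYS19": 14, "DYS390": 23, "DYS393": 13}),
]

# Hypothetical public records: (full name, surname, birth year, US state).
PUBLIC_RECORDS = [
    ("Alan Smith", "Smith", 1950, "CA"),
    ("Bob Smith", "Smith", 1950, "NY"),
    ("Carl Smith", "Smith", 1972, "CA"),
    ("Dan Jones", "Jones", 1950, "CA"),
]


def infer_surname(profile, database, min_matches=3):
    """Return the surname of the entry sharing the most marker values with
    the profile, or None if no entry reaches min_matches."""
    best_surname, best_score = None, 0
    for surname, markers in database:
        # Count markers whose repeat count is identical in both profiles.
        score = sum(1 for marker, repeats in profile.items()
                    if markers.get(marker) == repeats)
        if score > best_score:
            best_surname, best_score = surname, score
    return best_surname if best_score >= min_matches else None


def narrow_candidates(records, surname, birth_year, state):
    """Keep only records that agree on surname, birth year, and state."""
    return [name for name, rec_surname, rec_year, rec_state in records
            if (rec_surname, rec_year, rec_state) == (surname, birth_year, state)]


# An "anonymous" sample: marker values plus two demographic facts.
sample = {"DYS19": 14, "DYS390": 24, "DYS393": 13}
surname = infer_surname(sample, GENEALOGY_DB)
candidates = narrow_candidates(PUBLIC_RECORDS, surname, 1950, "CA")
print(surname, candidates)
```

In this made-up data, three people share the recovered surname, but only one also matches the birth year and state. Real marker comparison would need to tolerate mutations and partial profiles, which this sketch ignores.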

“Surname inference from personal genomes puts the privacy of current de-identified public data sets at risk”, it continues.

“In five surname recovery cases, we fully identified the CEU* individuals and their entire families with very high probabilities … data release, even of a few markers, from one person can spread through deep genealogical ties and lead to the identification of another person who might have no acquaintance with the person who released his genetic data”.

*CEU refers to a particular genetic dataset: “multigenerational families of northern and western European ancestry in Utah who had originally had their samples collected by CEPH (Centre d’Etude du Polymorphisme Humain)”. ®
