Yahoo dumps 13.5TB of users' news interaction data for machine eating

Don't worry whoever Yahoo's 20m users are, it's 'anonymised'

Yahoo! has publicly dumped a sample dataset for machine learning enthusiasts based on "anonymised" user interactions with the news feeds of several of its properties, ostensibly extending the research bridge between industry and academia.

The dataset contains ~110bn lines, coming in at 1.5TB bzipped, which decompresses to a whopping 13.5TB and covers the news item interactions of 20 million users between February and May 2015.
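For a sense of scale, those figures work out to roughly a ninefold compression ratio and about 120 bytes per line. The sketch below checks that arithmetic and shows one way a researcher might count lines in a bzipped file without unpacking it to disk first. It assumes decimal terabytes and a plain line-oriented layout; the function name `summarise_bz2_lines` and the sample records are invented for illustration and are not taken from Yahoo!'s release.

```python
import bz2
import io


def summarise_bz2_lines(stream):
    """Count lines and decompressed bytes in a binary stream of bzip2 data."""
    lines = 0
    size = 0
    # bz2.open accepts a filename or an existing binary file object and
    # decompresses incrementally, so the whole file never sits in memory.
    with bz2.open(stream, "rb") as handle:
        for line in handle:
            lines += 1
            size += len(line)
    return lines, size


# Tiny in-memory stand-in for one archive file; the record layout is invented.
sample = b"user1\tarticle9\tclick\nuser2\tarticle3\tskip\n"
archive = io.BytesIO(bz2.compress(sample))
line_count, byte_count = summarise_bz2_lines(archive)

# Back-of-envelope figures from the article's numbers (decimal TB assumed).
compression_ratio = 13.5 / 1.5          # decompressed size / bzipped size
bytes_per_line = 13.5e12 / 110e9        # decompressed bytes / line count

print(line_count, byte_count)
print(compression_ratio, round(bytes_per_line, 1))
```

At around 120 bytes a line, each record is short, which is consistent with the data being a log of individual interactions rather than full article text.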

"Many academic researchers and data scientists don’t have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies," said Suju Rajan, director of research at Yahoo! Labs.

The data could be used by researchers to "validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods".

"We are releasing this dataset for independent researchers because we value open and collaborative relationships with our academic colleagues," said Rajan, adding the company is "always looking to advance the state-of-the-art in machine learning and recommender systems."

The "anonymised" interaction data is accompanied by demographic information (age segment and gender) and the city in which the user was based. Further, it has all been time-stamped with the user's local time and contains "partial information of the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining".

Yahoo! has also released the title, summary, and key phrases of the news article involved in each of these interactions.

The dump comes from Yahoo! Labs' Webscope programme, which attempts to create a reference library of scientifically-useful datasets for research purposes.

Yahoo! stated that the dataset it is "releasing today is governed by our commitment to safeguard our users' privacy and follows our practice of protecting and anonymising user data."

"Access to datasets of this size is essential to design and develop machine learning algorithms and technology that scales to truly ‘big’ data," according to Professor Gert Lanckriet, of the University of California's Department of Electrical and Computer Engineering. ®
