Linux data-sharing licences: So, will big data hogs take the plunge?
Experts weigh in
With its new open data licensing framework, announced on Tuesday, the Linux Foundation has created legal agreements for sharing raw, unorganised data, in the hope of tempting generous companies, nonprofits, government agencies and researchers to release theirs.
But one expert says the licences' current ambiguity makes them risky, and others are concerned about licensing compatibility issues.
Mike Dolan, the Linux Foundation's VP of strategic programs – who helped draft the licences – told El Reg that individuals or organisations working on machine learning, traffic flow or other data-heavy systems could gain a lot from sharing, such as improving algorithms and increasing resources.
But today (excluding sensitive data covered by law), you either keep your raw data a trade secret or release it with no IP restrictions, said Estelle Derclaye, an IP lawyer at the University of Nottingham. There are already comprehensive licence agreements for sharing and attributing data organised in a database (such as CC-BY, the Open Data Commons Open Database License, or the Open Data Commons Attribution License).
When Derclaye reviewed one of the two new licence agreements at The Reg's request, she told us: "I wouldn't want to sign it."
Why a new licence?
Dolan said the aim was "to ensure that data providers and users had clarity about their ability to curate, use, and share" in order to enable "the creation of open, collaborative data, collaborative data communities". Drafting began during the third quarter of 2016 because of a perceived gap in one-shop licence agreements.
He gave the example of training a drone to fly autonomously: what if a dataset didn't include any examples of trees, a user trained their drone on the data, and it crashed into one? Whose fault would that be?
One licence agreement requires that changes to data be shared. There's also a permissive choice, which Dolan expects to be the most popular because of the lower legal approval legwork.
Neither version of the Community Data License Agreement (CDLA) puts any restrictions on results produced by processing and analysing the data.
Dolan said that "well over 100 lawyers" had reviewed the agreements and that the licences take into account differences between countries. Nevertheless, the framework is open to iteration. The team is opening a mailing list to facilitate public feedback and will "monitor" discussion.
Whose data is it anyway?
Daniel Himmelstein, a data science postdoctoral researcher at the University of Pennsylvania, told The Register: "Until recently, there was little awareness that data licensing could be an issue. This is an exciting development since it reflects that major players are now considering the importance. If more people feel comfortable releasing data openly because of these licences, then that's a win."
However, he was uncertain about the benefits of having a more "data-focused" licence agreement compared to Creative Commons licences. "I will likely continue to release most of my data under a CC0 public domain dedication," he said.
Dolan responded: "We did not draft the CDLA agreements to cure or fix any specific issues with other licences, but rather to look at what the current use cases required and build an agreement from that point of view based on what we've learned in open-source software licensing.
"All that we say is if there are attribution notices, you cannot remove them... In many jurisdictions there can be severe penalties for removing attribution notices so we wanted to prevent that from happening."
Derclaye, the author of The Legal Protection of Databases, said she could understand why the Linux Foundation had created these restrictive licence agreements for raw data, saying they'd be an incentive for organisations to disclose it. At the same time, she thinks it "wouldn't have been much work" to modify the existing Creative Commons licences, such as the commonly used CC-BY, to accommodate raw data, instead of creating something from scratch.
Room to improve
What the Linux Foundation ended up writing is "too vague" and "might create problems", Derclaye said. She argued that:
- Unlike CC-BY, the sharing licence agreement does not explicitly state that the data is royalty-free. The licensee would need to check with the Linux Foundation.
- The licence does not include language for removing technological protection measures, such as encryption or other anti-copy tech. (Dolan claims the licence does have this, though "we made it even more broad than just technological protection measures").
- The agreement does not explicitly state that the licensee can sub-license the agreement to other parties without the Linux Foundation's approval. This might come up if a PhD student switched labs and wanted to sub-license data to their new boss. (Dolan said: "Everyone gets a licence to use, modify and distribute it to anyone under the licence they're all agreeing to use – the CDLA").
- CC-BY has language explicitly allowing for existing fair use laws in the US and exception laws in other countries, whereas the Linux Foundation licence does not touch on them. (Dolan responded that open-source software licences routinely don't explicitly reference such exceptions and that they would be dealt with by "applicable law").
- The licence adds explicit language stating that the data will not be considered a work of joint authorship – but the actual definition is unclear.
- The licence gives contradictory, confusing advice regarding moral obligations and attribution. CC-BY is clearer.
The database law prof said it's better to be clear, even if an agreement is more restrictive than it would be otherwise. Because of the vagueness, she added, if you're using a Linux Foundation agreement in a shared resource with other data under a different licence, there could be conflicts. "If they really want it to be useful, it's good to be aligned as possible."
Leigh Dodds, of the UK's Open Data Institute, told The Register: "Clear licensing, which gives anyone the permission to access, use and share data, is fundamental to the open data movement.
"While we welcome the efforts of the Linux Foundation, we are not yet clear on what these new licences bring to the ecosystem. Users need to understand how these new licences are compatible with, or different from, existing Creative Commons licences (especially CC-BY 4.0 and CC-BY-SA 4.0), and whether they allow for relicensing."
The org will "continue to recommend use of CC-BY 4.0 as it is already well adopted internationally". ®