Mining Menace: Could New EU Rules Put Data in Danger?

Endangered Data Week 2018

A year ago, the focus of Endangered Data Week was concern around the deletion of environmental data by the then-new US administration. This year, it’s not only politics but also copyright that poses a threat.


The worry comes from Europe this time, where debates are ongoing about a copyright reform that includes provisions on text and data mining (TDM) – the automated analysis of materials. This technique is already supporting journalism, smart cities and start-ups. In particular, it is supporting research, including publicly funded research, to discover new trends, connections, and treatments for disease. IFLA underlined its belief in the potential of TDM by signing the Hague Declaration.


To work, TDM frequently involves creating a new version of the work, in a format that can be ‘read’ by a computer (such as XML). The result is a dataset – the ‘raw material’ for mining. This dataset can be seen as a ‘copy’ of the original work, and making it could therefore be considered to require an exception to copyright.
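For readers unfamiliar with the technique, the conversion step can be illustrated with a minimal Python sketch. It is an illustration only, not a description of any particular mining tool: the function names `to_xml` and `term_counts`, the `<document>` and `<p>` element names, and the sample text are all assumptions made for this example. It wraps plain text in a simple XML structure (the 'copy'), then runs a very basic form of mining, counting words, on that structured version.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def to_xml(doc_id, text):
    """Wrap plain text in a minimal XML structure, one <p> per paragraph.

    The element names are hypothetical; real projects use richer schemas.
    """
    root = ET.Element("document", attrib={"id": doc_id})
    for para in text.split("\n\n"):
        para = para.strip()
        if para:
            ET.SubElement(root, "p").text = para
    return ET.tostring(root, encoding="unicode")


def term_counts(xml_string):
    """Count lower-cased words across all <p> elements of the dataset."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    for p in root.iter("p"):
        counts.update(re.findall(r"[a-z]+", (p.text or "").lower()))
    return counts


# Invented sample text standing in for a lawfully accessed work.
sample = "Data mining supports research.\n\nResearch needs data."

dataset = to_xml("example-1", sample)  # the machine-readable 'copy'
counts = term_counts(dataset)          # the mining step itself
```

The point of the sketch is that the dataset is a separate, structured artefact derived from the original. It is this artefact that would need to be kept if the word counts were ever to be checked or reused.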


From the Start, Progress Needed

Expressing a desire to promote TDM, the European Commission therefore included a copyright exception for this in its proposals in September 2016. On the positive side, the exception was mandatory and protected from override by contract terms or technological protection measures.


However, it also left a lot of room for rightholders to use security as an excuse to restrict access, and limited the new exception to research institutions. This last provision effectively implied that all other mining taking place without a licence was illegal.


Libraries argued that this would create complexity – once there is legal access, there should be a right to mine. Restrictions on this right should be kept to a minimum, given that they undermine users’ rights. We have seen some progress in some of the opinions submitted by European Parliament committees, and sympathy for our arguments across the board.


A New Menace

However, in recent months a new idea has appeared – that after use, the datasets created for TDM should be deleted. The argument is that such copies could feed piracy, for example if the dataset finds its way onto the Internet by accident, by design, or by hacking.


This proposal is not only irrational, but potentially highly damaging to research.


Irrational, because the structured dataset does not compete with the original. Irrational, because there is no particular reason why such datasets are more at risk of piracy than any other copy of a work. And irrational, because when it comes to mining openly available resources, it is easy to find the original online or elsewhere.


It is damaging, because being able to reproduce the results of an experiment is a key test of its validity. If the dataset used to conduct the experiment is destroyed, there is no way of doing this. Damaging, because days or months of work may have gone into creating this dataset, just for it to be wiped away, and with it the value of the investment by the researchers or other miners. And damaging, because it means that the same dataset cannot be used in future experiments by researchers with legal access to it.


A Need for Vigilance

The fears and misconceptions that seem to be driving efforts to delete the datasets created for TDM in Europe are both illogical and potentially very harmful, in particular when the datasets are held by publicly funded institutions. Nonetheless, the threat is there. Fortunately, we have an opportunity to stop this danger before it materialises, by underlining the error that such a provision would represent.