The Power in Our Collections: Data Ethics and Libraries

Supporting research has long been a core part of libraries’ mission. By collecting and giving access to information, they give scientists and researchers much of the raw material for their work. To use a (double) metaphor, they are the stepladder up to the giants’ shoulders from which Newton claimed to look.

Of course what is done with the information accessed has traditionally been a question left to the individual researcher (in the name of the neutrality of the library[1], and the academic freedom of the researcher) or, for the last few decades, to ethics committees.

Today, research is increasingly likely to draw not on formalised information, as contained in books, articles or other materials, but on data. This poses important questions for the continued effectiveness of ethics frameworks in place, and of course for the libraries who may be supplying the data.

It is therefore worthwhile for libraries, as holders of data, as facilitators of research, and of course as longstanding experts in information ethics, to keep up to date with ongoing discussions on data ethics.

Why a New Approach?

Data is clearly nothing new. Yet technological developments in recent years have given the subject a new profile, with the emergence of data mining, jobs for data scientists and data journalists, and even talk of a data revolution.

Two reasons perhaps underpin this. The first is the amount of information we can now collect. In addition to better record-keeping in general, this is primarily linked to the increasing use we make of digital tools for everyday activities. This use of course leaves a trace of data, often held by the company or other entity offering the service. Moreover, the rise of the idea of user data as currency (something to exchange for one Internet service or another) means that there is a lot more of it flowing around.

The second is what we can do with it. New analytical techniques allow for the processing of far more data than could ever be done by an individual person, and in doing so for spotting links and drawing conclusions. The combination of datasets, and the application of algorithms to process them, offers new and unprecedented possibilities to extract information about ourselves and our world.

Why a New Ethics?

These new possibilities come with risks. There is already a list – a horror show – of some of the dark alleys that data-based research can take us down. Experiments looking at how interfering in people’s Facebook feeds can change their moods, Microsoft’s chatbot Tay, and work indicating that social media data can identify a person’s sexuality have created (deserved) concern.

Nonetheless, as a number of researchers have pointed out, the rules currently in place for research ethics may often miss big data projects. The extensive work of the Council for Big Data, Ethics and Society on the subject offers valuable insights.

In one contribution in particular, they argue that rules for research ethics should not necessarily exempt anonymised or publicly available datasets from controls. On the first point, data techniques mean that anonymisation is only relative. The power of crossing datasets (as mentioned above) makes it possible to pinpoint individuals using a number of different characteristics, often with surprising accuracy.

Linked to this is the (second) concern: that it is possible to do far more with data than many understand. Data may be placed online, under the assumption that it is innocuous and cannot lead to any harm or invasion of privacy. However, thanks to this same crossing of datasets, conclusions can be drawn which go far beyond what was anticipated when the dataset was made available, and which can raise questions about potential for discrimination or other prejudice to individuals or groups.
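To make the point about crossing datasets concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the field names, the records and the reidentify function are assumptions, not taken from any real dataset or from the work cited above. It shows how an "anonymised" set of records can be matched against a public list whenever a combination of ordinary characteristics (here postcode, birth year and gender) points to exactly one person.

```python
from collections import defaultdict

# Hypothetical characteristics that appear in both datasets.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")

# Invented "anonymised" records: no names, but a sensitive attribute.
ANONYMISED = [
    {"postcode": "AB1", "birth_year": 1980, "gender": "F", "topic": "health"},
    {"postcode": "AB1", "birth_year": 1975, "gender": "M", "topic": "law"},
    {"postcode": "CD2", "birth_year": 1990, "gender": "F", "topic": "travel"},
]

# Invented public list: names plus the same ordinary characteristics.
PUBLIC = [
    {"name": "Person A", "postcode": "AB1", "birth_year": 1980, "gender": "F"},
    {"name": "Person B", "postcode": "CD2", "birth_year": 1990, "gender": "F"},
    {"name": "Person C", "postcode": "CD2", "birth_year": 1990, "gender": "F"},
]


def record_key(record):
    """Return the combination of characteristics used to cross the datasets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymised, public):
    """Pair a name with a sensitive attribute when the match is unique."""
    people_by_key = defaultdict(list)
    for person in public:
        people_by_key[record_key(person)].append(person)

    matches = []
    for row in anonymised:
        candidates = people_by_key.get(record_key(row), [])
        # Only a single candidate pinpoints an individual.
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], row["topic"]))
    return matches


if __name__ == "__main__":
    for name, topic in reidentify(ANONYMISED, PUBLIC):
        print(name, "is linked to the topic", topic)
```

In this toy case only the first record is re-identified: the second has no counterpart in the public list, and the third matches two people and so remains ambiguous. Real linkage techniques are far more sophisticated, but the principle is the same: the more characteristics two datasets share, the fewer people each combination can describe.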

Where Next?

It clearly does not help that the level of awareness around the sharing and use of data is not necessarily high, as evidence from the UK indicates. At the same time, tough legislative or regulatory measures risk stifling science. An improved code of ethics would provide a more useful response.

Professor Luciano Floridi and Dr Mariarosaria Taddeo, from the Oxford Internet Institute and Alan Turing Institute respectively, have therefore set out three key fields where ethics need to be developed:

Ethics of Collection: this relates to the way in which data is gathered in the first place, with questions around whether there has been consent from the subject of the data, and whether this is meaningful (i.e. did they really understand it). The importance of steering clear of collection practices that risk leading to discrimination, against either individuals or groups, is also high on the agenda (are there some types of data that should not be collected?).

Ethics of Practices: this refers to the way in which data collected is then used. Key issues, as highlighted above, include privacy and secondary uses (i.e. data being used for activities other than what was originally intended). A fascinating piece in Slate sets out the way in which people dealing with data on homelessness chose not to mention race, given that research results would likely reflect and so replicate a situation characterised by discrimination, rather than reveal anything useful.

Ethics of Algorithms: perhaps the most politically attractive of the new dimensions, this refers to the ethical questions that arise when even the researcher may not understand the results being produced. The responsibility of the algorithm’s creator for the results emerging – and indeed the faith that should be put in the algorithm – are significant questions, given the power with which these tools are credited.

What Issues for Libraries?

As highlighted in the introduction, decisions about the ethics of any particular research path are the job of ethics committees. Yet libraries, as repositories of data, including social media data for example, are key in any broader discussions. Indeed, some of the questions raised around the collection of data pose a direct challenge to libraries in terms of whether they are acquiring material which could lead to an invasion of privacy or prejudice to individuals.

As this is an emerging field, we are some way from a clear idea of what ethical standards for use of data should be. Nonetheless, it is not a question libraries can afford to ignore, given not only their existing experience with information ethics, but also their role in supporting research and innovation.

[1] Note sections 1 and 5 of the IFLA Code of Ethics (2012):

