Open Data as a Moving Target: What Does it Take to Allow Reuse?

By Irene Pasquetto

As we all know too well, making all scientific data technically and legally accessible to all researchers is an ambitious task complicated by constantly evolving social and technical barriers. It is fair to say that we are making progress in this direction. At SciDataCon 2016, we examined several concrete solutions that can facilitate openness of scientific data or, if you prefer, make sure data are FAIR (findable, accessible, interoperable and reusable).

However, it seems that the more we learn about how to make data open, the less we know about how exactly data will be reused by the scientific community, that is, by the researchers who generated the data and should have a primary interest in accessing them. Very few empirical studies exist on the extent to which open data are used and reused once deposited in open repositories.

The crux of the problem is that data take many forms, and are produced, managed and used by diverse communities for different purposes. Moreover, different stakeholders (publishers, data curators, digital librarians, funders, scientists etc.) hold competing points of view on the kinds of policies, values, and infrastructural solutions necessary to make data open. During a session moderated by Christine Borgman (UCLA) and titled “How, When, and Why are Data Open? Competing Perspectives on Open Data in Science”, Matthew Mayernik (National Center for Atmospheric Research), Mark Parsons (National Snow and Ice Data Center) and Irene Pasquetto (UCLA – Center for Knowledge Infrastructures) presented on some of the challenges that make the use and reuse of “open data” such a complicated and heterogeneous process.

Mayernik argued that the integration of the Internet into research institutions has changed the kinds of accountabilities that apply to research data. On the one hand, open data policies expect researchers to be accountable for creating data and metadata that support data sharing and reuse in a broad sense, in many cases by any possible digital user in the world. On the other hand, providing accounts of data practices that satisfy every possible user is in most cases impossible.

In his talk, Parsons effectively showed that data access is an ongoing process, not a one-time event. Parsons and his team examined how data repository products and their curation have evolved over several decades in response to environmental events and increasing scientific and public demand. The products have evolved in conjunction with the needs of a changing and expanding designated user community. In other words, Parsons’ case study shows that it is difficult to predict the users of a data service because new and unexpected audiences (with specific needs) could emerge at any time. Parsons also argued that, for this reason, “data generators” may not be the best individuals to predict future uses of their own data.

Because open data users change over time, it is also necessary to build open repositories that provide data in formats flexible enough to allow different approaches to data analysis and integration, for different audiences. This was the point made by Pasquetto, whose case study is a consortium for data sharing in craniofacial research, with a focus on the subfield of developmental/evolutionary biology that recently adopted genomics approaches to knowledge discovery. Pasquetto found flexible data integration to be a necessary precursor to using and reusing data. “Data integration work” is the most contested and problematic task faced by the community, because data need to be integrated at two or more levels, and these levels require extensive collaboration between engineers, biologists, and bioinformaticians.

Borgman also presented a paper on behalf of Ashley Sands, who recently graduated from the Department of Information Studies at UCLA and is now senior program officer at the Institute of Museum and Library Services in Washington DC. This talk examined characteristics of openness in the collection, dissemination, and reuse of data in two astronomy sky survey case studies: the Sloan Digital Sky Survey (SDSS) and the Large Synoptic Survey Telescope (LSST). Discussion included how the SDSS and LSST data, and datasets derived from the projects by end users, become available for reuse. Sands found that the rate at which data are released, the populations to which the data are made open, the length of time data creators plan to make the data available, the scale at which these endeavors take place, and the stages of these two projects all have a great impact on the extent to which data are then reused.

Moral of the story: open data is a fast-moving target. In order to enable reuse, data repositories had better start running.