Dec 2018
The field of research data and associated services is in a rapid and epoch-making phase transition from a data-sparse to a data-overloaded ecosystem. Many national and international efforts are underway to deal with the enormous challenges posed by instrumentation and automation and the associated explosion in the volume and complexity of data. We all try to keep pace with this phenomenon by deploying the analytical processes and tools needed to enable data-intensive science, supported by machines. For high-throughput data-generation instruments and computers to effectively support the scientific and innovation process, both data and workflow components need to be machine-actionable. Building on and refining many earlier efforts, the FAIR principles were formulated in 2014. These principles recommend that data (and the services around them) should be Findable, Accessible, Interoperable and (thus) Reusable, first and foremost by machines.
In 21st-century science, computers need to be fully enabled to do the hard work of processing, pattern identification and machine learning across enormous amounts of heterogeneous, distributed data. Human researchers, and the science system as a whole, will benefit from machine-actionable data, as less time will be spent on data munging. When data are stewarded and processed properly, ambiguity and non-reproducibility also become less of a problem. In addition, many datasets and resources are now too large, too privacy-sensitive, or both, to be effectively routed around the globe for multidisciplinary and data-intensive science projects. Distributed machine learning therefore points to a new paradigm that I refer to as 'data visiting': rather than moving the data to the analysis, as in the classical model of 'data sharing', the analysis travels to the data.
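To make the distinction concrete, here is a minimal sketch of the 'data visiting' pattern in Python. Everything in it (the DataStation class, the visit method, the example records) is invented for illustration and does not refer to any existing federated-analysis framework; the point is simply that the algorithm travels to the data and only aggregate results travel back.

```python
from statistics import mean
from typing import Any, Callable

class DataStation:
    """Hypothetical data station: holds sensitive records locally and
    executes visiting analyses, returning only aggregate results."""

    def __init__(self, records: list[dict]):
        self._records = records  # the raw records never leave this station

    def visit(self, analysis: Callable[[list[dict]], Any]) -> Any:
        # 'Data visiting': the algorithm travels to the data, and only
        # the (aggregate) result of the analysis travels back.
        return analysis(self._records)

# Two stations in different jurisdictions, each holding local patient data.
station_eu = DataStation([{"age": 34}, {"age": 51}, {"age": 47}])
station_us = DataStation([{"age": 29}, {"age": 62}])

def mean_age(records: list[dict]) -> float:
    """The visiting analysis: computes an aggregate, never exposes records."""
    return mean(r["age"] for r in records)

# The same analysis visits each station; only the means travel back.
local_means = [station.visit(mean_age) for station in (station_eu, station_us)]
print(f"Per-station mean ages: {local_means}")
```

In a real federated-learning setting, the visiting analysis would be a model-update step, and each station would return gradients or model parameters, never the underlying records.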
These rapid changes have in significant respects taken science by surprise, and many groups and infrastructures have great difficulty adapting to this revolutionary new way of doing science. Rather than excellence in silos, with scholarly communication designed mainly for person-to-person information and knowledge transfer, we now need excellence across silos. We need to conceive of the underpinning ecosystem as, in essence, one computer with one universal dataset. Workflows dealing with data, and the data themselves, are being reused over and over and need to be fully interoperable, reusable and reproducible. In particular, when we address the major challenges facing our planet, as laid out in the UN Sustainable Development Goals, the data needed to gain the necessary insights come from many different domains and are frequently not purposefully generated for research. For an 'Internet of FAIR Data and Services' to emerge and flourish, all digital resources should be intrinsically FAIR and processable outside the environments and systems in which they were created. In other words, they need to be universally reusable. The good news is that computers can translate FAIR digital resources from one format to another with high speed and minimal error rates, as long as the machine has enough information about the resource. Another way of expressing the objective of FAIR is that when a resource is FAIR, machines know what it means. In essence, the machine can answer three major questions for each FAIR digital object or resource it encounters (see the sketch after this list):
- What is this?
- What operations can be performed on it?
- What operations are allowed?
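As an illustration only, and not a reference to any standard implementation, a FAIR digital object can be thought of as a payload bundled with machine-actionable metadata that answers exactly these three questions. In the Python sketch below, every class name, field name and URI (FairDigitalObject, type_uri, the example ontology term, and so on) is invented for the example:

```python
from dataclasses import dataclass

@dataclass
class FairDigitalObject:
    """Illustrative FAIR digital object: a payload plus machine-actionable
    metadata. All names and URIs here are invented for this example."""
    identifier: str        # globally unique, persistent identifier
    type_uri: str          # answers: "What is this?"
    operations: list[str]  # answers: "What operations can be performed on it?"
    licence_uri: str       # answers: "What operations are allowed?"
    payload: bytes = b""   # the data itself

    def describe(self) -> dict:
        # A machine encountering this object can resolve all three
        # questions from metadata alone, without human intervention.
        return {
            "what_is_this": self.type_uri,
            "possible_operations": self.operations,
            "allowed_under": self.licence_uri,
        }

# A hypothetical gene-expression dataset published as a FAIR digital object.
fdo = FairDigitalObject(
    identifier="https://example.org/fdo/12345",
    type_uri="https://example.org/ontology/GeneExpressionProfile",
    operations=["download", "normalise", "aggregate"],
    licence_uri="https://creativecommons.org/licenses/by/4.0/",
)
print(fdo.describe())
```

The design choice that matters here is that the answers are expressed as resolvable identifiers rather than free text, which is what makes them actionable by a machine outside the object's original context.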
With properly constructed FAIR digital resources, these questions can be answered, which enables machines (and thus ultimately also humans) to reuse them, with full provenance, outside their original context. Ambitious as this may sound, I am very confident that the current international efforts in this exciting domain will soon yield the first scalable ecosystems that follow these principles, and major industries are already moving into this space as well. So be warned: the coming four years will not be 'science as usual'!
CODATA has been around for roughly 50 years: it has lived through the data-sparse times and now operates in the data-rich era, which poses entirely different and daunting challenges, for CODATA itself as well. As a committee of the International Science Council (ISC), supporting ISC's mission as the global voice of science and its role in the UN system, CODATA has the responsibility to fill a specific and strategic niche in the global ecosystem of research-data-related activities. Many other organisations have complementary roles: they are domain-specific, national or regional, or they are grassroots and community-based. CODATA is actively engaging with these other international players to define complementary and synergistic roles.
The data-intensive science and innovation challenge is obviously a global one: it should equitably involve all regions of the world, and it cannot be solved sustainably within disciplinary or national silos. That is the niche in which CODATA should operate. CODATA also has a key role to play in involving regions of the world that have traditionally been data- and science-deprived. With the Internet of FAIR Data and Services emerging as we click, we should not widen the digital divide but leapfrog to close it, so that the new research ecosystem is also fair in the traditional sense. Open Science must also mean that no one is left behind. The second bit of good news is that activities in the Global South are already emerging, and some are ambitious enough to lead future developments.
As the CODATA President, I work with the Executive Director, the officers and Executive Committee, and CODATA's core staff to serve this multi-organisational ecosystem in service of the global science community. We also work with regional actors such as the European Commission and the EU Member States on their major leading initiative, the European Open Science Cloud, which has a growing number of partner initiatives in other regions. We build on the excellent work of our predecessors in CODATA, including the intellectual leadership of the past President, Geoffrey Boulton, and work in close collaboration with our parent organisation, the International Science Council.
As of 2017, and extending for the duration of my CODATA presidency, I also serve on the US National Academy of Sciences Board on Research Data and Information. With my election as President of CODATA, I will gradually hand over operational leadership in GO FAIR to others and seek to play an ambassadorial role for both organisations, helping to drive a joint, converging and balanced ecosystem of international policies supporting open, data-driven science. We also work to consolidate and make explicit the key role of each of the internationally operating data organisations, and in particular to bring RDA, GO FAIR, WDS and CODATA even closer together, with clear and complementary mandates. When we lock arms at all levels, from institutional to international, I am optimistic that by the end of my term as President the first phase of the Internet of FAIR Data and Services will be up and running.
For all this to happen, it will be of critical importance that each of the data-supporting organisations is mandated and properly funded (although at the leanest necessary level) to serve the science and innovation communities, without competing for the same funds as the communities they serve. They should focus on those supranational tasks that never make it to the top of the priority list of individual countries, regions, funders, researchers and innovators. In this set of partnerships, it is the CODATA mission to act strategically and globally to advance equitable Open Science and the FAIR ecosystem, and to make data work for interdisciplinary global-challenge research.
Research infrastructures have traditionally been almost an afterthought, or considered other people's problem, and this has resulted in a very dangerous situation: core resources that are massively used by researchers, such as curated databases and collections, mapping and standards services, operate on a shoestring and go through a near-death experience each time project funding runs out. We, as the research community, should collectively speak with one voice on these infrastructural and interoperability issues, as trusted representatives of the real needs of the research community itself and society as a whole, towards policy makers, funders and scientific unions dealing with the enormous data and analytics challenges we will face in the decades to come. It is an honour to be elected as the new President of CODATA, and I hope to serve the community as expected.