This is the seventh in the series of short statements from candidates in the forthcoming CODATA Elections at the General Assembly to be held on 9-10 November in Gaborone, Botswana, following International Data Week. Barend Mons is a candidate for the CODATA Executive Committee as CODATA President. He was nominated by the USA.
CODATA: 2018-2022
Barend Mons
Vision: CODATA serving the global community as a global champion of machine-actionable data publishing, according to FAIR principles in a well coordinated ecosystem of global organisations
A phase transition in Science
Both science and innovation are in the process of a methodological transformation. Because of the unprecedented amount of data we deal with, we are in the midst of a significant landslide away from a closed system based on individual privilege, patents and ‘centres of excellence’ towards a system that has to support fully distributed, collective human intelligence much more effectively. But even more critically, a modern data science and innovation ecosystem should be able to maximise the use of powerful, distributed digital assistants.
The roughly ten million times increase of storage and compute power over the past three decades, accompanied by roughly a hundred thousand times decrease in storage costs, has finally brought us to a point where ‘ICT’ is frequently mis-conceptualised as a commodity. Consequently, we capture so much data and subsequently reveal such complex patterns in it, that the human mind is unable to make sense of these patterns anymore. That is… without massive international collaborations and digital assistance. So, on top of the Internet for People, we now need an Internet for Machines, in which machine-actionable data and services will play a central role.
A need for data stewardship
Unfortunately, our ability to deal responsibly with data as the principal first-phase output of the scientific process has not kept pace with the generation and storage capabilities. The current reality is a glaring lack of expertise; a crippled practice of cottage-industry data stewardship; an almost complete lack of interoperability of data in domain silos; and a hopelessly outdated scholarly publication and reward system, which effectively prohibits open science and innovation.
Many reports have recently highlighted the unacceptable loss of valuable data, and the waste of time and effort as an estimated 70% of researcher time is spent on ‘data wrangling’. Furthermore, the persistence of narrative publishing in formats solely meant for human consumption is a nightmare for machine processing of results and data. It amounts to a means of hiding the data behind paywalls, embedded in and difficult to extract from figures and tables, with remote and volatile links to ‘supplementary data’, without proper metadata and provenance. This picture is even more gloomy as it is precisely the lack of access to and reusability of data that results in the emerging and well-documented reproducibility crisis in science.
Much of this lament is painfully familiar and has been made repeatedly with too little real impact for over two decades…
The role of mandated organisations such as CODATA
Data-focused organisations with a global mandate can play a major converging role in the decades to come. With its mandate from ISC (representing nearly 200 national members and the international scientific unions and associations), CODATA should be in a unique position to assist and guide where appropriate the transition to modern data stewardship for open science. It is high time for CODATA to become a global champion of machine-actionable data publishing, according to FAIR principles, supplemented with narrative for humans, and so to help ensure an optimal data substrate for modern, data-intensive and (thus) machine-assisted science and innovation. The emergence of open science, and the recent merger of the two parent councils, provide a timely occasion to recalibrate what the role of CODATA during and after the landslide may be. As a servant of the science and innovation communities world-wide, CODATA has to, first of all, redefine its goals in the new data reality. This should be done in line with the several high-level reports from the European Commission (such as the various reports and the SWD roadmap for the European Open Science Cloud) and from the United States (such as the consensus study Open Science by Design), while also taking into full account the simultaneous efforts in other continents including BRICS activities and efforts for open science in Africa and Latin America. Next, CODATA needs to redefine its unique added-value niche vis-à-vis other data-related initiatives, such as the Research Data Alliance (RDA) and the Global Open FAIR (GO FAIR) initiative. These are relatively young organisations compared to CODATA but they enjoy rapid uptake in the community and, in the turmoil associated with any landslide, there is confusion about the various roles they and CODATA play.
Multiple roles
The CODATA strategic plan 2013-2018 showed deep insight into the data revolution that was upon us even back then. However, the current rate of data production, and the associated analysis challenges, have far surpassed even the boldest predictions at the time. Currently, in many scientific disciplines, the learning algorithm, frequently hyped as ‘artificial intelligence’, is predominantly present in methodology. Contemporary science, even in disciplines where the other hype term ‘big data’ is not yet mainstream, increasingly relies on complex pattern recognition by powerful and self-learning algorithms, followed by human decision on ‘actionable knowledge’ emerging from ‘meaningful’ patterns. What we have seen in the past three years is a rapid development of machine-oriented initiatives such as the formulation of the FAIR principles (https://www.nature.com/articles/sdata201618), describing how data should be formulated, published and stewarded in a way that supports optimal reuse in open science and innovation for both machines and humans.
The Research Data Alliance (RDA) has also seen a remarkable growth pattern. Given the importance of these ‘data-driven science’ global movements, it is not surprising that RDA, GO FAIR and other organisations have arisen that address the opportunities and challenges of reusable data and services, each addressing different aspects and filling different, complementary niches in this tumultuous field. In fact, the time is now right to ensure that we create and support an efficient, mutually reinforcing ecosystem of these organisations. That means staking out the appropriate ground for each, clarifying appropriate working space and synergies, and eliminating unnecessary duplication. This vision includes clear definition of missions, comprising both bottom-up and more top-down approaches where appropriate, and focusing our efforts.
A vision of mutually reinforcing collaboration
The following section represents my initial thinking in this area, sharpened by many discussions with leading colleagues in this field:
The oldest international coordinating organisation in the data space is CODATA, which has been in existence for more than 50 years. CODATA is a committee of ISC, following the merger of the International Council for Science and the International Social Science Council. ISC has a second data-related initiative, the World Data System (WDS), which is ten years old as an International Programme Office (but has roots in the International Geophysical Year of 1958). In addition to these ISC-affiliated organisations, the Research Data Alliance (RDA) is a five-year-old grass-roots organisation mainly supported by the EC, the US NSF and NIST, and the Australian Department of Innovation. RDA has rapidly developed into a large (> 7000 individual members) organisation that serves a crucial public role, namely bottom-up consensus building about approaches, protocols and standards in expert communities. The EC and the EU member states have been particularly active in the data space, also conceiving and supporting the European Open Science Cloud (EOSC) Initiative. The supporting GO FAIR initiative (Global Open FAIR), initiated as a kick-start approach for the EOSC by the governments of The Netherlands, Germany and France, is rapidly growing into a practical network of existing networks of excellence in early implementation of community-adopted approaches, protocols, standards and training. These four key organisations (CODATA, WDS, RDA and GO FAIR) are all international and cross-disciplinary in scope, mandated and poised to support the global science enterprise including pan-European, and global, domain-specific research infrastructures and e-infrastructures. To better support global science, I propose investigating ways to better coordinate and differentiate the work space for these four and perhaps other more disciplinary and regional science data organisations.
In the spirit of community-wide consensus I have discussed these issues for several months now with the leadership of CODATA, RDA and GO FAIR and the following section represents my resulting view.
A triangular shape
The key roles of CODATA, RDA and GO FAIR are distinct, complementary and synergistic. They are depicted in the triangle model below. Like any model, it will always fall short of describing reality in all aspects and dimensions, but it is a way to visualise the various complementary roles. It should be stated as a preamble that in many concrete actions, the roles of the three organisations will overlap, for instance in training and education, and advocacy for best practices. Therefore, the triangle model is also meant to visualise how the tasks following from the focus described at the corners of the triangle dovetail whenever appropriate. With the recent establishment of the ISC, complementarity between the three organisations becomes even more pertinent. It should also be emphasised that each of the organisations has additional activities outside the scope of this collaborative structure.
RDA has a principal bottom-up working mechanism centred around interest and working groups that address, and where possible solve, intellectual challenges associated with solutions needed around research data in the broadest sense (data analytics services, software and basic compute issues are also in scope). This is done in a community-driven manner and leads to recommended solutions and designs. Obviously, these have to be tested for feasibility in practice, which can be done anywhere, but GO FAIR has a strong mandate and basis in a growing number of so-called GO FAIR Implementation Networks to rapidly test recommended solutions. These are expert communities with ‘critical mass’ (community leadership) and impact that can implement proposed solutions (by RDA or others) in practice. This also provides an early testing ground for such applications (in social change, training or actual module building for the Internet for FAIR data and services). Obviously, many key stakeholders in the community play a role in RDA working groups and in the organisation as well as in GO FAIR implementation networks.
CODATA is mandated by ISC as the international body for research data in the broadest sense, focusing on data policies, data science and data skills and education. Despite being a lean organisation, CODATA is involved in various implementation activities on interoperability, capacity building, training and dissemination, but has also played a crucial role in the development of key data policies and principles (including those of the OECD, GEO and the ISC-endorsed ‘Open Data in a Big Data World’) that have effectively become ‘soft law’ for the scientific community.
In continuous and structural collaboration, the three organisations, having already established very good practical working relationships and participating in each other’s activities whenever appropriate, can collectively serve the community by providing organised and emerging consensus building and design, coordinated early expert implementation and broad adoption of best practices. This is all pre-competitive, but can form the basis for certification of providers in the EOSC, US, BRICS and beyond.
What is my motivation to serve as CODATA president for the 2018-2022 term?
Being involved in the early days of RDA and (GO) FAIR, I have seen many critical decision points where the development of an effective ecosystem for open, FAIR science and innovation could have gone astray.
Risks for science in the data intensive age include (re)centralisation, recidivist monopoly formation, exclusion of the private sector (critical for innovation and scaling), defending powerhouses built in the transition phase, and further propagation of ‘yet more standards’. In addition, there are many misperceptions around frequently used hype terms such as ‘open’ (versus FAIR), AI, Big Data, Data Sharing, Open Access (articles) versus Open Science, Linked (Open) Data, Semantic Modelling etcetera. It is therefore imperative to support global, community compliant consensus building on commonly accepted definitions of these central concepts. CODATA, in close collaboration with RDA, GO FAIR and others, could prevent many of these potential mistakes and play a key intellectual leadership role in the transition phase described here.
As CODATA President I would like to work with the core staff in this multi-organisational ecosystem and ensure that CODATA will have a solid, specific, recognised, and effective role. I was the organiser of the foundational meeting in 2014 where the FAIR principles were conceived and the Chair of the HLEG of the EC on the EOSC, and I currently co-lead the GO FAIR International support and coordination offices, with branches in the Netherlands, Germany and France. I have extensive connections to RDA, and serve on the US National Academy of Sciences Board for Research Data and Information (which is the US National Committee for CODATA). If I were also to help lead CODATA, I would concentrate on an ambassadorial role for the joint ecosystem of the various organisations as summarised in the triangle model above. I would thus be in a good position, and keen, to use these relationships to bring RDA, GO FAIR and CODATA closer together, and to help determine the role of WDS in the new reality, each with their specific and complementary expertise networks, thus creating greater strength for the common good.
For all this to happen, it will be of critical importance that each of the supporting organisations is mandated and properly funded (although at the leanest possible level) to serve the science and innovation communities, without ever competing for the same funds. They should focus on those supra-level tasks that never make it to the top of the priority list of individual researchers and innovators.
If you agree that the time has come to better coordinate and possibly consolidate the international organisations in this important area, and that appropriate mandates and resources to achieve this goal will be put into place, I would be happy to serve you as CODATA President.