{"id":1533,"date":"2018-10-18T10:51:31","date_gmt":"2018-10-18T10:51:31","guid":{"rendered":"http:\/\/codata.org\/blog\/?p=1533"},"modified":"2018-10-18T10:56:35","modified_gmt":"2018-10-18T10:56:35","slug":"tony-hey-candidate-for-the-codata-executive-committee-and-codata-president","status":"publish","type":"post","link":"https:\/\/codata.org\/blog\/2018\/10\/18\/tony-hey-candidate-for-the-codata-executive-committee-and-codata-president\/","title":{"rendered":"Tony Hey: Candidate for the CODATA Executive Committee and CODATA President"},"content":{"rendered":"<p><em><a href=\"https:\/\/codata.org\/blog\/wp-content\/uploads\/2018\/10\/Tony-Photo2.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright wp-image-1541\" src=\"https:\/\/codata.org\/blog\/wp-content\/uploads\/2018\/10\/Tony-Photo2-300x236.jpg\" alt=\"\" width=\"254\" height=\"200\" srcset=\"https:\/\/codata.org\/blog\/wp-content\/uploads\/2018\/10\/Tony-Photo2-300x236.jpg 300w, https:\/\/codata.org\/blog\/wp-content\/uploads\/2018\/10\/Tony-Photo2.jpg 575w\" sizes=\"auto, (max-width: 254px) 100vw, 254px\" \/><\/a>This is the fifteenth\u00a0in the series of short statements from candidates in the forthcoming CODATA\u00a0Elections at the General Assembly to be held on 9-10 November in Gaborone, Botswana, following\u00a0<a href=\"http:\/\/internationaldataweek.org\/\">International Data Week<\/a>. Tony Hey is a candidate for the CODATA Executive Committee as CODATA President. 
\u00a0<strong>He was nominated by the UK.<\/strong><\/em><\/p>\n<p><strong>CODATA\u2019s Mission<\/strong><\/p>\n<p>\u201cCODATA exists to promote global collaboration to advance Open Science and to improve the availability and usability of data for all areas of research.\u00a0 CODATA supports the principle that data produced by research and susceptible to be used for research should be <a href=\"http:\/\/www.codata.org\/strategic-initiatives\/international-data-policy-committee\">as open as possible and as closed as necessary<\/a>.\u00a0 CODATA works also to advance the interoperability and the usability of such data: research data should be <a href=\"http:\/\/www.codata.org\/working-groups\/fair-data-expert-group\">FAIR (Findable, Accessible, Interoperable and Reusable)<\/a>. By promoting the policy, technological and cultural changes that are essential to promote Open Science, CODATA helps advance ISC\u2019s vision and mission of advancing science as a global public good\u201d [1]<\/p>\n<p><strong>Preface<\/strong><\/p>\n<p>In my present position as the Chief Data Scientist of the UK\u2019s Science and Technology Facilities Council (STFC), I am based at the Rutherford Appleton Laboratory (RAL), on the Harwell Campus near Oxford. The Harwell site hosts the Diamond Synchrotron, the ISIS Neutron Source and the UK\u2019s Central Laser Facility. My primary role is to support the university users of these large-scale experimental facilities at RAL in managing and analyzing their research data. The users of these facilities now perform experiments that generate increasingly large and complex datasets which need to be curated, analyzed, visualized and archived, and their new scientific discoveries published in a manner consistent with the FAIR principles. In addition, I work with the Hartree Supercomputing Centre at the STFC\u2019s Daresbury Lab near Manchester. 
The Hartree Centre works mainly with industry and supports their computer modelling and data science requirements.<\/p>\n<p>I am therefore intimately acquainted with the challenges of open science and believe, thanks in part to the activities of CODATA, together with its fellow ISC organization, the World Data System (WDS), and now also with their younger partner, the Research Data Alliance (RDA), that the global scientific research community has made significant progress towards the goals of the CODATA mission over the last five years. However, there is still much more to do before we can realize anything close to Jim Gray\u2019s vision of the full text of all publications online and accessible, linked to the original datasets with sufficient metadata that other researchers can reuse and add new data to generate new scientific discoveries. In his last talk before he went missing at sea, he summed up this vision in the \u2018pyramid\u2019 diagram below [2]:<\/p>\n<p><a href=\"https:\/\/codata.org\/blog\/wp-content\/uploads\/2018\/10\/Tony-Hey_blog_pic.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-1534 aligncenter\" src=\"https:\/\/codata.org\/blog\/wp-content\/uploads\/2018\/10\/Tony-Hey_blog_pic.png\" alt=\"\" width=\"469\" height=\"336\" srcset=\"https:\/\/codata.org\/blog\/wp-content\/uploads\/2018\/10\/Tony-Hey_blog_pic.png 444w, https:\/\/codata.org\/blog\/wp-content\/uploads\/2018\/10\/Tony-Hey_blog_pic-300x215.png 300w\" sizes=\"auto, (max-width: 469px) 100vw, 469px\" \/><\/a><\/p>\n<p>The European Open Science Cloud (EOSC) has a similar vision but is aiming to provide a much more detailed roadmap towards realizing a vision of global research that is Findable, Accessible, Interoperable and Reusable (FAIR). 
The work of the EOSCpilot project to define a core set of metadata properties \u2013 the EOSC Dataset Minimum Information or EDMI \u2013 that are \u201csufficient to enable data to be findable by users and suitably \u2018aware\u2019 programmatic services\u201d is a good start [3]. The Australian Research Data Commons (ARDC) established in 2018, subsuming the Australian National Data Service (ANDS), also has a similar vision.<\/p>\n<p><strong>My Vision for CODATA<\/strong><\/p>\n<p>I very much support the three major strategic programs put forward in CODATA\u2019s Strategic Plan 2013 \u2013 2018, namely:<\/p>\n<ul>\n<li>Data Principles and Practice<\/li>\n<li>Frontiers of Data Science<\/li>\n<li>Capacity Building<\/li>\n<\/ul>\n<p>However, given the promising developments of the last five years it is now time to develop a third strategic plan covering the next five years of the CODATA organization. Development of this new strategic plan must be a major priority for CODATA and it will be important to reach out to all the relevant national and international stakeholder organizations for their input. However, in addition to CODATA\u2019s traditional stakeholders, I would also like to learn from the experience of other major efforts in this space. For example, from the US, this could include input from the NIH\u2019s National Library of Medicine, the DOE\u2019s OSTI organization and the NSF\u2019s DataONE project. From Europe, there will be much activity in creating an implementation of the European Open Science Cloud (EOSC). I would also look for input from other major data science initiatives in Asia and Australia.<\/p>\n<p>In addition to developing detailed plans and deliverables for the three broad CODATA priority areas for the next five years, I would like to give my support to two other areas. 
During my career in data-intensive science &#8211; in the UK with e-Science and in my work with Microsoft Research in the US \u2013 I have worked closely with universities and funding agencies in Europe, North and South America, Asia and Australia. I now think it is important to dedicate more attention to Africa where I think CODATA can play a significant role. I am therefore personally very supportive of the existing CODATA initiative to develop an African Open Science Platform and would look for ways to extend this initiative and increase its impact. One way in which to do this is to harness CODATA\u2019s global reach and influence which can successfully bring together countries at many different levels of economic development. The international SKA project will also generate many interesting computing, data science and networking challenges in Africa.<\/p>\n<p>The second focus I would like to develop is related to my present role as leader of the Scientific Machine Learning research group at RAL. There is now much activity world-wide in the application of the latest advances in AI and Machine Learning technologies to scientific data. This is one of the few areas where the academic research community has large and complex data sets that can compete with the \u2018Big Data\u2019 available to industry. Extracting new scientific insights from these datasets will require the use of advanced statistical techniques, including Bayesian methods and \u2018deep learning\u2019 technologies. 
In addition, an extensive education program to train researchers in the application of these data analytic technologies will be necessary and can build upon practical experience in applying such methods to \u2018Big Scientific Data.\u2019 In this way CODATA can help train a new generation of data analysts who are not only able to generate new insights from scientific data but also to spur innovation with industry and aid economic development.<\/p>\n<p>While at Microsoft Research, I was a founding Board member of the RDA organization. As an RDA \u00a0Board member, I liaised extensively with both the NSF in the USA, and with the Commission in Europe, and assisted in facilitating the constructive cooperation of RDA with CODATA. I will therefore bring extensive management experience to the leadership of CODATA \u2013 from my experience in the university sector as research group leader, department chair and dean of engineering, in UK research funding councils as a program director and chief data scientist, and in industry as manager of a globally distributed outreach team. I am disappointed to see the absence of many European countries from the CODATA membership and, through my experience in European research projects, I would seek to encourage these missing nations to become members of the organization. In addition, in my role at Microsoft Research, I spent considerable time visiting universities and funding agencies in Central and South America, and in Asia. I believe there is considerable potential to interest non-member countries in these regions in the relevance of the data science agenda of CODATA. 
Finally, although I will certainly bring my vision, enthusiasm and energy to the role of CODATA President, I believe that we must harvest the energy and enthusiasm of the entire CODATA community to take the organization forward to a new level of influence and effectiveness.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>My Background<\/strong><\/p>\n<p>I am standing for election to the CODATA Presidency because I have long been an advocate for Open Access and Open Science. My passion for this topic and for the era of \u2018Big Scientific Data\u2019 dates back to the years from 2001 to 2005 when I was director of the UK\u2019s eScience program. With Anne Trefethen, I wrote a paper in 2003 with the title \u201c<em>The Data Deluge: An e-Science Perspective<\/em>\u201d. This paper was certainly one of the earliest papers to talk about the transformative effects on science of the imminent deluge of scientific data [4]. In 2006, I was invited to give a keynote talk on eScience at the CODATA Conference in Beijing. While I was a Vice President at Microsoft Research, we celebrated the achievements of my late colleague, the Turing Award winner Jim Gray, by publishing a collection of essays in 2009 that illustrated the emergence of a new \u2018Fourth Paradigm\u2019 of Data-Intensive Science [2].<\/p>\n<p>During the eScience program, which received significant funding from both the UK Research Councils and from Jisc, the UK research community explored many issues about the scientific data pipeline that are still important and relevant today. One project, for example, examined the preservation and sharing of scientific workflows. Another project looked in detail at recording the provenance of a dataset. This effort ultimately led in 2013 to the emergence of the W3C \u2018PROV\u2019 standard for provenance. Several other eScience projects explored the use of RDF and semantic web technologies such as OWL and SPARQL for enhancing research metadata. 
Although these technologies have proved popular with several academic research communities, it is probably fair to say that they have not so far been broadly adopted by most research communities nor by the major IT companies. In my role as chair of Jisc\u2019s research committee, I supported the establishment of the Digital Curation Centre (DCC) in Edinburgh in 2004. The DCC was one of the first organizations to propose a set of guidelines for scientific data management plans (DMPs). The Jisc research committee also funded the National Centre for Text Mining (NaCTeM) in Manchester which offers a broad range of text-mining services. In the age of \u2018Big Scientific Data\u2019, high-bandwidth, end-to-end networking performance is an increasingly necessary element of a nation\u2019s e-infrastructure. As a result, the Jisc research committee funded Janet, the UK\u2019s NREN, to follow the lead of SURFnet in the Netherlands by introducing optical fibre \u2018lambda\u2019 technology. \u00a0Janet can now provide dedicated \u2018lightpath\u2019 support to users requiring long-term, persistent high-volume data transfers between locations. I believe that these e-Science examples still have relevance to CODATA and to the practice of data science today.<\/p>\n<p><strong>Progress towards Open Science, Open Access and \u2018Open\u2019 Data<\/strong><\/p>\n<p>Open science must start with genuine open access to the full text of research papers. Before becoming Director of the UK\u2019s eScience program, I was Head of Department of the Electronics and Computer Science (ECS) Department and then Dean of Engineering at the University of Southampton. 
Recognizing the crisis that university library budgets were facing in terms of rising journal subscriptions, with the support of Wendy Hall, Stevan Harnad and Les Carr, the ECS Department funded and developed the well-known ePrints repository software and established one of the first \u2018Green Open Access\u2019 institutional repositories. In the UK, there is now widespread deployment of university research repositories that contain the full text of research papers, albeit with access usually subject to a publisher embargo period of 6 or 12 months.<\/p>\n<p>By contrast, in the US, a historic memo from the White House Office of Science and Technology Policy (OSTP) in 2013 required US funding agencies <em>\u201cto develop a plan to support increased public access to the results of research funded by the Federal Government.\u201d<\/em> More importantly for CODATA\u2019s agenda, the memo also specified that the \u2018results of research\u2019 include not only the scientific research papers but also the accompanying research data. It defined research data as <em>\u201cthe digital recorded factual material commonly accepted in the scientific community as necessary to validate research findings including data sets used to support scholarly publications, but does not include laboratory notebooks, preliminary analyses, drafts of scientific papers, plans for future research, peer review reports, communications with colleagues, or physical objects, such as laboratory specimens.\u201d<\/em><\/p>\n<p>The OSTP memo has led to all the major US funding agencies developing open science policies and establishing research repositories that contain the full text of research papers linked to the corresponding datasets generated by the researchers that they fund. 
The two most prominent repository systems are the National Institutes of Health\u2019s PubMed Central with its associated databases, and the Department of Energy\u2019s PAGES system managed by the Office of Scientific and Technical Information (OSTI). In contrast to this US funding-agency-centred view, UK Research Councils now require all researchers to have a Data Management Plan in their research proposals and look to the universities and specialist subject repositories to be responsible for the outputs of research that they fund. For example, one Council, the Engineering and Physical Sciences Research Council, now requires that \u201c<em>Research organizations will ensure that EPSRC-funded research data is securely preserved for a minimum of 10 years from the date that any researcher \u2018privileged access\u2019 period expires<\/em>\u201d.<\/p>\n<p>The developments described above have taken place over the last five years and constitute significant progress toward open science. I am therefore optimistic that CODATA, together with WDS and RDA, and supported by national and international research funding agencies, can continue to make major strides towards changing the culture of researchers about their research data. My optimism is further fuelled by the steady increase in research registrations for ORCID IDs and DOIs for datasets and software by university researchers.<\/p>\n<p>Two very recent developments are also exciting. The first is the announcement of Science Europe\u2019s \u2018cOAlition\u2019 for open access. Eleven European research funding agencies have agreed to focus their open science efforts on the very ambitious \u2018Plan S\u2019 &#8211; which aims at \u2018Making Open Access a Reality by 2020\u2019 [5]. The second notable development is Google\u2019s introduction of a new Dataset Search service which has the potential to become a significant aid to data discoverability [6]. 
The service makes use of the industry-supported \u2018schema.org\u2019 initiative which aims to add some semantic information to the metadata describing the dataset.<\/p>\n<p><em>\u201cSchema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond. Schema.org vocabulary can be used with many different encodings, including RDFa, Microdata and JSON-LD. These vocabularies cover entities, relationships between entities and actions, and can easily be extended through a well-documented extension model\u201d<\/em> [7].<\/p>\n<p>The recent work by the ELIXIR collaboration on the Bioschemas extension to schema.org is intended to improve data interoperability in life sciences research. Bioschemas is a collection of specifications that provide guidelines to facilitate a more consistent adoption of schema.org markup within the life sciences [8]. Such initiatives are important indicators of the direction of open science. I therefore believe that a pragmatic approach to machine actionable metadata that is based on schema.org and subject-specific extensions represents a practical way forward for the majority of scientific research communities.<\/p>\n<p><strong>Tony Hey<br \/>\n<\/strong><strong>Rutherford Appleton Laboratory<br \/>\n<\/strong><strong>Science and Technology Facilities Council<br \/>\n<\/strong><strong>Harwell Campus<br \/>\n<\/strong><strong>Didcot, OX11 0QX, UK<\/strong><\/p>\n<p><strong>\u00a0<\/strong><strong>References<\/strong><\/p>\n<p>[1] CODATA Mission statement, <a href=\"http:\/\/www.codata.org\/about-codata\/our-mission\">http:\/\/www.codata.org\/about-codata\/our-mission<\/a><\/p>\n<p>[2] \u201cThe Fourth Paradigm: Data-Intensive Scientific Discovery\u201d, edited by Tony Hey, Stewart Tansley and Kristin Tolle, published by Microsoft Research, 2009, ISBN: 978-0-9825442-0-4<\/p>\n<p>[3] EOSCpilot, D6.3: 
1<sup>st<\/sup> Report on Data Interoperability: Findability and Interoperability, <a href=\"https:\/\/eoscpilot.eu\/sites\/default\/files\/eoscpilot-d6.3.pdf\">https:\/\/eoscpilot.eu\/sites\/default\/files\/eoscpilot-d6.3.pdf<\/a><\/p>\n<p>[4] Tony Hey and Anne Trefethen.\u00a0 \u201cThe Data Deluge: An e-Science Perspective\u201d, Chapter in \u201cGrid Computing &#8211; Making the Global Infrastructure a Reality\u201d, edited by F Berman, G C Fox and A J G Hey, Wiley, pp.809-824 (2003)<\/p>\n<p>[5] Plan-S, <a href=\"https:\/\/www.scienceeurope.org\/wp-content\/uploads\/2018\/09\/cOAlitionS_Press_Release.pdf\">https:\/\/www.scienceeurope.org\/wp-content\/uploads\/2018\/09\/cOAlitionS_Press_Release.pdf<\/a><\/p>\n<p>[6] Google Dataset Search, <a href=\"https:\/\/ai.googleblog.com\/2018\/09\/building-google-dataset-search-and.html\">https:\/\/ai.googleblog.com\/2018\/09\/building-google-dataset-search-and.html<\/a><\/p>\n<p>[7] Schema.org, <a href=\"https:\/\/schema.org\/\">https:\/\/schema.org\/<\/a><\/p>\n<p>[8] Bioschemas, <a href=\"http:\/\/bioschemas.org\/\">http:\/\/bioschemas.org\/<\/a><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is the fifteenth\u00a0in the series of short statements from candidates in the forthcoming CODATA\u00a0Elections at the General Assembly to be held on 9-10 November in Gaborone, Botswana, following\u00a0International Data Week. Tony Hey is a candidate for the CODATA Executive Committee as CODATA President. \u00a0He was nominated by the UK. 
CODATA\u2019s Mission \u201cCODATA exists to promote [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[28],"tags":[],"class_list":["post-1533","post","type-post","status-publish","format-standard","hentry","category-codata-elections-2018"],"_links":{"self":[{"href":"https:\/\/codata.org\/blog\/wp-json\/wp\/v2\/posts\/1533","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/codata.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/codata.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/codata.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/codata.org\/blog\/wp-json\/wp\/v2\/comments?post=1533"}],"version-history":[{"count":7,"href":"https:\/\/codata.org\/blog\/wp-json\/wp\/v2\/posts\/1533\/revisions"}],"predecessor-version":[{"id":1545,"href":"https:\/\/codata.org\/blog\/wp-json\/wp\/v2\/posts\/1533\/revisions\/1545"}],"wp:attachment":[{"href":"https:\/\/codata.org\/blog\/wp-json\/wp\/v2\/media?parent=1533"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/codata.org\/blog\/wp-json\/wp\/v2\/categories?post=1533"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/codata.org\/blog\/wp-json\/wp\/v2\/tags?post=1533"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}