Category Archives: SciDataCon 2016

Posts relating to SciDataCon 2016, held in Denver, Colorado as part of International Data Week, 11-17 September 2016

SciDataCon: Opening keynotes

It was a pleasure to start off the first full day of SciDataCon with a keynote from Elaine M Faustman, Professor and Director at the Institute for Risk Analysis and Risk Communication, University of Washington and member of the ICSU World Data System Scientific Committee. Professor Faustman’s keynote talk, ‘Challenges and Opportunities with Citizen Science: How a decade of experiences have shaped our forward paths’, introduced a welcome early focus on the importance of rigorous ethical approaches to ‘citizen science’ research projects. Looking back to the early history of the knowledge practices we now call ‘science’, Faustman reminded us that the roots of these practices can be found in the work of European gentleman scientists and their cabinets of curiosities. She also situated contemporary citizen science practice in the US legislative framework of the citizen’s right to know, which has been underpinned by standards and acts from the 1940s onwards.

Reflecting on more than a decade of citizen science practice in the environment and public health domains, Faustman provided examples of projects where citizens are not only research subjects but are centrally influential in the work, to the point where they express ownership of the project alongside the university team. Discussion focused on the importance of the ability of research participants to influence the direction and scope of the research project, to provide feedback on its progress, and to have access to the data accrued in order to be able – in the case of public health projects at least – to use it to guide their own decision-making. The message of deep ethical engagement and building respectful relationships with participants set the scene for a day in which ethical issues reverberated.

The second keynote was by Simon Cox, Research Scientist, Environmental Informatics, CSIRO Land and Water, Clayton, Melbourne, Australia. In his talk, ‘What does that symbol mean? – controlled vocabularies and vocabulary services’, Cox raised a very pragmatic point about the widespread problem of non-systematic use of symbols – and keywords – in data. He demonstrated that we assume symbols and keywords have some sort of shared meaning, at least in a given community, but that the reality is much less systematic. Symbols and abbreviations with no widely used consistent meaning are often used by researchers when creating data. Popular terms describing volume can mean entirely different things in different countries. And even symbols or terms describing a widely understood measurement, such as the metre, can be problematic to link to a common source: the International Bureau of Weights and Measures provides a definition, which can be found via a given URI. But the fact that this URI has changed regularly from year to year disrupts any expectation of a stable, enduring location for this definition.

Cox suggested a couple of actions to mitigate this situation. Firstly, a new CODATA task group on coordinating data standards will take this work forward. Secondly, the Global Agricultural Concept Scheme – GACS – is the result of three defining sources from agricultural research banding together to deduplicate their respective vocabularies and make them interoperable for agricultural researchers. Cox noted that the technical job is not large but that – in confluence with Faustman’s earlier message – the really big job is achieving the buy-in from the community in question.

So the pesky human dimension appears right at the start of International Data Week!  More information on the keynotes is at

Laura Molloy is a doctoral researcher at the Oxford Internet Institute and the Ruskin School of Art, University of Oxford. She is on Twitter at @LM_HATII.

How to Address Data Challenges in the Biomedical field: Solutions for Data Access, Sharing and Reuse

Irene Pasquetto is a PhD Student in Information Studies at UCLA.

As scientists in the biomedical fields are generating more and more diverse data, the real question today is not only how to make data “sharable” or “open”, but also, and especially, useful and reusable. At SciDataCon 2016, speakers from funding agencies, research universities, data research institutions, and the publishing industry came together to try to address this key question.

Around 20 highly interdisciplinary papers, organized in four busy sessions, addressed the problem from different perspectives while agreeing on an essential point: developing new, open frameworks and guidelines is not enough. Indeed, what characterized this latest edition of SciDataCon was a focus on proposing and discussing applicable solutions that can address the management, use, and reuse of large-scale datasets in biomedicine today, right now.

Three main themes emerged across the sessions:

  1. How to enable scientific reproducibility.
  2. How to apply data science techniques to biological research.
  3. How to make heterogeneous bio-databases globally interoperable.


Leslie McIntosh (Director, Center for Biomedical Informatics, Washington University in St. Louis) moderated session 1, which focused on the first topic: solving the problem of reproducibility in science, starting from making biomedical data reusable to this end.

Tim Errington (Center for Open Science) offered a clear and useful distinction between reproducibility, which he defined as the possibility of re-running the experiment the way it was originally conducted, and replicability, which is the possibility of getting the same results by reusing the same methods of data collection and analysis with novel data. Errington invited the audience to reflect on two main issues: first, incentives for individual success are focused on “getting it published, not getting it right,” and second, instead of focusing on problems with either open access or open data, we should think about “open workflows” that include the whole process of scientific research.

Similarly, Anthony Juehne (Washington University in St. Louis) talked about how to address reproducibility issues step by step across the entire “scientific workflow”. Juehne presented two possible solutions to the problem: “Wrap, Link, and Cite” data products, or “Contain and Visualize” them using virtual machines.

Finally, Cynthia Hudson Vitale highlighted a rarely addressed aspect of the reproducibility debate: the fundamental role played by biocurators. While their work is often not acknowledged in the community, biocurators are those who de facto do the hard job of cleaning and organizing the data in a way that can be used to reproduce experiments. Hudson Vitale proposed some concrete solutions to the problem. First, domain reproducibility articles need to include a greater variety of curation treatments. Second, curators need to publish in domain journals to ensure the full breadth of curation treatments is discussed with researchers.


A second main theme, which emerged in session 2, was how to apply recent cutting-edge statistical and computational techniques for data science (machine learning algorithms, deep learning text mining) to the biomedical knowledge discovery process. In a session introduced and moderated by Jiawei Han (University of Illinois at Urbana-Champaign), computer scientists, biologists and biomedical researchers working on biological text mining presented overviews and surveys on the topic.

Beth Sydney Linas and Wendy Nilsen from IIS, the Division of Information and Intelligent Systems (NSF – National Cancer Moonshot), gave an overview of how data science can be used to uncover the underlying mechanisms that drive cancer and to develop methods that will allow clinical researchers to eliminate the disease. The researchers concluded that future work in novel computing (especially machine learning, artificial intelligence, network analysis, database mining, bioinformatics and image analysis) also needs to be directed towards health-related research.

Elaine M. Faustman (University of Washington) presented an annotated database of DNA and protein sequences derived from environmental sequences showing antibiotic resistance (AR) in laboratory experiments. The database aims to help fill the current gap in knowledge about the relations between antibiotic resistance genes present in the environment and genomic sequences derived from clinical antibiotic-resistant isolates.

Jiawei Han, Heng Ji, Peipei Ping and Wei Wang presented results from their analysis of a massive collection of biomedical texts from medical research literature using semi-supervised text mining. The researchers argued that interesting biological entities and relationships that are currently “lost” in unstructured data can be efficiently rediscovered by applying bio-text mining techniques to PubMed’s massive biological text corpus.


Finally, over 10 presenters in sessions 3 and 4 shared their own first-hand experiences in managing and building integrated biomedical databases and making them interoperable. The biomedical research community and funders seek to make their research resources “FAIR” (findable, accessible, interoperable, and reusable), and also seek to support improved data stewardship by strengthening incentives such as data citation. Speakers shared a common concern: how to create data standards and practices from the bottom up. As suggested by the speakers, it is necessary to be aware of existing local, cultural and social incentives, clearly define possible audiences, and involve the scientists in the database-building process. Individual projects can be consulted at the sessions’ webpage: