How to Address Data Challenges in the Biomedical Field: Solutions for Data Access, Sharing and Reuse

Irene Pasquetto is a PhD Student in Information Studies at UCLA.

As scientists in the biomedical fields generate more, and more diverse, data, the real question today is not only how to make data “sharable” or “open”, but also, and especially, how to make them useful and reusable. At Scidatacon 2016, speakers from funding agencies, research universities, data research institutions, and the publishing industry came together to try to address this key question.

Around 20 highly interdisciplinary papers, organized in four busy sessions, addressed the problem from different perspectives while agreeing on an essential point: developing new, open frameworks and guidelines is not enough. Indeed, what characterized this latest edition of Scidatacon was a focus on proposing and discussing applicable solutions that can address the management, use, and reuse of large-scale datasets in biomedicine right now.

Three main themes emerged across the sessions:

  1. How to enable scientific reproducibility.
  2. How to apply data science techniques to biological research.
  3. How to make heterogeneous bio-databases globally interoperable.


Leslie McIntosh (Director, Center for Biomedical Informatics, Washington University in St. Louis) moderated session 1, which focused on the first topic: solving the problem of reproducibility in science, starting with making biomedical data reusable to this end.

Tim Errington (Center for Open Science) offered a clear and useful distinction between reproducibility, which he defined as the possibility of re-running an experiment the way it was originally conducted, and replicability, which is the possibility of getting the same results by reusing the same methods of data collection and analysis with novel data. Errington invited the audience to reflect on two main issues. First, incentives for individual success are focused on “getting it published, not getting it right.” Second, instead of focusing on problems with either open access or open data, we should think about “open workflows” that include the whole process of scientific research.

Similarly, Anthony Juehne (Washington University in St. Louis) talked about how to address reproducibility issues step by step across the entire “scientific workflow”. Juehne presented two possible solutions to the problem: “Wrap, Link, and Cite” data products, or “Contain and Visualize” them using virtual machines.

Finally, Cynthia Hudson Vitale highlighted an aspect rarely addressed in the reproducibility community: the fundamental role played by biocurators. Although their work often goes unacknowledged, biocurators are the ones who de facto do the hard job of cleaning and organizing the data so that they can be used to reproduce experiments. Hudson Vitale proposed two concrete solutions to the problem. First, domain reproducibility articles need to include a greater variety of curation treatments. Second, curators need to publish in domain journals to ensure that the full breadth of curation treatments is discussed with researchers.


A second main theme, which emerged in session 2, was how to apply recent cutting-edge statistical and computational techniques for data science (machine learning algorithms, deep learning text mining) to the biomedical knowledge discovery process. The session was introduced and moderated by Jiawei Han (University of Illinois at Urbana-Champaign); computer scientists, biologists, and biomedical researchers working on biological text mining presented overviews and surveys on the topic.

Beth Sydney Linas and Wendy Nilsen from IIS, the Division of Information and Intelligent Systems (NSF – National Cancer Moonshot), gave an overview of how data science can be used to uncover the underlying mechanisms that drive cancer and to develop methods that will allow clinical researchers to eliminate the disease. The researchers concluded that future work in novel computing (especially machine learning, artificial intelligence, network analysis, and database mining, as well as bioinformatics and image analysis) also needs to be directed toward health-related research.

Elaine M. Faustman (University of Washington) presented an annotated database of DNA and protein sequences derived from environmental sequences showing antibiotic resistance (AR) in laboratory experiments. The database aims to help fill the current gap in knowledge about the relations between antibiotic resistance genes present in the environment and genomic sequences derived from clinical antibiotic-resistant isolates.

Jiawei Han, Heng Ji, Peipei Ping, and Wei Wang presented results from their analysis of a massive collection of biomedical texts from the medical research literature using semi-supervised text mining. The researchers argued that interesting biological entities and relationships that are currently “lost” in unstructured data can be efficiently rediscovered by applying bio-text mining techniques to PubMed's massive biological text corpus.


Finally, over 10 presenters in sessions 3 and 4 shared their own first-hand experiences in building and managing integrated biomedical databases and making them interoperable. The biomedical research community and funders seek to make their research resources “FAIR”: findable, accessible, interoperable, and reusable. They also seek to support improved data stewardship by strengthening incentives such as data citation. Speakers shared a common concern: how to create data standards and practices from the bottom up. As the speakers suggested, it is necessary to be aware of existing local, cultural, and social incentives, to clearly define possible audiences, and to involve the scientists in the database-building process. Individual projects can be consulted at the sessions’ webpage: