Category Archives: SciDataCon 2016

Posts relating to SciDataCon 2016, held in Denver, Colorado as part of International Data Week, 11-17 September 2016

Humans of Data 16

“I wonder whether any ethics are applied in collection of samples of Ebola and HIV/AIDS in emergency situations. When I talk to doctors about it, they are aware that some researchers from the developed world provide expertise and fund research in pandemic situations. But there are issues on data collection ethics based on informed consent by subjects that deserve scrutiny, given the emergency situations and language barriers under which data is collected. Are there Memoranda of Understanding between African governments and researchers under these conditions? There is a need for transparency and openness. Given the extensive ethical regulations for research on human subjects in the developed world, African countries – which are prone to pandemic disease problems – must engage in the discourse on ethics of data collected under the unique situations that they experience.

It’s the same in the humanities and social sciences: researchers come and take and go. It is rare for research projects to include funding in the initial project proposal for reporting back to the subjects of the research. In Botswana, there was a national scan for indigenous knowledge. We were promised there would be a report back [to the community], but the [research team] never came back. And then researchers are surprised that participants don’t trust them!

Colonialism was first about land resources. Now, without open access, globalisation of research may become the next wave of colonisation. Lower and middle income country researchers need to engage in open debates among themselves on the ownership of data, and how to develop collaboration from collection to analysis with a view to facilitating shared benefits and innovative re-use. Only in this way will the issues of intellectual property rights be negotiated in an equal exchange. All researchers – but especially Africa’s researchers – should reflect on the necessary policy and regulatory frameworks that should be negotiated with local institutions and national governments, as part of their intellectual contributions to evidence-based solutions and sustainable development.

Openness is about exposing your strengths and weaknesses. No one should be intimidated that some have more money. Others have ideas.”

Humans of Data 15

“I’m very passionate about open science. From where I stand in the context of Africa, there’s so much data we create in government, universities or communities. But at the moment, data, which is the base for reports and provides evidence for government decisions, is not accessible to all except the researcher and specialist research reference committees. The Botswana Government has a closed research culture, as does the local research community within the academic and private sector circles. As a librarian, that has always been my concern. When work is done in that way, you find that the resultant data is archived and owned by the funding authority.

The current system is dysfunctional due to a lack of regulatory mechanisms, appropriate follow-up processes and systems for creating national open databases. Without reliable databases for research reports, research data cannot be open or accessible.”

Humans of Data 9

“We need more south–south collaborations. I’d like to approach this and get in touch with people I’ve met here, and I’m trying to identify other people in Latin America that have the same interests. Our data problems might be different from England or Canada or elsewhere in the north. We have a lot of data that might be at risk of disappearing in the next few years, and this might be a bigger problem in developing countries.

I’m also concerned about how the southern hemisphere is going to contribute. How do I get the funds that I need to get the work done that I need to do? Trying to be part of this community is going to be a challenge for financial reasons. I would surely not be here except for GEO and CODATA support; this was very special for me to receive that funding. Otherwise I would miss this incredible opportunity for networking and knowledge sharing.

I think that open science is the only way forward to answer the complex problems that have been presented by society. These problems are not local and involve so many different knowledge domains. We need to do science from a more collaborative perspective to be able to tackle these challenges. Collaboration is what I’m really passionate about. When I return to Brazil I’ll start to talk to people and see how we can go from here.”

Humans of Data 7

“I entered into the data profession about three and a half years ago. I found the community to be very welcoming. The ideas of ethics and sustainability are starting to be brought forward more strongly now. Data aren’t just digits in the memory. They have real world effects in real world situations.

One of the things that drew me particularly to the idea of preserving data, is to build on the research investments that people have made. People spend their lives exploring questions. If the information and data those answers are based on, aren’t kept useable in an understandable way, then the answers themselves are also lost. The end result is so many wasted lives when you add it up. It’s the time invested in the exploring these questions, but even more in a broadly humanitarian way, these answers are pursued to improve the lot of humanity. If the data collected through research are lost, the answers themselves are lost, and so the people, the environmental effects are also lost. So I think that’s my most important concern.

Look, I like efficiency. I like effectiveness. Not taking care of things you’ve spent time making, not making sure they can be used effectively – that’s a waste of everyone’s time and effort. It just bugs me. Data is the starting point for any answers we achieve through research. Let’s not waste that effort. If there’s anything this community could respond more in, it’s the human-related areas – the marketing and advertising of the importance of data and the importance of making sure the data is there to go back to. There’s no reason to reinvent wheels, but improving them is vital.”

HarassMap at SciDataCon on their Data Management Project with IDRC

This post comes from Reem Wael, Director of HarassMap http://harassmap.org/en/: Reem was assisted in part by CODATA and GEO to attend SciDataCon and contributed a paper to a session on ‘Data Sharing in a Development Context: The experience of the IDRC Data Sharing Pilot’ http://www.scidatacon.org/2016/sessions/56/

HarassMap launched five years ago with the mission to end social acceptability of sexual harassment in Egypt. This mission, unexpectedly, led to the accumulation of a lot of data coming from both online and offline sources, and the more we grow the more data we have. Our methodology is to combine online and offline work to achieve our mission and therefore we crowdsource reports on sexual harassment and through our social media outlets, and we receive information from outreach activities and trainings. We analyze this information and give it back to the community in the form of research reports, public campaigns, trainings and policies.

A few years ago, we started receiving requests from researchers who are working on topics that cross-cut with sexual harassment, to access our data. We responded to these requests by providing an excel sheet with the downloaded crowdsourced reports, but this was the limit of our assistance. When an opportunity came along to design and implement a data management plan, supported by IDRC, it was very relevant to our needs. IDRC is an international research organization and was interested in exploring how grantees can have their data more open to the public.

The main point that IDRC focused on is openly sharing the data. However, when we started to work on the project, we realized that the earlier stages are more challenging; which data do we store and how? We have a massive amount of data accumulating in the last five years. Other than the crowdsourced reports and the reports that we receive on social media, we also own a huge library of photographs and video footage, reports from trainings, evaluations from trainings that reflect the impact that we had, reports from outreach activities, and social media posts and replies. We reached some decisions in the planning phase and we are continuing to make these decisions as we move on.

We formed a ‘data management team’ from HarassMap staff who works on research and data and we tried to identify the data that we want to collect, organize and share, raising the following questions: why are we sharing data, and with whom? How can we organize it in a way that would be helpful to researchers, or others who request access to the data? Are there any ethical issues that we need to consider while sharing the data? These questions brought up some challenges. We were not sure what kind of data would be interesting to researchers, for instance. We found that even though crowdsourced reports are more coveted by researchers the more interesting data is the discussions on social media (our posts, including all the comments that we get) in addition to field reports. This data mirrors and tracks the development of myths and misconceptions of sexual harassment, especially when analyzed over a long span of time as it can show if a difference in attitudes and opinions on sexual harassment had occurred.

Embarking upon data management showed some challenges as well. One is a linguistic/technical challenge especially with the crowdsourced reports as we receive them in both English and Arabic. Privacy was a challenge regarding HarassMap’s library of photos and videos since it shows a lot of volunteers since 2010 from whom we did not take consent to share their photos publicly. We did not find an ethical problem with publishing and sharing crowdsourced reports because they are all anonymous, and we also filter them to remove any information that can hold us legally liable such as accusations against people or places by name.

That said, we are now in the process of accumulating and organizing data from the last five years, and putting it on our web server. The next phase – sharing – has its share of challenges. The first and most important is that we must have some kind of screening over who uses our data for various reasons; sometimes researchers completely misuse the data which puts HarassMap in a bad position. For instance claiming that crowdsourced reports reflect ‘hotspots’ of sexual harassment is essentially flawed yet a widely used claim. We always assert that crowdsourced reports provide biased data because there is a huge difference to access to internet and technology based on the affluence of the area and therefore receiving reports from a specific area doesn’t necessarily mean that harassment is more prevalent there, it may mean that people have better access and more knowledge about reporting. At other times, researchers have taken the data without giving credit to HarassMap; and some researchers have asked for the data and then disappeared without informing us of what they wrote.

Being part of this project has benefited HarassMap greatly not only because we started thinking about the idea of sharing our data on searchable engine, but also because we did not know the amount of data that we possess in the first place until we started looking for it. While putting our data completely public is something that HarassMap is still hesitant to do, we are definitely happy to provide researchers and other interested parties with data in a format that is more user friendly.

Humans of Data 4

“So when I was a kid, obviously Star Trek was the thing, because it was our better selves in the 23rd century. Civil rights, women’s rights, all those issues that were happening at that time in the 1960s were simplified in that show. But the thing that got me was the computer. Spock would have this conversation: ‘Computer, what is this thing? What was the global temperature in 1934?’ And there was always an answer. My start with data was looking at how instruments recorded it. As I’ve started to get into managing people, writing code, I’ve realised that we’re the people in someone else’s past. If we don’t get it right, they will suffer. They’ll ask the question, and the computer won’t have an answer. These people are all trying to get to that better 23rd century. It’s slow progress, baby steps. But being able to make sense of the research results that we take now, consolidating that, is really important to me.”

Open Data as a Moving Target: What Does it Take to Allow Reuse?

By Irene Pasquetto

As we all know too well, making all scientific data technically and legally accessible to all researchers is an ambitious task complicated by constantly evolving social and technical barriers. It is fair to say that we are making progresses in this direction. At Scidatacon 2016, we examined several concrete solutions that can facilitate openness of scientific data or, if you prefer, make sure data are FAIR (findable, accessible, interoperable and reusable).

However, it seems that the more we learn about how to make data open, the least we know about how exactly data will be reused by the scientific community, which means by the researchers who generated the data and should have a primary interest in accessing it. Very few empirical studies exist on the extent to which open data are used and reused once deposited in open repositories.

The fulcrum of the problem is that data take many forms, and are produced, managed and used by diverse communities for different purposes. Nevertheless, different stakeholders (publishers, data curators, digital librarians, funders, scientists etc.) bear competing points of view on the kind of policies, values, and infrastructural solutions necessary to make data open. During a session moderated by Christine Borgman (UCLA) and titled “How, When, and Why are Data Open? Competing Perspectives on Open Data in Science”, Matthew Mayernik (National Center for Atmospheric Research), Parsons Mark (National Snow and Ice Data Center) and Irene Pasquetto (UCLA – Center for Knowledge Infrastructures) presented on some of those challenges that make the use and reuse of “open data” such a complicated and heterogeneous process.

Mayernik argued that the integration of the Internet into research institutions has changed the kinds of accountabilities that apply to research data. On one hand, open data policies expect researchers to be accountable for creating data and metadata that support data sharing and reuse in a broad sense, in many cases, to any possible digital user in the world. On the other hand, providing accounts of data practices that satisfy every possible user is in most cases impossible.

In his talk, Parsons effectively showed that data access is an ongoing process, not a one-time event. Parsons and his team examined how the data repositories products and their curation have evolved over time in response to environmental events and increasing scientific and public demand over several decades. The products have evolved in conjunction with the needs of a changing and expanding designated user community. In other words, Parsons’ case study shows that it is difficult to predict the users of a data service because new and unexpected audiences (with specific needs) could emerge at any time. Parsons also argued that, for this reason, “data generators” may not be the best individuals to predict future uses of their own data.

Because open data users change over time, it is also necessary to built open repositories that provide data in formats flexible enough to allow different approaches to data analysis and integration, for different audiences. This was the point made by Pasquetto, whose case study is a consortium for data sharing in craniofacial research, with a focus on the subfield of developmental/evolutionary biology that recently adopted genomics approaches to knowledge discovery. Pasquetto found flexible data integration to be a necessary precursor to using and reusing data. “Data integration work” is the most contested and problematic task faced by the community, where data need to be integrated at two or more levels and these levels require extensive collaboration between engineers, biologists, and bioinformaticians.

Borgman also presented a paper on the beneath of Ashley Sands, who recently graduated from the department of Information Studies at UCLA and is now senior program officer at the Institute of Museum and Library Services in Washington DC. This talk examined characteristics of openness in the collection, dissemination, and reuse of data in two astronomy sky survey case studies: the Sloan Digital Sky Survey (SDSS) and the Large Synoptic Survey Telescope (LSST). Discussion included how the SDSS and LSST data, and datasets derived from the projects by end users, become available for reuse. Sands found that the rate at which data are released, the populations to which the data are made open, the length of time data creators plan to make the data available, the scale at which these endeavors take place, and the stages of these two projects all have great impact to the extent in which data and then reused.

Moral of the story: open data is a fast moving target. In order to enable reuse, data repositories better start to run.

Humans of Data 3

“I find it relaxing to work with data. I’m a mathematician by training and much more into applied mathematics, so I find recursive formulas very relaxing and linear algebra is like a fun puzzle, like a crossword. I like problem solving. ‘Big data’ is an excellent field for problem solving. I like finding elegant solutions to complex problems. I approach problem solving slightly off-kilter from others – I would often get weird grades in school, but it also means that if people give me problems they’re struggling with, I could look at it and come up with something different from them. This is my first data science meeting. I’m enjoying the opportunity and being around mathematicians and database people and folks who get excited by data. And I’m pleased that there are other women I can talk to.”

Humans of Data 2

“One of the coolest thing is starting out as a student in the research data management field, being early in my career, and then being able to interact with the same people over time. I feel like I’m kind of growing up as an individual. I feel I can say, hey, you guys made an impact on what I do, and now I can give back.”

Humans of Data 1

“I think you need to express yourself the way you feel you should, because what really matters at this conference is that we’re all interested in making data available, accessible and preserving it, and we shouldn’t feel that we have to sacrifice who we are in part or whole, in order to do our work.

I hear far more people who are complimentary about the way I dress than not, so it’s not like it’s problematic. But it shouldn’t matter anyway. We have to just keep being who we are, and the other people will catch up.”

CODATA Blog

News from the CODATA community and from Simon Hodson, CODATA Executive Director