ISI-CODATA Big Data Workshop as Word Clouds

This post was written by Shiva Khanal, Research Officer with the Department of Forest Research and Survey in Nepal. Shiva was one of the international scholars sponsored by CODATA to attend the ISI-CODATA International Training Workshop on Big Data.

This March, two associated events were co-organized by the Committee on Data for Science and Technology (CODATA) and the Indian Statistical Institute (ISI): the International Seminar on Data Science (19-20 March 2015) and the ISI-CODATA International Training Workshop on Big Data (9-18 March 2015). The events, held at ISI Bangalore, India, covered a wide range of talks and presentations related to big data, with presenters from backgrounds as diverse as academia, business, and data science.

One way to visualize the focus of the program is to plot the terms that appeared most frequently. I obtained the schedules of presentations and tutorials for the seminar (http://drtc1.isibang.ac.in/datascience/schedule.html) and the training workshop (http://drtc1.isibang.ac.in/bdworkshop/schedule.html) and generated a word cloud using the R package wordcloud.

The R code, along with a Dropbox link to the data, is provided below. Pasting it into the R console produces the word cloud shown here:

[Image: ISI-CODATA word cloud of seminar and workshop presentation titles]

Here is the code:

###########################################

# load packages
library(tm)
library(wordcloud)
library(RColorBrewer)

# read the presentation details
textf <- readLines("http://dl.dropboxusercontent.com/u/111213395/text_file_presentation_titles.txt")

# build a corpus from the lines of text
text_corpus <- Corpus(VectorSource(textf))

# create a term-document matrix, applying transformations
# (lower-casing; removal of punctuation, numbers, and stopwords)
tdm <- TermDocumentMatrix(text_corpus,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stripWhitespace = TRUE,
                                         stopwords = stopwords()))

# count term frequencies and sort in decreasing order
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# colour palette (drop the first colour)
pal <- brewer.pal(6, "Dark2")
pal <- pal[-1]

# plot the word cloud and save as png
png("test.png", width = 3.25, height = 3.25, units = "in", res = 1200)
wordcloud(d$word, d$freq, scale = c(4, .3), min.freq = 2,
          random.order = TRUE, random.color = TRUE,
          rot.per = .15, colors = pal)
dev.off()

###########################################

I also created a word cloud from Twitter using #isibigdata (about 100 tweets in total). Unlike the text-based word cloud above, the Twitter extraction required a little customization (setting up credentials for a twitteR session), but otherwise the plotting code is almost the same.
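For anyone who wants to reproduce that step, here is a minimal sketch of the twitteR session setup and tweet extraction. It assumes you have registered a Twitter application and hold the four OAuth credentials (the placeholder strings below are not real); the hashtag and tweet count follow the post, and from tweet_text onwards the corpus and plotting steps are the same as in the code above.

###########################################

# load packages
library(twitteR)

# authenticate; replace the placeholders with your own app credentials
setup_twitter_oauth(consumer_key    = "YOUR_CONSUMER_KEY",
                    consumer_secret = "YOUR_CONSUMER_SECRET",
                    access_token    = "YOUR_ACCESS_TOKEN",
                    access_secret   = "YOUR_ACCESS_SECRET")

# pull roughly the ~100 #isibigdata tweets mentioned above
tweets <- searchTwitter("#isibigdata", n = 100)

# extract the tweet text; feed this to Corpus(VectorSource(...)) as before
tweet_text <- sapply(tweets, function(t) t$getText())

###########################################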

[Image: #isibigdata tweet cloud]

How do we describe nanomaterials? A Uniform Description System

Nanomaterials are substances containing hundreds of thousands to hundreds of millions of atoms (with leeway at both ends of that range) that have rapidly evolved into important substances with the potential to impact virtually every area of our physical world: health, food, engineering, transportation, and energy. In many cases, nanomaterials are already part of our world.

Unlike “normal” engineering materials, there is no accepted way to describe nanomaterials such that we know exactly which nanomaterial is being discussed, reported on, regulated, bought, or put into a commercial product. Simple chemical nomenclature does not suffice: nanomaterials have solid-like aspects beyond normal chemistry. Systems used for engineering materials such as ceramics, metals, and polymers do not capture the nanoscale features of form, size, surfaces, etc., that impart special properties to nanomaterials.

Nanomaterials are also of intense interest to many disciplines, from chemistry, physics, and materials science to food, medicine, and nutrition. The user communities are equally diverse: researchers, product designers, purchasers, regulators, health experts, and many others need to discuss and describe nanomaterials accurately.

So what can be done to describe nanomaterials in a way that meets the needs of this diversity of scientific disciplines, user communities, and nanomaterials themselves? During the last three years, the CODATA-VAMAS Working Group (WG) on Nanomaterials has worked on a multi-disciplinary, multi-user-community, international basis to develop a Uniform Description System for Materials on the Nanoscale (UDS). Version 1.0 of the UDS has recently been released and is publicly available for use, comment, and download at www.codata.org/nanomaterials.

The UDS has two primary goals. The first is to describe a nanomaterial uniquely, so that it is differentiated from all other nanomaterials. The second is to determine whether two instances of a nanomaterial are equivalent (the same) to whatever degree desired. The UDS was designed to meet both goals, as well as the needs of different disciplines and user communities, for as many types of nanomaterials as possible. The CODATA-VAMAS WG built upon the work of many groups, including standards committees, database developers, and nanomaterials researchers.
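To make those two goals concrete, here is a purely illustrative sketch in R (matching the code earlier in this archive) of a minimal nanomaterial record and an equivalence check. The field names and the tolerance-based comparison are my own assumptions for illustration, not the actual UDS categories, which are defined in the released Version 1.0 document.

###########################################

# Hypothetical illustration only: these fields are invented for this
# sketch and are NOT the actual UDS categories
# (see www.codata.org/nanomaterials for the real description system).

# a minimal record describing one instance of a nanomaterial
nano_record <- function(composition, shape, diameter_nm, coating) {
  list(composition = composition,  # core chemical composition
       shape       = shape,        # nanoscale form, e.g. sphere or rod
       diameter_nm = diameter_nm,  # characteristic size in nanometres
       coating     = coating)      # surface functionalization
}

# goal 1: a sufficiently detailed description differentiates materials
a <- nano_record("TiO2", "sphere", 21, "none")
b <- nano_record("TiO2", "sphere", 25, "none")

# goal 2: decide equivalence "to whatever degree desired";
# here the degree is a size tolerance in nanometres
equivalent <- function(x, y, size_tol_nm = 5) {
  identical(x$composition, y$composition) &&
    identical(x$shape, y$shape) &&
    identical(x$coating, y$coating) &&
    abs(x$diameter_nm - y$diameter_nm) <= size_tol_nm
}

equivalent(a, b)                    # TRUE at a 5 nm tolerance
equivalent(a, b, size_tol_nm = 1)   # FALSE at a stricter tolerance

###########################################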

What’s next for the UDS? The most important next step is to see how the UDS compares with other schemes presently used to build databases of nanomaterial properties. This involves workshops bringing database builders together with the CODATA-VAMAS team to compare approaches. For further information, contact John Rumble at rumble@udsnano.org.

Science Journalists Learn about the Data Revolution for the Sustainable Development Goals

This post comes from Alex de Sherbinin, Chair, CODATA Task Group on Global Roads Data Development, and Associate Director for Science Applications, CIESIN, Columbia University.

On 16-18 February I had the opportunity to join a distinguished group of science journalists for the 2nd Kavli Forum, organized by the World Federation of Science Journalists (WFSJ). WFSJ director Damien Chalaud requested that CODATA attend the workshop, and in discussions with Damien, his colleague Veronique Morin, and Lee Holtz, science editor for the Wall Street Journal, we outlined the contours of a talk. More on that below. But first a bit of background.

The meeting was organized to engage science journalists from major outlets (the BBC, Science, Nature, Scientific American, National Public Radio, and major US network news) in the broad topic of data journalism. Talks by journalists, computer scientists, and researchers focused on tools that help journalists cut through the huge volumes of information available to them and analyze data on their own. Example tools include Metro Maps, a tool designed to distill complex stories into interlinked story lines, and the Overview Project, a content-analysis tool designed to help journalists sift through mountains of electronic documents looking for story leads (e.g., the NSA documents leaked by Edward Snowden). Others introduced terms such as “geojournalism” (using online mapping and data analysis tools to tell the story of environmental change) and computational journalism (using computer programming to uncover stories).

The range of stories that had been uncovered, or at least told better, through data journalism was impressive. Stanford professor and journalist Cheryl Philips described using publicly accessible records of infrastructure assessments done by the department of transportation in Washington state (USA) to map the most vulnerable bridges and to tell the story behind a bridge that collapsed, killing several people. John Bohannon of Science Magazine used iPython coding to send a fake journal article to close to 200 open-access journals in a sting operation that uncovered the lack of peer review of a clearly flawed article.

I was given the distinct honor of being the keynote speaker at the opening dinner. I used the Sustainable Development Goals (SDGs) of the United Nations’ post-2015 development agenda as a foundation on which to build an argument about the importance of the Data Revolution for sustainable development. CIESIN has been involved in the effort to compute the price tag for monitoring the goals as a contribution to the Sustainable Development Solutions Network, so we have had a front-row seat in assessing the data needs.

The data revolution can be characterized as having two main elements: open data and big data. To build the case for open data, I described a few cases where environmental monitoring and data networks were either insufficient or in danger of falling apart owing to lack of funds and inattention, including two water examples: the river gauge network of the Global Runoff Data Center (GRDC) and the UNEP-GEMS station-level water quality monitoring network. I pointed out that even in the case of air quality, which increasingly can be monitored from space, there is a need for ground validation based on in situ monitoring networks. I also described the benefits of open government data, and how such data have been found to stimulate economic growth and generate greater tax revenues than old-school approaches of selling data.

I then turned to the promise and limitations of big data, aided by a useful primer by Emmanuel Letouzé of the DataPop Alliance. My central argument was that big data, defined by Letouzé as data emanating from our increasing use of digital devices, crowdsourcing, and online transactions, together with increasing computational sophistication and a community of analysts, has tremendous promise, but can never hope to fully supplant well-funded and well-functioning traditional data-gathering systems such as census bureaus, national statistical offices, and environmental monitoring networks.

The discussion afterwards explored these issues and also enabled me to offer the journalists some pointers to data sources as they seek to employ data in their professional work.