ISI-CODATA Big Data Workshop as Word Clouds

This post was written by Shiva Khanal, Research Officer with the Department of Forest Research and Survey in Nepal. Shiva was one of the international scholars sponsored by CODATA to attend the ISI-CODATA International Training Workshop on Big Data.

This March, two associated events were co-organized by the Committee on Data for Science and Technology (CODATA) and the Indian Statistical Institute (ISI): the International Seminar on Data Science (19-20 March 2015) and the ISI-CODATA International Training Workshop on Big Data (9-18 March 2015). Held at ISI Bangalore, India, the events covered a wide range of talks and presentations on big data, with presenters from diverse backgrounds including academia, business, and data science.

One way to visualize the focus of the program is to plot the terms that appeared most frequently. I obtained the schedules of presentations and tutorials for the seminar (http://drtc1.isibang.ac.in/datascience/schedule.html) and the training workshop (http://drtc1.isibang.ac.in/bdworkshop/schedule.html) and generated a word cloud using the R package wordcloud.

The R code, along with the Dropbox link to the data, is provided below. Pasting it into the R console will produce the word cloud shown here:

[Image: ISI-CODATA word cloud]

Here is the code:

###########################################

# load packages
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()
library(tm)            # Corpus(), TermDocumentMatrix(), stopwords()

# read the presentation titles, one string per line
textf <- readLines("http://dl.dropboxusercontent.com/u/111213395/text_file_presentation_titles.txt")

# build a corpus with one document per line
text_corpus <- Corpus(VectorSource(textf))

# create the term-document matrix, cleaning the text as it is tokenized
tdm <- TermDocumentMatrix(text_corpus,
      control = list(removePunctuation = TRUE, removeNumbers = TRUE,
                     stopwords = stopwords("english"), tolower = TRUE))

# total frequency of each term, most frequent first
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# Dark2 palette without its first color
pal <- brewer.pal(6, "Dark2")
pal <- pal[-1]

# plot the word cloud and save as png
png("test.png", width = 3.25, height = 3.25, units = "in", res = 1200)
wordcloud(d$word, d$freq, scale = c(4, .3), min.freq = 2,
          random.order = TRUE, random.color = TRUE, rot.per = .15, colors = pal)
dev.off()

###########################################
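To make the counting step less of a black box, here is a minimal base-R sketch of roughly what the term-document matrix does before the cloud is drawn: lowercase the text, strip punctuation and numbers, drop stop words, and count what is left. The sample titles, the short stop-word list, and the function name term_freq are made up for illustration; they are not from the workshop data or from tm.

```r
# Hypothetical sample titles; stand-ins for the lines of the real schedule file.
titles <- c("Big Data Analytics for Science",
            "Statistical Methods for Big Data",
            "Data Visualization in R")

# A tiny illustrative stop-word list (the list returned by tm::stopwords() is much longer).
stop_words <- c("for", "in", "the", "of", "and")

# term_freq is a hypothetical helper, not part of tm or wordcloud.
term_freq <- function(lines, stop_words) {
  txt <- tolower(lines)
  txt <- gsub("[[:punct:][:digit:]]+", " ", txt)     # remove punctuation and numbers
  tokens <- unlist(strsplit(txt, "[[:space:]]+"))    # split into words
  keep <- nchar(tokens) >= 3 & !(tokens %in% stop_words)  # drop very short words and stop words
  sort(table(tokens[keep]), decreasing = TRUE)       # count, most frequent first
}

freq <- term_freq(titles, stop_words)
d_sketch <- data.frame(word = names(freq), freq = as.vector(freq))
print(d_sketch)
```

The resulting data frame has the same two columns (word, freq) as d in the code above, so it could be passed to wordcloud() in the same way.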

I also created a word cloud from Twitter using the hashtag #isibigdata (about 100 tweets in total). Unlike the text-based word cloud above, the Twitter extraction required a little customization (setting up credentials for a twitteR session), but otherwise the plotting code is almost the same.

[Image: #isibigdata tweet cloud]
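The tweet extraction code itself is not shown in this post. As a rough sketch of the kind of customization tweets tend to need, the base-R snippet below strips URLs, @mentions, hashtags and the retweet marker before the text goes into the corpus. The sample tweets and the function name clean_tweets are hypothetical; in a live session the text would presumably come from the twitteR package (for example setup_twitter_oauth() followed by searchTwitter()), which is an assumption rather than something confirmed here.

```r
# Hypothetical tweets standing in for the ~100 collected for #isibigdata.
tweets <- c("RT @someone: Great talk on big data at #isibigdata http://t.co/abc123",
            "Learning R word clouds #isibigdata")

# clean_tweets is an illustrative helper, not part of twitteR or tm.
clean_tweets <- function(x) {
  x <- gsub("http[^[:space:]]+", " ", x)  # drop URLs
  x <- gsub("@[[:alnum:]_]+", " ", x)     # drop @mentions
  x <- gsub("#[[:alnum:]_]+", " ", x)     # drop hashtags, including the search tag itself
  x <- gsub("\\bRT\\b", " ", x)           # drop the retweet marker
  x <- gsub("[[:space:]]+", " ", x)       # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_tweets(tweets)
print(cleaned)
```

The cleaned vector can then take the place of textf in Corpus(VectorSource(...)), with the remaining steps unchanged. Dropping the search hashtag avoids one term dominating the cloud simply because every tweet contains it.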