This post was written by Shiva Khanal, Research Officer with the Department of Forest Research and Survey in Nepal. Shiva was one of the international scholars sponsored by CODATA to attend the ISI-CODATA International Training Workshop on Big Data.
This March, two associated events were co-organized by the Committee on Data for Science and Technology (CODATA) and the Indian Statistical Institute (ISI): the International Seminar on Data Science (19-20 March 2015) and the ISI-CODATA International Training Workshop on Big Data (9-18 March 2015). Held at ISI Bangalore, India, the events covered a wide range of talks and presentations on big data, with presenters from diverse backgrounds including academia, business and data science.
One way to visualize the focus of the program is to plot the terms that appeared most frequently. I obtained the schedules of presentations and tutorials for the seminar (http://drtc1.isibang.ac.in/datascience/schedule.html) and the training workshop (http://drtc1.isibang.ac.in/bdworkshop/schedule.html) and generated a word cloud using the R package wordcloud.
The R code, along with the Dropbox link to the data, is provided below. Pasting it into the R console produces the word cloud and saves it as test.png:
###########################################
#load packages
library(wordcloud)
library(tm)
# read the presentation titles (one element per line of the text file)
textf <- readLines("http://dl.dropboxusercontent.com/u/111213395/text_file_presentation_titles.txt")
# build a corpus from the character vector
text_corpus <- Corpus(VectorSource(textf))
# create the term-document matrix and apply transformations
tdm <- TermDocumentMatrix(text_corpus,
                          control = list(removePunctuation = TRUE,
                                         stripWhitespace = TRUE,
                                         stopwords = stopwords(),
                                         PlainTextDocument = TRUE,
                                         removeNumbers = TRUE,
                                         tolower = TRUE))
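# total frequency of each term across all lines, most frequent first
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
# six colours from the Dark2 palette, dropping the first one
pal <- brewer.pal(6, "Dark2")
pal <- pal[-(1)]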
# plot the word cloud and save it as a png
png("test.png", width = 3.25, height = 3.25, units = "in", res = 1200)
wordcloud(d$word, d$freq, scale = c(4, .3), min.freq = 2,
          random.order = TRUE, random.color = TRUE,
          rot.per = .15, colors = pal)
dev.off()
###########################################
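To see what kind of table the `d` data frame holds before it is plotted, the counting step can be reproduced in base R on a few lines of text. This is only a sketch under stated assumptions: the three titles are made up for illustration and are not from the workshop schedule, the helper name `term_freq` is hypothetical, and the short stopword vector is a stand-in for the much longer list returned by `stopwords()` in tm.

```r
# Illustrative titles only; NOT taken from the actual schedule file.
titles <- c("Big Data Analytics in Practice",
            "Statistical Learning for Big Data",
            "Data Visualization with R")

# Hypothetical helper: lower-case the text, replace punctuation and digits
# with spaces, split on whitespace, drop stopwords and very short terms,
# then count the remaining terms in decreasing order of frequency.
term_freq <- function(lines,
                      stop = c("in", "for", "with", "the", "of", "and")) {
  txt   <- tolower(lines)
  txt   <- gsub("[[:punct:][:digit:]]+", " ", txt)
  words <- unlist(strsplit(txt, "[[:space:]]+"))
  # keep terms of three or more characters, as tm does by default
  words <- words[nchar(words) >= 3 & !(words %in% stop)]
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(counts), freq = as.vector(counts),
             stringsAsFactors = FALSE)
}

d_demo <- term_freq(titles)
head(d_demo, 2)
```

With these made-up titles, "data" (3 occurrences) and "big" (2) come out on top, which is the kind of ranking the word cloud turns into font sizes.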
I also created a word cloud from Twitter using the hashtag #isibigdata (about 100 tweets in total). Unlike the text-based word cloud above, the Twitter extraction required a little customization (setting up credentials for a twitteR session), but otherwise the plotting code is almost the same.
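The Twitter step might look roughly like the sketch below. It is not the code used for this post: the helper names `clean_tweets` and `fetch_tweets` are hypothetical, the four credential arguments are placeholders for keys from your own Twitter application, and the fetching function is defined but never called, so nothing here contacts Twitter. It assumes the twitteR functions `setup_twitter_oauth`, `searchTwitter` and `twListToDF`, and a Twitter API that still accepts them.

```r
# Hypothetical helper: strip links, @mentions, retweet markers and the '#'
# sign so that only ordinary words reach the word cloud.
clean_tweets <- function(x) {
  x <- gsub("https?://[^[:space:]]+", " ", x)   # links
  x <- gsub("@[[:alnum:]_]+", " ", x)           # user mentions
  x <- gsub("\\bRT\\b", " ", x, perl = TRUE)    # retweet marker
  x <- gsub("#", "", x, fixed = TRUE)           # keep the hashtag word itself
  trimws(gsub("[[:space:]]+", " ", x))          # collapse repeated spaces
}

# Hypothetical fetching step, wrapped in a function so that it only runs
# when you call it with your own credentials (all four are placeholders).
fetch_tweets <- function(consumer_key, consumer_secret,
                         access_token, access_secret, n = 100) {
  library(twitteR)
  setup_twitter_oauth(consumer_key, consumer_secret,
                      access_token, access_secret)
  tweets <- searchTwitter("#isibigdata", n = n)
  clean_tweets(twListToDF(tweets)$text)
}

# The cleaning step on a made-up tweet:
clean_tweets("RT @codata Big data talks at #isibigdata http://t.co/abc")
# returns "Big data talks at isibigdata"
```

The character vector returned by `fetch_tweets` could then take the place of `textf` in the script above, leaving the corpus, term-document matrix and `wordcloud` calls unchanged.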