
Comparing Approaches to the Description of Nanomaterials

This post comes from John Rumble, chair of the CODATA Working Group on the Description of Nanomaterials, and Egon Willighagen, of Maastricht University and the eNanoMapper Project.

The CODATA Working Group (WG) on Nanomaterials held a small workshop with members of the eNanoMapper (http://www.enanomapper.net/) team at Maastricht University in the Netherlands, on 13-14 July 2015. The purpose of the meeting was to clarify and compare approaches to the description of nanomaterials, specifically the CODATA WG’s Uniform Description System for Materials on the Nanoscale (UDS) (http://dx.doi.org/10.5281/zenodo.20688) and the eNanoMapper Ontology, as described in the Journal of Biomedical Semantics (2015) (http://dx.doi.org/10.1186/s13326-015-0005-5).

At the workshop, all major UDS information categories, subcategories, and descriptors were reviewed in detail and compared to the eNanoMapper ontology. Several issues were identified as requiring revision, including better description of the measurement methods used to characterize aspects such as size, shape, and size distribution; the role of topology; and differentiation among different types of nanomaterial stability. The results of this workshop will be combined with recommendations made at a similar workshop held at the Universities at Shady Grove in the United States on 10-11 June 2015, to guide revision of version 1.0 of the UDS. The next version should be available in early September and will include downloadable tables and sources of definitions, where applicable.

Immediately following the Maastricht workshop, CODATA WG representatives met at the Quality Nano Conference held in Crete. Of particular interest was the description of coatings and coronas for individual nano-objects, especially with respect to their reproducibility (or randomness) and predictability.

For further information about the CODATA Nanomaterials WG, please contact John Rumble at rumble@udsnano.org. For further information about eNanoMapper, please contact Egon Willighagen at egon.willighagen@gmail.com.

Climate Change is everyone’s responsibility: we need community initiatives

This post is by Elizabeth Griffin, chair of the CODATA Data at Risk Task Group, and co-chair of the related RDA Interest Group on Data Rescue.

An international conference on Our Common Future under Climate Change in Paris? A remarkable choice of venue as it turned out: climate change was pronounced, the temperature swinging from 39ºC to 21ºC faster than you could keep track, while the freshness and friskiness of the wind suggested autumn in early July. Longer-term aspects of the changes, those insidious underlying trends that we only perceive by reference to accurate past data, demanded the busy attention of some 2000 delegates from all over, their specialities and contributions disaggregated into some 165 parallel sessions classified according to their bearings on data, interpretations, or solutions.

Gems were admixed with frustrations. The jewelry to be gained by connecting disparate strands was tarnished by their division into so many parallel sessions that some deserving nuggets had to glister unseen, limiting the effectiveness of each message. The delegates had messages enough – the list of session titles and contents was ample witness to that – but giving each space by subdividing the sessions where no division was obvious or suitable did not serve the speakers well. Neither did dividing the conference physically between three locations separated by 15-20 minutes of healthy street walking or less-healthy metro riding – presumably inevitable for reasons of available space, but not ideal from the aspect of the conference. Any venture outside was as good as a continent away, and must have deterred many.

What did we hear that was novel, unexpected, heart-warming or frightening? Plenty in all categories. The evidence of climate change, particularly in the polar regions, is incontestable to any but the deliberately blind, though interpretations are too model-dependent to represent the cast-iron situation that politicians seem to need. Heart-warming were the intimate projects, evidenced in numerous posters, describing heroic local efforts to challenge what the world now threatens them with. Even gender seemed to play a convincing role in countries like Bangladesh, where women have little control over the deployment of resources.

But the many theoretical solutions for “reducing carbon emissions” through science-based policies and managed economies were neither novel nor unexpected, and none tackled adequately the fundamental question: Who or what can be defined as responsible for actually driving AND SUSTAINING climate change? If the correct answer is (as I interjected) Everyone, then that is where solutions must begin. But why did the solution thus stated appear so radical that it merited recording for use in students’ summer school on climate change? Are the emperor’s new clothes really so hard to see?

Transition Town Totnes

How can that message get to where it is certainly needed but still blithely ignored? Some hypocrisy was in evidence at the conference: lunch came in modular plastic cups destined for a landfill, and the registration fee included items we neither needed nor wanted. Is the future something that “civilized” humanity would rather give in to than face? Is all this just a theoretical exercise, always someone else’s problem?


Time is running out. Tipping-point is approaching. Tomorrow is nearly here. But there is hope – the psychology of Community can break the vicious circle when moral issues are introduced into the solution, such as in Transition Towns like Totnes (UK) and mini-eco communes (e.g., in Cornwall, UK). The nuggets of excellence and example created by local communities will become the jewels of our own future provided that (a) we manage them effectively, and (b) we start acting now.

CODATA Recommended Values of the Fundamental Physical Constants: 2014

This post comes from David Newell, Chair of the CODATA Task Group on Fundamental Physical Constants.

The compilation of the 2014 self-consistent set of values of the basic constants and conversion factors of physics and chemistry recommended by the Committee on Data for Science and Technology (CODATA) for international use has recently been completed by its Task Group on Fundamental Constants (TGFC). The new values are available from arXiv (arXiv:1507.07956v1), from the CODATA Zenodo Collection http://dx.doi.org/10.5281/zenodo.22826 and from the NIST website on ‘Constants, Units and Uncertainty’, physics.nist.gov/constants. They are based on a multivariate least-squares adjustment that takes into account all data available through 31 December 2014.

Table 1: CODATA Recommended Values of the Fundamental Physical Constants: 2014

As a working principle, the validity of the physical theory underlying the adjustment is assumed. This includes special relativity, quantum mechanics, quantum electrodynamics (QED), the standard model of particle physics, including CPT invariance, and the exactness of the relationships between the Josephson and von Klitzing constants K_J and R_K and the Planck constant h and elementary charge e. Although the possible time variation of the constants continues to be an active field of both experimental and theoretical research, to date there has been no confirmed observation of a variation relevant to the data on which the 2014 recommended values are based.

2014 Values: Reduction in Uncertainty

A significant number of new results became available for consideration, both experimental and theoretical, from 1 January 2010, after the closing date of the 2010 adjustment, through 31 December 2014, the closing date of the 2014 adjustment. Overall, the recommended values of the fundamental constants have become more accurate, i.e., their uncertainties have decreased, due to improved measurements and theoretical calculations of the measurable quantities that depend on the constants. For example, compared to the CODATA 2010 recommended values (a short sketch after the list shows how such relative uncertainties are computed),

  • The 2014 value of the relative atomic mass of the electron Ar(e) has a relative uncertainty of 2.9 × 10⁻¹¹, a reduction in uncertainty by a factor of 14.
  • The 2014 value of the fine-structure constant α has a relative uncertainty of 2.3 × 10⁻¹⁰, a reduction by a factor of 1.4.
  • The 2014 value of the Newtonian constant of gravitation G has a relative uncertainty of 4.7 × 10⁻⁵, a reduction by a factor of 2.6.
  • The 2014 value of the Planck constant h has a relative uncertainty of 1.2 × 10⁻⁸, a reduction by a factor of 3.7.
  • The 2014 value of the Boltzmann constant k has a relative uncertainty of 5.7 × 10⁻⁷, a reduction by a factor of 1.6.
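For concreteness, here is a minimal R sketch of how such a relative uncertainty is obtained, using the published CODATA 2014 value of G, 6.674 08(31) × 10⁻¹¹ m³ kg⁻¹ s⁻², where the parenthesised digits give the standard uncertainty in the last digits of the value:

# CODATA 2014: G = 6.674 08(31) x 10^-11 m^3 kg^-1 s^-2
G  <- 6.67408e-11    # recommended value
uG <- 0.00031e-11    # standard uncertainty, same units
signif(uG / G, 2)    # relative uncertainty: 4.7e-05, matching the list above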

Challenges with Input Data for the Proton RMS Electric Charge Radius rp

As is the case with all CODATA adjustments of the fundamental physical constants, a major challenge is the treatment of discrepant input data. For the 2014 adjustment the input data for the proton root-mean-square (rms) electric charge radius rp remained a puzzle. The value of rp from muonic hydrogen experiments has an uncertainty more than a factor of 10 smaller than that of the combined value of rp from electron-proton scattering data and hydrogen (H) and deuterium (D) spectroscopy, yet the two values differ by more than five times the uncertainty of their difference, or “5σ” (see the short sketch below). To address this discrepancy, the TGFC invited the principal investigators and experts involved with the experiments and theory related to muonic hydrogen, electron-proton scattering, and H and D spectroscopy to its 2014 meeting held 3-4 November 2014 at the International Bureau of Weights and Measures (BIPM). Based on the advice of these experts, it was decided not to include the muonic hydrogen results in the 2014 adjustment (see details and minutes of the November 2014 meeting). To help address other input-data issues, CODATA co-organized a workshop on 1-6 February 2015 in Eltville, Germany, where various issues with the analysis of the electron-proton scattering data as well as with some of the acoustic gas thermometry data were resolved (see https://indico.gsi.de/conferenceDisplay.py?confId=2742).
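To make the “5σ” statement concrete, the following minimal R sketch computes the tension between the two determinations; the input numbers are the published muonic-hydrogen result, 0.84087(39) fm, and the combined electron-based value, 0.8751(61) fm, quoted here purely for illustration:

# tension between the two proton-radius determinations
r_mu <- 0.84087; u_mu <- 0.00039   # muonic hydrogen (fm)
r_e  <- 0.8751;  u_e  <- 0.0061    # e-p scattering + H/D spectroscopy (fm)
u_diff <- sqrt(u_mu^2 + u_e^2)     # uncertainty of the difference
(r_e - r_mu) / u_diff              # about 5.6, i.e. more than 5 sigma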

2018 CODATA Recommended Values and the New SI

CODATA is presently preparing for its major role in a significant revision of the International System of Units (SI) scheduled for adoption in the fourth quarter of 2018. This “New SI” will be based on exact numerical values for h, e, the Boltzmann constant k, and the Avogadro constant NA (for more information, see: http://www.bipm.org/en/measurement-units/new-si/). In 2011, at its 24th meeting, the General Conference on Weights and Measures (CGPM) invited CODATA to continue to provide least-squares adjusted, recommended values of the fundamental constants, h, e, k, and NA in particular, since these values will be those used for the revised SI. Because of the good progress made in both experiment and theory since the 31 December 2010 closing date of the 2010 CODATA adjustment, the uncertainties of the 2014 recommended values of h, e, k and NA are already at the level required for the adoption of the revised SI by the 26th CGPM in the fall of 2018. The formal road map to redefinition includes a special CODATA adjustment of the fundamental constants with a closing date for new data of 1 July 2017 in order to determine the exact numerical values of h, e, k, and NA that will be used to define the New SI. A second CODATA adjustment with a closing date of 1 July 2018 will be carried out so that a complete set of recommended values consistent with the New SI will be available when it is formally adopted by the 26th CGPM. Ordinarily the closing date for the regularly scheduled CODATA adjustment carried out every four years―in this case the 2018 CODATA adjustment―would have been 31 December 2018. However, the normal date has been advanced by six months so that the 2018 set of CODATA recommended values will not only be consistent with the New SI, but also be ready for use at the very moment the New SI becomes a reality.

Mark Thorley Appointed to the Committee on Freedom and Responsibility in the Conduct of Science

This post is by Mark Thorley, a member of the CODATA Executive Committee and Chair of the CODATA Data Policy Committee.

It is an honour and a privilege to be invited to become a member of the International Council for Science’s Committee on Freedom and Responsibility in the Conduct of Science (CFRS).

The CFRS serves as the guardian of the ICSU Principle of the Universality of Science and undertakes a variety of actions to defend scientific freedoms and to promote integrity and responsibility in the conduct of science. The universality of science in its broadest sense is about developing a truly global scientific community on the basis of equity and non-discrimination. It is also about ensuring that science is trusted and valued by societies across the world. As such, it incorporates issues related to the conduct of science; capacity building; science education and literacy; the relationship between science and society; and crucially, equitable access to data and information and other resources for research.

My contribution to the work of the CFRS will be built on my broad experience and expertise in open access, research data management and related policy development and implementation. Key to this is my understanding of the landscape of open access and research data, from the technical through to the strategic, and the resulting implications for researchers and research organisations.

The provision of open access to scientific results is an important part of the responsibility of science in the modern age, both as a hedge against scientific fraud and as a means of developing science as a public good. Open research data are both a prerequisite for open and transparent research, and an opportunity for new areas of research and innovation. I am keen that through the work of the CFRS and CODATA we can develop robust and pragmatic guidance for the research community on how to ensure that research data are, wherever possible, made openly available for use by others in a manner consistent with relevant legal, ethical and regulatory frameworks and norms.

I have been working in research data management since 1990 and in open access policy development and implementation since 2006, and am currently Head of Science Information for the UK’s Natural Environment Research Council. I am one of the Research Councils UK policy leads in Open Access and scholarly communications, where I have been prominent in the development and implementation of RCUK’s Open Access policy. I am a member of the advisory board for the Nature Publishing Group journal ‘Scientific Data’, and was one of the experts who contributed to the recent ICSU statement on Open Access to scientific data and literature and the assessment of research by metrics. I also helped develop the OECD’s Principles and Guidelines for Access to Research Data from Public Funding.

I am also a member of the Executive Committee of CODATA, and I see this appointment very much as recognition of the work that I and colleagues within CODATA have done to develop the agenda of open research data.

My appointment to the CFRS is for three years from this October.

More information about the work for the CFRS is available from the ICSU website (http://www.icsu.org/freedom-responsibility/cfrs).

Preparation Underway of the Second Edition of the Atlas of the Earth’s Magnetic Field

Alena Rybkina

This post is by Alena Rybkina, a member of the CODATA Executive Committee, of the CODATA Early Career Data Professionals Group, and a participant in the Task Group on Earth and Space Science Data Interoperability.

The ‘Task Group on Earth and Space Science Data Interoperability’ (TG-ESSDI) published an electronic version of the first edition of the Atlas of the Earth’s Magnetic Field in 2013. It includes a unified set of physical, geographic, thematic, and historical materials for a detailed study of the geomagnetic field from 1500 to 2010. The Atlas is intended for a wide range of scientists, teachers, students and experts in applied areas relating to the geosciences, including geologists and geophysicists studying geomagnetism. It is a unique cartographic product that offers a comprehensive and scientifically grounded characterisation of geomagnetic phenomena, presenting the results of historical and modern studies of the Earth’s magnetic field.

Atlas of the Earth’s Magnetic Field, First Edition, 2013

In May 2015, with support from CODATA and hosted by ICSU, TG-ESSDI organised an international workshop on ‘The Atlas of the Earth’s Magnetic Field. Second Edition’. The meeting was designed as a launch event for the project to create the second edition of the Atlas, with the aim of significantly extending its content. The list of participants included specialists in geomagnetic studies, with representatives from the Institut de Physique du Globe de Paris, the Commission for the Geological Map of the World (CGMW), the International Association of Geomagnetism and Aeronomy (IAGA), and the International Union of Geodesy and Geophysics (IUGG), among others.

The workshop resulted in a document that is expected to become the basis for the content of the Second Edition. It was agreed to include the following new or extended chapters:

  • Current knowledge of the magnetic field. Main field. From ~8000 to 2020.
  • Swarm data (the satellite mission was launched by ESA in November 2013)
  • Regional-scale maps (including Arctic and Antarctic maps)
  • Magnetic fields of the Solar System: the Sun, Mars and other planets
  • Applications of magnetic field data (drilling, navigation, GPS, dykes, etc.)

Anomalies of the Earth’s crust

A further meeting was organised to coincide with the 26th General Assembly of the International Union of Geodesy and Geophysics (IUGG), which was held from 22 June to 2 July 2015 in Prague, Czech Republic. The conference was characterised by its central theme: ‘Earth and Environmental Sciences for Future Generations’. A presentation about the Second Edition of the Atlas was given by Alena Rybkina, and feedback was received through subsequent discussions and working meetings organised by TG-ESSDI. As a result of these discussions, it was decided, for example, to extend the historical chapter of the Atlas and include historical charts from Spain, from the magnetic observatory of Barcelona, as well as from South Africa, the Czech Republic and other locations. Thus the Second Edition of the Atlas will extend its geographic reach and become an even more important and valuable project for the magnetic field and earth data community.

Data at Risk and Data Rescue

This post is by Elizabeth Griffin, chair of the CODATA Data at Risk Task Group, and – as she explains below – now co-chair of the related RDA Interest Group on Data Rescue.

The CODATA Data at Risk Task Group (DAR-TG) has suddenly got much larger! It has now become affiliated to the Research Data Alliance, through the formation of an RDA Interest Group for “Data Rescue”.

The combined group (known as IG-DAR-TG) shares all the same scientific principles, the same objectives (and even the same ‘language’) as its natal CODATA Task Group for “Data At Risk”; the two Groups will maintain their own identities within the merged affiliated one, but share the benefits of the two supporting organisations.

The topic of “Data Rescue” is becoming recognized as vitally important to researchers, particularly in matters of climate change and global warming. Just about every scientific study can benefit substantially from being able to access its heritage data at some point, for some purpose.

Data Rescue involves two strands of data management:

  1. the recovery and digitization of analogue data – those too historic to have been born-digital; and
  2. adding essential value to archives of (mostly early) electronic data – metadata, format information, access.

Accounts of the successful recovery and upgrading or digitization of older data brim over with the unique scientific benefits, which then improve the sort of modelling that is critical for predicting future conditions. It’s a win-win situation, so why are “Data Rescue” initiatives even necessary? This is:

  1. because the challenges of extracting data and information from outmoded, analogue technology can be considerable;
  2. because this process generally requires important information about the platform and mode of gathering data that can be very hard to reconstruct; and,
  3. because such initiatives also need to counter a general disbelief that they even exist or could ever be scientifically useful. It is important to counter the widespread assumption that all data of value are born-digital.

We hope and plan that the new Data Rescue initiative will wake the world up to the huge potential waiting to be recovered! Please join us, either through the CODATA Task Group “Data At Risk” or the RDA branch “Data Rescue”.

Rescuing Legacy Data for Future Science

This post is by Elizabeth Griffin, chair of the CODATA Data at Risk Task Group.

Every two years, climate scientists at Elsevier (New York) and IEDA (Integrated Earth Data Applications, Columbia University) jointly support and award the International Data Rescue Award in the Geosciences for the best project that describes the ‘rescue’ of heritage data in the context of the geosciences. The result of the competition for 2014-15 was announced at the annual meeting of the European Geosciences Union (EGU), held in April in Vienna. The strength and scope of the competition had increased significantly since the 2013 one.

A shortlist of four was announced in Vienna, with three receiving ‘honourable mention’. The winner of the 2015 International Data Rescue Award in the Geosciences was British Macrofossils Online, a Jisc-funded project from the British Geological Survey to create a fully electronic catalogue of all the fossil collections in UK museums and similar repositories. The project team consisted of Mike Howe, Caroline Buttler, Dan Pemberton, Eliza Howlett, Tim McCormick, Simon Harris and Michela Contessi, working alongside a number of other contributors.

During the same ceremony, a Special Issue of an online journal, GeoResJ, was launched: it was given over entirely to descriptions of data rescue projects, and featured a six-page introductory article by the CODATA “Data At Risk” Task Group (DAR-TG) team, entitled ‘When are Old Data New Data?’.

Opening the meeting in Vienna, Dr Elizabeth Griffin (Chair, DAR-TG) explained and illustrated the considerable scientific importance of recovering scientific information that was recorded before the electronic age, and what CODATA (through its TG) was attempting to do towards stimulating many more data recovery efforts. The visibility which the evening afforded to the DAR-TG and to CODATA itself was very valuable, the event presenting a memorable complement of rationale, endeavour and achievement. Open publicity of this nature is one of the goals of the DAR-TG; it is essential for spreading the word about undertaking and (then) coordinating efforts to bring archived, nearly lost, or almost unreadable data back into full service.

CODATA Roads Task Group at State of the Map 2015

This post comes from Alex de Sherbinin, chair of our Global Roads TG and Associate Director for Science Applications at the Center for International Earth Science Information Network (CIESIN), an environmental data and analysis center within The Earth Institute at Columbia University.

The CODATA Global Roads Data Development Task Group was well represented at the State of the Map US 2015 conference in New York City on 6-7 June. State of the Map US is an annual confab of OpenStreetMap (OSM) mapping enthusiasts, this year with representatives from 41 countries present. OSM has had impressive growth in coverage and detail in the decade since its launch, and is increasingly being seen as an authoritative data source, much as Wikipedia has rivaled traditional encyclopedias for content and currency. When Steve Coast, founder of OSM, joined our workshop in 2007 the promise of OSM was clear, but streets were largely mapped only in urban areas of Europe and the US. Now the map is global, though OSM still lags in some developing regions. Efforts are being made through Missing Maps and MapGive to rectify this situation. But the unfortunate reality is that it seems the best route to growing coverage in low-income countries is to experience a natural or humanitarian disaster, since this focuses attention on the huge need for better transportation data in these countries.

State of the Map US 2015 was held at the United Nations building in New York City, June 6-8. Source: (c) yellowbkpk, CC BY 2.0 license.

There were several signs of the evolution of OSM from a small coterie of mapping enthusiasts to a moving force in the mapping community. One was the conference venue: United Nations headquarters. Another was the corporations represented. A growing ecosystem of companies – Mapzen (a Samsung incubator project), Mapillary, DigitalGlobe, and Esri – had booths and helped sponsor the meeting, and, more importantly, are building services off of OSM. According to one presenter, Google Maps and OSM helped to drop the bottom out of the digital road map market almost overnight, significantly depreciating the value of the data held by companies such as Navteq and TeleAtlas. The emphasis now is on services built on the data. A third sign of OSM’s importance was the high-level representation from the US government, including the Chief Technology Officer of the Obama Administration, Megan Smith, and the Chief Geographer of the US Agency for International Development, Carrie Stokes. The Department of Transportation and the USGS were also represented.


Presentations focused on community efforts to build the map in regions such as Fukushima, Japan, or to QA/QC maps in the US (compared to TIGER/Line files). There were also plenty of presentations on new “add-on” tools for map digitization and services based on OSM. Mapillary, for example, is enlisting volunteers to use their smart phones to video the roads they drive on and upload the footage to the company, where it is converted into the equivalent of Google Street View, but without the 360-degree coverage.

Together with Paola Kim-Blanco, I presented a lightning talk on “Validation and Assimilation of OSM Data for the Global Roads Open Access Data Set” and organized a breakout group on the same topic. The main point of our presentation was to make the case for greater validation of the data in terms of spatial accuracy, attribute information, coverage, and completeness, especially in the world’s poorest regions. We illustrated this by showing data for West Africa. In terms of spatial accuracy, the OSM data are generally pretty good – in the range of 30-50m offsets from high resolution Google Earth imagery, which is itself around 5m from “ground truth” (see Ubukawa et al. 2014). But the coverage varies widely. Comparing data for May 2014 and May 2015, we found that the data for Ebola-affected countries grew by 250% (or 3.5 times on average) compared to 50% for non-Ebola-affected countries, but there are still large gaps in spatial coverage for both (a minimal sketch of this kind of growth comparison appears below). And the greatest growth often occurred in unclassified roads – which means we don’t know if they are cart-tracks or paved primary roads. This reflects the fact that most mappers digitize from high resolution imagery and cannot always distinguish among road classes.
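For readers who want to reproduce such a growth comparison, here is a minimal R sketch using the sf package; the shapefile names are placeholders for whatever OSM road extracts you have for the two dates, and EPSG:32628 (UTM zone 28N) is assumed as a suitable metric projection for the region:

# compare total mapped road length between two OSM snapshots
library(sf)

roads_2014 <- st_read("osm_roads_2014-05.shp")   # hypothetical extract, May 2014
roads_2015 <- st_read("osm_roads_2015-05.shp")   # hypothetical extract, May 2015

# total length in km, measured in a metric projection
total_km <- function(x) sum(as.numeric(st_length(st_transform(x, 32628)))) / 1000

growth <- total_km(roads_2015) / total_km(roads_2014)
cat(sprintf("network grew %.1f times (+%.0f%%)\n", growth, (growth - 1) * 100))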

Our breakout session drew about 20 participants who were interested in this topic, and we hope to generate protocols for validation that might engage members of the OSM community. Once data have been validated, we hope to assimilate them into gROADS. Choosing when to assimilate data may be a challenge, though, as growth in the network in the poorer countries still depends heavily on whether there is an organized push to collect data or, in the worst case, a disaster.

Find out more

Follow State Of The Map US on Twitter

OpenStreetMapUS website

Follow OpenStreetMapUS on Twitter


GYA, CODATA-ECDP and Open Science

This post comes from Xiaogang (Marshall) Ma, a core member of the CODATA Early Career Data Professionals Group (ECDP). He was a winner of one of the inaugural World Data System Stewardship Awards at SciDataCon 2014. Marshall is an Associate Research Scientist at Rensselaer Polytechnic Institute, specialising in Semantic eScience and Data Science. Check out his RPI homepage here.

During May 25-29, 2015, the Global Young Academy (GYA) held the 5th International Conference for Young Scientists and its Annual General Meeting at Montebello, Quebec, Canada. I attended the public day of the conference on May 27, as a delegate of the CODATA Early Career Data Professionals Working Group (ECDP).

The GYA was founded in 2010 and its objective is to be the voice of young scientists around the world. Members are chosen for their demonstrated excellence in scientific achievement and commitment to service. Currently there are 200 members from 58 countries, representing all major world regions. Most GYA members attended the conference at Montebello, together with about 40 guests from other institutions, including Prof. Gordon McBean, president of the International Council for Science, and Prof. Howard Alper, former co-chair of IAP: the Global Network of Science Academies.

GYA issued a position statement on Open Science in 2012, which calls for scientific results and data to be made freely available to scientists around the world, and advocates ways forward that will transform scientific research into a truly global endeavour. Dr. Sabina Leonelli of the University of Exeter, UK, is one of the lead authors of the position statement, and also a lead of the GYA Open Science Working Group. A major objective of my attendance at the GYA conference was to discuss future opportunities for collaboration between CODATA-ECDP and GYA. Besides Sabina, I also met Dr. Abdullah Tariq, another lead of the GYA Open Science WG, and several other members of the GYA executive committee.

The discussion was fruitful. We raised the possibility of an interest group on Global Open Science within CODATA; of having a few members join both organizations; of proposing sessions on the diversity of conditions under which open data work around the world, perhaps for the next CODATA/RDA meeting in Paris or later meetings of that type; of collaborating on business models for data centers; and of reaching out to other organizations and working groups on open data and/or open science.

GYA is an active group both formed and organized by young people, and I was happy to see that Open Science is one of the four core activities that GYA is currently promoting. I would encourage ECDP and CODATA members to explore the details of GYA activities on the GYA website, http://www.globalyoungacademy.net, and to propose future collaborations to promote topics of common interest in open data and open science.

ISI-CODATA Big Data Workshop as Word Clouds

This post was written by Shiva Khanal, Research Officer with the Department of Forest Research and Survey in Nepal. Shiva was one of the international scholars sponsored by CODATA to attend the ISI-CODATA International Training Workshop on Big Data.

This March, two associated events were co-organized by the Committee on Data for Science and Technology (CODATA) and the Indian Statistical Institute (ISI): the International Seminar on Data Science (19-20 March 2015) and the ISI-CODATA International Training Workshop on Big Data (9-18 March 2015). The events, held at ISI Bangalore, India, covered a wide range of talks and presentations related to big data, with presenters from diverse backgrounds such as the academic community, the business sector and practising data scientists.

One way to visualize the focus of the program is to plot the terms that appeared most frequently. I obtained the schedule of presentations and tutorials for the seminar (http://drtc1.isibang.ac.in/datascience/schedule.html) and the training workshop (http://drtc1.isibang.ac.in/bdworkshop/schedule.html) and generated a word cloud using the R package wordcloud.

The R code, along with the Dropbox link to the data, is provided below. Pasting it into the R console will give the word cloud shown here:

isi_codata_word_cloud

Here is the code:

###########################################

# load packages
library(tm)
library(wordcloud)   # also attaches RColorBrewer, which provides brewer.pal()

# read the presentation and tutorial titles, one per line
textf <- readLines("http://dl.dropboxusercontent.com/u/111213395/text_file_presentation_titles.txt")

# turn the character vector into a corpus
text_corpus <- Corpus(VectorSource(textf))

# create the term-document matrix, cleaning the text on the way in
tdm <- TermDocumentMatrix(text_corpus,
                          control = list(removePunctuation = TRUE,
                                         stripWhitespace = TRUE,
                                         stopwords = stopwords(),
                                         removeNumbers = TRUE,
                                         tolower = TRUE))

# word frequencies, sorted from most to least frequent
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# colour palette; drop the first (lightest) colour
pal <- brewer.pal(6, "Dark2")
pal <- pal[-1]

# plot the word cloud and save as png
png("test.png", width = 3.25, height = 3.25, units = "in", res = 1200)
wordcloud(d$word, d$freq, scale = c(4, .3), min.freq = 2,
          random.order = TRUE, random.color = TRUE,
          rot.per = .15, colors = pal)
dev.off()

###########################################

I also created a word cloud from Twitter using #isibigdata (about 100 tweets in total). Unlike the text-based word cloud above, the Twitter extraction required a little customization (setting up credentials for a twitteR session), but otherwise the plotting code is almost the same.
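For completeness, here is a minimal sketch of that setup, assuming the twitteR package; the four OAuth keys are placeholders you would obtain from your own Twitter developer account:

# authenticate and fetch the tweets; the keys below are placeholders
library(twitteR)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET",
                    "ACCESS_TOKEN", "ACCESS_SECRET")
tweets <- searchTwitter("#isibigdata", n = 100)
tweet_text <- sapply(tweets, function(t) t$getText())
# from here, build the corpus and plot exactly as in the code above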

isibigdata tweet cloud