Reflections on Data Science and the role of the CODATA Data Science Journal from IDW and SciDataCon 2025, Brisbane, 15 October 2025, by Gita Yadav, DSJ Editorial Board, Matthew Mayernik, DSJ Editor-in-Chief, and Mark Parsons, past DSJ Editor-in-Chief.
What is the evolving role of data scientists in ensuring that automation and AI serve the long-term goals of science, integrity, and openness?
This question shaped a lively and deeply reflective session at the 2025 International Data Week (IDW) in Brisbane, Australia, organised by members of the CODATA Data Science Journal (DSJ) editorial board together with partners from CODATA, the World Data System (WDS), and the International Science Council (ISC).
What made this session particularly rewarding was the high level of engagement from both speakers and audience participants, who actively examined practical pathways to support a new generation of research-aware, ethically grounded, and infrastructure-integrated data scientists. Their discussions demonstrated that the evolution of data science is not only a technical or institutional shift but also a shared community project, one that redefines how knowledge, responsibility, and innovation intersect in the age of intelligent automation.
The session explored how data scientists must adapt to ethical, infrastructural, and community-centric expectations, and how organisations such as CODATA, WDS, ISC, and the Research Data Alliance (RDA) can collectively guide this evolution. Speakers reflected on the historical roots of data science while identifying forward-looking strategies for building capacity, governance, and trust in the expanding data ecosystem.
Background: Data Science as a Field in Motion!
The Data Science Journal was launched by CODATA in 2002 as a peer-reviewed venue to publish, share, and preserve knowledge on data-focused topics. As far as can be determined, it was the first scholarly journal to include the term “data science” in its title. In a retrospective essay, founding editor F. Jack Smith reflected that the title was initially contentious, with some members of the CODATA Publications Committee concerned that “data science” might be misunderstood. Ultimately, the committee agreed that “it was up to CODATA to ensure that it became understood” (Smith, 2023).
Over two decades later, that mission remains both prophetic and relevant. Data science has evolved dramatically, becoming central to academic research, policy, and industrial practice. The term itself has multiplied in meaning, encompassing everything from machine learning and analytics to data curation, visualization, and ethics. Today, as the volume, variety, and velocity of data continue to expand, the field faces another inflection point, while data veracity has become a critical fourth “V” in the landscape.
In an era of intelligent automation, large language models (LLMs), and ubiquitous sensing, data science has become not only a technical discipline but also a pillar of policy, infrastructure, and ethics. Yet, as the IDW 2025 meeting made clear, the scientific community places distinct demands on data science, emphasizing provenance, traceability, transparency, and FAIR principles that are not always prioritized in commercial or governmental applications.
This tension raises an important question: What does it now mean to be a data scientist working in and for science, rather than for profit or production?
Mark Parsons’ talk elucidated this tension very effectively. Figure 1 is a screenshot of his half-century plot depicting how the data ecosystem has continuously evolved, from the establishment of the World Data Centres (1952) to the launch of the Data Science Journal (2002) and the creation of the Research Data Alliance (2013). Alongside these milestones, Mark also placed linguistic trends that reveal the rise of data science and data management and the gradual decline of information science, signalling a profound conceptual and cultural shift.
Figure 1: Tracing the Arc of Data Science (1952–2022): This historical trajectory by Mark Parsons in his opening talk at the session illustrates how scientific data work has grown from information handling to knowledge stewardship. The Data Science Journal emerged as both a marker and a driver of this shift, grounding new technical vocabularies in the ethics and infrastructures of open science.
Insights from the Session
Evolution of Roles
Over the past three decades, data work has evolved from isolated roles (data managers, analysts, curators) into more integrated and interdisciplinary practice. In the 1990s, data management was often seen as a technical support activity. In the 2000s, as open data initiatives gained momentum, the need for stewardship became clearer: data were of little use unless structured, documented, and discoverable.
The 2010s witnessed large-scale infrastructure investments for sharing and integrating heterogeneous data. Now, in the 2020s, this infrastructure is being reconfigured in light of new AI/ML capabilities, emerging data governance policies, and community-driven data movements. Gitanjali Yadav highlighted this evolution in her talk, emphasizing that scientific data work is inseparable from issues of responsibility and provenance (Figure 2). The emerging concept of semantic data science extends this arc, emphasizing multilingual, context-aware, and ethically informed practices that keep the human in the loop even as automation advances.
Across all these phases, a recurring insight stands out: Automation can process data, but only humans can interpret meaning.
Figure 2: The Expanding Role of the Data Scientist, by Gitanjali Yadav, reflecting on how the data scientist’s identity has evolved from analyst (focused on computation and inference), to architect (building systems and workflows), to steward of integrity (ensuring ethics, transparency, and trust), making the present-day data scientist a custodian of integrity and a community bridge.
Preservation and Provenance
A major theme during the session was preservation, not just of data, but of the expertise and communities that sustain it. Many repositories worldwide, especially smaller or domain-specific ones, face an existential crisis as funding cycles end or institutional support wanes. The panel emphasized that ownership and stewardship for such repositories should be distributed, balancing control with access, and ensuring that knowledge of the systems themselves is not lost.
Scientific data work depends on formal trust: verification through unambiguous reference to persistently accessible datasets. Provenance graphs, which represent relationships between datasets, people, organizations, instruments, and software, are central to this trust. They enable reproducibility, attribution, and understanding of how knowledge is constructed. Examples such as the #SemanticClimate initiative were cited (see references), showing how semantic annotation and linked data frameworks can connect disparate data sources and narratives in ethically transparent ways. Figure 3, presented on behalf of Dr. Debasisa Mohanty, a member of the CODATA National Committee in India, shows a case study from India demonstrating how large-scale, ethically governed data ecosystems can transform biomedical science. This example underscored the dual role of data scientists as technical innovators and ethical stewards of sensitive information.
Figure 3. AI-Guided Genomic Discovery in India: Screenshot of the Genome India project slide presented by Gitanjali Yadav on behalf of Debasisa Mohanty, depicting AI/ML-guided exploration of genomic diversity across India, discovering over 27 million rare variants, identifying population-specific drug-response markers, and constructing a genome-wide imputation panel that improves precision for Indian genotypes.
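To make the idea of a provenance graph concrete, here is a minimal sketch in Python. It is an illustration only, not something shown in the session: the class, the node identifiers, and the example records are all hypothetical, and the relation names are loosely borrowed from the W3C PROV vocabulary rather than applied rigorously.

```python
class ProvenanceGraph:
    """Tiny directed graph of typed nodes joined by labelled edges (illustrative only)."""

    def __init__(self):
        self.node_types = {}   # node id -> type, e.g. "dataset" or "person"
        self.edges = {}        # source id -> list of (relation, target id)

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, source, relation, target):
        # Refuse edges that mention nodes nobody has declared.
        for node_id in (source, target):
            if node_id not in self.node_types:
                raise KeyError(f"unknown node: {node_id}")
        self.edges.setdefault(source, []).append((relation, target))

    def lineage(self, node_id):
        """Return every node reachable from node_id, i.e. everything it depends on."""
        seen, stack = set(), [node_id]
        while stack:
            current = stack.pop()
            for _, target in self.edges.get(current, []):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


# Hypothetical example: a cleaned dataset derived from raw instrument output.
graph = ProvenanceGraph()
graph.add_node("dataset:raw", "dataset")
graph.add_node("dataset:cleaned", "dataset")
graph.add_node("software:cleaner", "software")
graph.add_node("instrument:sensor", "instrument")
graph.add_node("person:curator", "person")
graph.add_node("org:repository", "organization")

graph.add_edge("dataset:cleaned", "wasDerivedFrom", "dataset:raw")
graph.add_edge("dataset:cleaned", "wasGeneratedBy", "software:cleaner")
graph.add_edge("dataset:cleaned", "wasAttributedTo", "person:curator")
graph.add_edge("dataset:raw", "wasGeneratedBy", "instrument:sensor")
graph.add_edge("person:curator", "actedOnBehalfOf", "org:repository")

cleaned_lineage = graph.lineage("dataset:cleaned")
raw_lineage = graph.lineage("dataset:raw")
```

Walking the graph from the cleaned dataset reaches the raw data, the instrument, the software, the curator, and the repository, which is the kind of traceable attribution the panel described. A production system would more likely use an established model such as W3C PROV with persistent identifiers than a hand-rolled structure like this one.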
Ethics, Equity, and Access
Data derived from people raises enduring tensions between the ideal of openness and the necessity of privacy. Participants discussed methods such as anonymization and statistical obfuscation, which allow analysis without reidentification. A broader consensus emerged that equity precedes ethics: before one can speak of fairness or accountability in AI, data access itself must be equitable. Otherwise, ethical deliberations risk reinforcing existing asymmetries in who gets to produce, own, and use data.
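One common way to reason about reidentification risk is k-anonymity, which the session did not name explicitly; it is used here only as an example of the anonymization methods mentioned above. The sketch below checks whether every combination of quasi-identifier values is shared by at least k records. The function names, field names, and records are hypothetical.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True when every quasi-identifier combination appears in at least k records."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Hypothetical records with ages already generalized into bands.
records = [
    {"age_band": "30-39", "region": "North", "diagnosis": "A"},
    {"age_band": "30-39", "region": "North", "diagnosis": "B"},
    {"age_band": "40-49", "region": "South", "diagnosis": "A"},
    {"age_band": "40-49", "region": "South", "diagnosis": "C"},
    {"age_band": "40-49", "region": "South", "diagnosis": "B"},
]

quasi_identifiers = ["age_band", "region"]
smallest = smallest_group_size(records, quasi_identifiers)   # 2
two_anonymous = is_k_anonymous(records, quasi_identifiers, 2)    # True
three_anonymous = is_k_anonymous(records, quasi_identifiers, 3)  # False
```

Passing such a check does not by itself guarantee privacy: groups in which everyone shares the same sensitive value, or attackers with outside knowledge, can still expose individuals, which is one reason the openness and privacy tension remains unresolved.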
Invisible Labour and Recognition
Several participants highlighted the invisible labour that underpins open science, particularly curation, cleaning, and metadata preparation. This work remains undervalued in traditional reward systems that prioritize publications and citations. As one discussant noted, “Everyone celebrates analysis, but few remember who made the data analyzable.”
This invisibility also shapes the status of data professionals. Data scientists and stewards are often told, “you’re just the data person,” despite their central role in making science interoperable and reproducible. The Data Science Journal and CODATA recognise that much of the world’s data ecosystem is sustained by invisible or undervalued labour, from data curation, annotation, and metadata stewardship to community-driven capacity building and open infrastructure maintenance. These contributions are essential to the integrity and longevity of scientific data but often fall outside traditional academic reward systems.
Looking ahead, the discussion turned to how the Data Science Journal can address these issues, for instance by:
- Encouraging submissions that explicitly acknowledge data stewardship, curation, and technical maintenance as forms of scholarly contribution.
- Promoting author contribution statements and citation practices that make invisible labour visible.
- Partnering with CODATA and WDS initiatives to explore frameworks for credit attribution and recognition of data professionals and infrastructure teams.
- Supporting cross-sector discussions on how open data infrastructures can better align with equity, inclusion, and fair recognition principles.
By doing so, the Data Science Journal aims to model a more just and transparent data culture, one that values all contributors to the data lifecycle, not only those visible in authorship lines or algorithmic outputs.
Rethinking What “Data Science” Means
There was broad agreement that the term data science has become so expansive as to risk losing coherence. It now covers technical, social, and managerial dimensions. The panel argued for a renewed synthesis: data stewardship needs to become more data sciency, integrating computational and AI techniques, while data science must become more stewardship-aware, incorporating principles of provenance, ethics, and transparency.
Participants also noted the value of philosophical perspectives, raised through a reference to Naomi Oreskes’ Why Trust Science?, in understanding how we know what we know. This epistemological reflection is increasingly vital as AI systems are trained on limited, often opaque, corpora of data.
Training as ‘Reciprocal Learning’
The conversation repeatedly returned to the challenge of training and upskilling. Many organizations are hiring data stewards from disciplinary backgrounds rather than data or information science programmes, and vice versa. The result is uneven literacy across both communities.
Some argued it is easier to train scientists in data management than to train data managers in domain science; others disagreed. What was clear is that curricula for both data science and data stewardship could benefit from closer integration, potentially as a future CODATA initiative, and that the term ‘reciprocal learning’ should replace ‘capacity building’ from now on. Embedding data stewards within research teams was highlighted as a successful model: it strengthens stewardship within science and builds new skills within the stewardship community itself.
Looking forward, the Data Science Journal editors and the CODATA executive team in the room discussed with the audience how we could commit to advancing visibility and equity in data science. One route is to champion reciprocal learning as a guiding principle for global data collaboration, moving beyond the traditional notion of “capacity building,” which can imply a one-sided transfer of knowledge. Instead, we emphasise mutual exchange, contextual expertise, and co-creation between diverse data communities. We can also encourage publications and dialogues that highlight the human and infrastructural labour underpinning scientific data.
By embracing both the recognition of invisible labour and the ethos of reciprocal learning, the Data Science Journal seeks to nurture a more just, inclusive, and reflexive data culture, one that recognises data science as a community of practice rather than a hierarchy of expertise.
Global Parallels and Technological Impacts
While automation and AI are disrupting employment patterns worldwide, the panel noted that the challenges and opportunities faced by data professionals are strikingly similar across countries. Early-career researchers tend to adopt AI tools faster, while mid-career professionals face greater displacement risks. As history shows, new technologies often reinforce existing power structures, making it even more important to foreground inclusivity and human judgment in the automation age.
Data Scientists as Connectors
A unifying theme across the session was that data scientists serve as connectors, linking disciplines, infrastructures, and communities. They mediate between automated systems and human interpretation, ensuring that AI augments rather than replaces expertise.
This connective role is both technical and social: it requires understanding algorithms, policies, and people. As automation accelerates, skills such as empathy, interdisciplinarity, and ethical awareness will define the next generation of scientific data professionals.
Concluding Reflections
The panel concluded that the evolution of data science cannot be understood purely through the lens of tools or technologies. It must also be seen as a cultural and institutional transformation in how science organizes, values, and preserves its knowledge.
As automation, AI, and semantic infrastructures reshape the research landscape, the task of data scientists, and the mission of the CODATA Data Science Journal, remains clear: we aim to uphold transparency, equity, and openness in the systems that increasingly govern knowledge itself.
References
- Smith, F. J. (2023). The Launch of the Data Science Journal in 2002. Data Science Journal, 22, 11. https://doi.org/10.5334/dsj-2023-11
- The #SemanticClimate Initiative: A global citizen science movement for extracting knowledge from locked literature. https://semanticclimate.github.io/p/en/
About the Session
Session Title: Evolving Roles for Data Scientists in the Age of Intelligent Automation
Event: International Data Week (IDW) 2025 – SciDataCon 2025, Brisbane, Australia
Date: 24–27 October 2025
Organizers: Data Science Journal Editorial Board (CODATA, ISC, WDS)
Moderator: Dr. Gita Yadav (National Institute of Plant Genome Research, India)
Speakers:
- Matt Mayernik – NSF National Center for Atmospheric Research, USA
- Mark Parsons – Research Data Alliance / Arctic Data Community
- Debasisa Mohanty – Director, National Institute of Immunology, India (presented by Gitanjali Yadav)
- Gitanjali Yadav – National Institute of Plant Genome Research (NIPGR), India, and #semanticClimate
Affiliated Organisations: CODATA, World Data System (WDS), International Science Council (ISC), Research Data Alliance (RDA)
Session Link: https://scidatacon.org/event/9/contributions/40/