Provenance

Reliable and rich provenance and processing information is essential to Interoperability and Reusability. PROV-O (the W3C provenance ontology) provides a basis, but it is very general, and more detailed and specific implementations are needed. DDI-CDI features such an implementation, which will need further refinement.

Two conference workshops have been held to survey technical implementations (FAIR Convergence Symposium, Dec 2020) and the approach to provenance taken in a range of domains (RDA P17, April 2021).

Discussions are underway about next steps. The aim is to explore provenance issues and implementations in various domains and case studies, ultimately leading to a more sophisticated cross-domain implementation of PROV-O (i.e. an implementation that can carry more information but remains sufficiently generic to be used as a basis for provenance in different domains). The draft proposal is reproduced below, but as of November 2021 it has not yet been possible to take this forward.

Proposal for a General Guideline for Documenting Process, Provenance, and Workflow

Background

Several interesting approaches have been presented and discussed in recent months, and that discussion continues. This document briefly outlines the current state of play and proposes a further step towards a practical approach to documenting and sharing provenance information, in line with the presentations and discussions which have taken place.

Practices in some domains are rapidly evolving, and it will be necessary to build on these. There is also growing demand for sharing data and provenance information across domains. The proposed effort will primarily be concerned with the exchange of such information between and among domains. This document outlines a starting point for setting that work in motion, by providing a reference model to support further discussion.

Perspectives on Provenance and Process Description

There seem to be two related but distinct threads to this discussion thus far:

Researcher perspective: Some approaches are concerned with the activities that produce the input data for an experiment or analysis, and with the workflow from that point through to the analysis outputs that support findings. WholeTale gives us an example of this in its “recorded runs,” and other solutions also provide a mechanism for describing this part of the research process. Once documented, this kind of provenance is an essential building block in understanding the data, the processes, the experimental context, and the analysis.
Archival perspective: Infrastructure players are often involved in the management of data and process descriptions after the research itself is complete. Providing data and other resources for reuse also involves the processing of those resources, and these processes can also be significant. Further, it is often a requirement at this point to have a broader view: data can be collected and used in research, producing data which is then reused in further research, and so on. Provenance chains go beyond the scope of a single experiment or analysis. In some cases, data and methods are not reusable for reasons of non-disclosure and may even be changed to support reuse (e.g., public use versions of data). This longer provenance chain must also be documented, even in those cases where processes are not research processes but are conducted for other purposes. Unlike the description of a single analysis, it is very likely that such provenance chains will involve many different organizations and activities, spread across time. Provenance description must therefore be in a form which can be extended as the provenance chain grows.
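The requirement that a provenance description can be extended as the chain grows can be illustrated with a minimal sketch. It assumes an append-only set of statements that use three core PROV-O properties (prov:used, prov:wasGeneratedBy, prov:wasAssociatedWith). The class ProvenanceGraph, its methods, and all "ex:" identifiers are hypothetical and are not taken from WholeTale, DDI-CDI, or any other implementation discussed here.

```python
from dataclasses import dataclass, field

PROV = "http://www.w3.org/ns/prov#"  # W3C PROV namespace


@dataclass
class ProvenanceGraph:
    """Hypothetical append-only store of (subject, predicate, object) statements."""
    triples: set = field(default_factory=set)

    def add(self, subject, predicate, obj):
        self.triples.add((subject, PROV + predicate, obj))

    def record_step(self, activity, agent, inputs, outputs):
        """Describe one processing step using core PROV-O properties."""
        self.add(activity, "wasAssociatedWith", agent)
        for item in inputs:
            self.add(activity, "used", item)
        for item in outputs:
            self.add(item, "wasGeneratedBy", activity)

    def upstream(self, entity):
        """Return every entity that `entity` depends on, directly or indirectly."""
        found, todo = set(), [entity]
        while todo:
            current = todo.pop()
            for s, p, activity in self.triples:
                if s == current and p == PROV + "wasGeneratedBy":
                    for s2, p2, source in self.triples:
                        if s2 == activity and p2 == PROV + "used" and source not in found:
                            found.add(source)
                            todo.append(source)
        return found


graph = ProvenanceGraph()
# Steps recorded by different organizations at different times (illustrative names).
graph.record_step("ex:survey-2019", "ex:statistics-office", [], ["ex:raw-microdata"])
graph.record_step("ex:anonymisation", "ex:archive", ["ex:raw-microdata"], ["ex:public-use-file"])
# A later research step extends the chain without altering earlier statements.
graph.record_step("ex:analysis", "ex:research-group", ["ex:public-use-file"], ["ex:results-table"])

lineage = graph.upstream("ex:results-table")
print(sorted(lineage))
```

Because each step only adds statements, a later organization can extend the chain without rewriting what earlier organizations recorded, and the lineage of any output can still be traced back across all of them.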

Building on Existing Work towards a Practical Recommendation

Even without looking forward to the use of processes and workflow descriptions as repeatable templates for automating systems, it makes sense that both of these perspectives be supported by a single, overarching model. Many pieces of such a model have been developed in different projects. Many of them assume implementations of the W3C PROV ontology. Most (but not all) are developed from the perspective of a specific domain or style of research.

Specifically, we have the ability to describe specific functions (e.g., SDTL, VTL), to record workflows and computing environments, to describe the interactions of processes with data and other resources (e.g., ProvONE, DDI-CDI), and to apply schemes for recognizing the variation across domains (also based on PROV-O).

What is lacking is an examination of these components to identify a single agreed approach which could potentially be shared between domains. Such a recommendation would likely take the form of a guideline and might usefully be focused on the baseline established by the PROV-O specification.

Conceptually, this would be a high-level generic model of research, suitable for specialization according to domain. While PROV-O gives us the needed model for describing any provenance whatsoever, we might envision a still-generic use of that model to describe research as an activity, which would bring us one step closer to being able to understand processes across domain boundaries. It would be, in essence, a reference model, and would likely result in something similar to ProvONE but supported by other methods for exchanging provenance descriptions generically.
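The idea of a generic layer between PROV-O and domain-specific vocabularies can be sketched as a small class hierarchy. Everything in this example is an assumption made for illustration: the "ref:", "social:" and "climate:" prefixes and all class names are invented and do not come from ProvONE, DDI-CDI, or any existing reference model. Only prov:Activity is a real PROV-O class.

```python
# Hypothetical hierarchy, child -> parent. Only "prov:Activity" is a real PROV-O class.
SUBCLASS_OF = {
    "ref:ResearchActivity": "prov:Activity",
    "ref:DataCollection": "ref:ResearchActivity",
    "ref:Analysis": "ref:ResearchActivity",
    "social:SurveyFieldwork": "ref:DataCollection",  # domain specialization
    "climate:ModelRun": "ref:Analysis",              # domain specialization
}


def ancestors(cls):
    """Yield cls and each of its superclasses, most specific first."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)


def generic_type(cls, prefix="ref:"):
    """Map a domain-specific class to its nearest class in the reference model."""
    for candidate in ancestors(cls):
        if candidate.startswith(prefix):
            return candidate
    return None


# Records from two different domains (illustrative identifiers).
records = [
    ("ex:wave-3-interviews", "social:SurveyFieldwork"),
    ("ex:ensemble-run-12", "climate:ModelRun"),
]

# A consumer outside both domains can still read each record at the generic level.
summary = {name: generic_type(cls) for name, cls in records}
all_prov_activities = all("prov:Activity" in ancestors(cls) for _, cls in records)
print(summary, all_prov_activities)
```

In this sketch, each domain keeps its own terms, while a reader from another domain can fall back to the shared reference classes, and ultimately to plain PROV-O, when the specialized term is unfamiliar.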

Methods for the specialization of that generic model to describe activities within a specific domain have already been developed. Further, the compilation of provenance chains – even those which span domains – is possible using an approach such as the one proposed in the DDI-CDI work. The proposed task would be to assemble these existing pieces, and to see if an agreed approach for their coordinated use could be developed. The existence of such guidance, even if only as a reference point, could help to push forward the emerging picture of how process and provenance can be usefully described to support FAIR sharing.

Next Steps

The next steps would be to convene a relatively small group, based on constructive contributions to the Birds of a Feather sessions held at the RDA Plenary, to get their feedback on this suggestion and work towards an agreement to take it forward.

As of November 2021, this work is on hold until sufficient effort is available to prioritise it.