Provenance¶
Introduction¶
The rook processes record provenance information about their execution. This information includes:
- used software and versions (rook, daops, …)
- applied operators like subset and average
- used input data and parameters (CMIP6 dataset, time, area)
- generated outputs (NetCDF files)
- execution time (start time and end time)
This information is described with the W3C PROV standard, using the Python PROV library.
Overview of PROV¶
The W3C PROV Primer document gives an overview of the W3C PROV standard.
A PROV document consists of agents, activities and entities. These can be connected via PROV relations like wasDerivedFrom.
Entities¶
W3C PROV: "In PROV, physical, digital, conceptual, or other kinds of thing are called entities."
In rook we use entities for:
- workflow descriptions,
- input datasets and
- resulting output NetCDF files.
Activities¶
W3C PROV: "Activities are how entities come into existence and how their attributes change to become new entities, often making use of previously existing entities to achieve this."
In rook we use activities for:
- operators like subset and average,
- processes like orchestrate to run a workflow.
Agent¶
W3C PROV: "An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place. An agent can be a person, a piece of software or an organisation."
In rook we use agents for:
- software like rook and daops,
- organisations like Copernicus Climate Data Store.
Namespaces¶
W3C PROV: "Using URIs and namespaces, a provenance record can draw from multiple sources on the Web."
We use namespaces to reference terms from existing PROV vocabularies, such as prov:SoftwareAgent. Examples include:
PROV (by W3C): https://www.w3.org/ns/prov/
PROVONE (by DataONE): https://purl.dataone.org/provone/2015/01/15/ontology
dcterms (Dublin Core Metadata): https://dublincore.org/specifications/dublin-core/dcmi-terms/
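As a rough illustration of how prefixes work, the sketch below expands a qualified name such as prov:SoftwareAgent into a full URI. The provone, dcterms and default URIs are copied from the PROV-JSON example further down this page, and prov is the standard W3C PROV namespace. The helper name expand_qname is hypothetical; it is not part of rook or the Python PROV library.

```python
# Prefix map: provone, dcterms and default are copied from the PROV-JSON
# example on this page; prov is the standard W3C PROV namespace URI.
PREFIXES = {
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#",
    "dcterms": "http://purl.org/dc/terms/",
    "default": "http://purl.org/roocs/prov#",
}


def expand_qname(qname, prefixes=PREFIXES):
    """Expand 'prefix:local' to a full URI (hypothetical helper)."""
    prefix, sep, local = qname.partition(":")
    if not sep:
        # No prefix given: fall back to the default namespace.
        return prefixes["default"] + qname
    if prefix not in prefixes:
        raise KeyError(f"unknown namespace prefix: {prefix}")
    return prefixes[prefix] + local


print(expand_qname("prov:SoftwareAgent"))
print(expand_qname("provone:Workflow"))
print(expand_qname("rook"))
```

Unprefixed identifiers such as rook resolve against the default namespace, which is why the agents and activities in the example document need no prefix.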
Subset Example¶
The activity subset is started by the software agent daops (Python library), which was triggered by rook (data-reduction service). The NetCDF file entity tas_day_...nc was derived from the c3s-cmip6 dataset entity using the activity subset.
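The relations described above can be written down as a small PROV-JSON-style structure. The sketch below uses plain dictionaries and the standard json module rather than the Python PROV library, so it only illustrates the shape of the record. The relation identifiers (_:id1, _:id2) are illustrative assumptions, the file name tas_day_...nc is kept elided as in the text, and the "triggered by rook" link is not modelled here.

```python
import json

# A hand-written, PROV-JSON-style record of the subset example.
# Identifiers such as "_:id1" are illustrative, not values produced by rook.
doc = {
    "agent": {
        "rook": {"prov:type": "prov:SoftwareAgent"},
        "daops": {"prov:type": "prov:SoftwareAgent"},
    },
    "activity": {"subset": {}},
    "entity": {
        "c3s-cmip6": {},      # input dataset entity
        "tas_day_...nc": {},  # output file entity, name elided as in the text
    },
    "wasAssociatedWith": {
        # daops is the agent responsible for the subset activity
        "_:id1": {"prov:activity": "subset", "prov:agent": "daops"},
    },
    "wasDerivedFrom": {
        # the output file was derived from the dataset via the subset activity
        "_:id2": {
            "prov:generatedEntity": "tas_day_...nc",
            "prov:usedEntity": "c3s-cmip6",
            "prov:activity": "subset",
        },
    },
}

print(json.dumps(doc, indent=2))
```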
Workflow Example¶
W3C PROV Plans: "Activities may follow pre-defined procedures, such as recipes, tutorials, instructions, or workflows. PROV refers to these, in general, as plans."
In W3C PROV, workflows are therefore called plans.
The activity orchestrate is started by the agent rook. It uses a workflow document entity (plan), which consists of a subset and an average activity. These activities are started by the software agent daops.
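To make the role of the plan concrete, the sketch below groups activities by their responsible agent for a given plan. The wasAssociatedWith relations are copied from the PROV-JSON example further down this page (which uses two subset steps rather than subset and average); the function name activities_by_agent is a hypothetical helper, not part of rook.

```python
from collections import defaultdict

# wasAssociatedWith relations, copied from the PROV-JSON example on this page.
was_associated_with = {
    "_:id2": {
        "prov:activity": "orchestrate",
        "prov:agent": "rook",
        "prov:plan": "workflow",
    },
    "_:id3": {
        "prov:activity": "subset_tas_1",
        "prov:agent": "daops",
        "prov:plan": "workflow",
    },
    "_:id5": {
        "prov:activity": "subset_tas_2",
        "prov:agent": "daops",
        "prov:plan": "workflow",
    },
}


def activities_by_agent(relations, plan):
    """Group the activities that follow `plan` by their responsible agent."""
    grouped = defaultdict(list)
    for rel in relations.values():
        if rel.get("prov:plan") == plan:
            grouped[rel["prov:agent"]].append(rel["prov:activity"])
    return dict(grouped)


print(activities_by_agent(was_associated_with, "workflow"))
# {'rook': ['orchestrate'], 'daops': ['subset_tas_1', 'subset_tas_2']}
```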
Example: Workflow with Subsetting Operators¶
The rooki client for rook has example notebooks for executing processes and displaying the provenance information.
You can run the orchestrate process to execute a workflow with subsetting operators and show the provenance document:
from rooki import operators as ops

wf = ops.Subset(
    ops.Subset(
        ops.Input(
            'tas', ['c3s-cmip6.ScenarioMIP.INM.INM-CM5-0.ssp245.r1i1p1f1.day.tas.gr1.v20190619']
        ),
        time="2016-01-01/2020-12-30",
    ),
    time="2017-01-01/2017-12-30",
)
resp = wf.orchestrate()
# show URLs of output files
resp.download_urls()
# show URL to provenance document
resp.provenance()
# show URL to provenance image
resp.provenance_image()
The response of the process includes a provenance document in PROV-JSON format:
{
  "prefix": {
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#",
    "dcterms": "http://purl.org/dc/terms/",
    "default": "http://purl.org/roocs/prov#"
  },
  "agent": {
    "copernicus_CDS": {
      "prov:type": "prov:Organization",
      "dcterms:title": "Copernicus Climate Data Store"
    },
    "rook": {
      "prov:type": "prov:SoftwareAgent",
      "dcterms:source": "https://github.com/roocs/rook/releases/tag/v0.2.0"
    },
    "daops": {
      "prov:type": "prov:SoftwareAgent",
      "dcterms:source": "https://github.com/roocs/daops/releases/tag/v0.3.0"
    }
  },
  "wasAttributedTo": {
    "_:id1": {
      "prov:entity": "rook",
      "prov:agent": "copernicus_CDS"
    }
  },
  "entity": {
    "workflow": {
      "prov:type": "provone:Workflow"
    },
    "c3s-cmip6.ScenarioMIP.INM.INM-CM5-0.ssp245.r1i1p1f1.day.tas.gr1.v20190619": {},
    "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20160101-20201229.nc": [{}, {}],
    "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20170101-20171229.nc": {}
  },
  "activity": {
    "orchestrate": [{
      "prov:startedAtTime": "2021-02-15T13:24:33"
    }, {
      "prov:endedAtTime": "2021-02-15T13:24:57"
    }],
    "subset_tas_1": {
      "time": "2016-01-01/2020-12-30",
      "apply_fixes": false
    },
    "subset_tas_2": {
      "time": "2017-01-01/2017-12-30",
      "apply_fixes": false
    }
  },
  "wasAssociatedWith": {
    "_:id2": {
      "prov:activity": "orchestrate",
      "prov:agent": "rook",
      "prov:plan": "workflow"
    },
    "_:id3": {
      "prov:activity": "subset_tas_1",
      "prov:agent": "daops",
      "prov:plan": "workflow"
    },
    "_:id5": {
      "prov:activity": "subset_tas_2",
      "prov:agent": "daops",
      "prov:plan": "workflow"
    }
  },
  "wasDerivedFrom": {
    "_:id4": {
      "prov:generatedEntity": "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20160101-20201229.nc",
      "prov:usedEntity": "c3s-cmip6.ScenarioMIP.INM.INM-CM5-0.ssp245.r1i1p1f1.day.tas.gr1.v20190619",
      "prov:activity": "subset_tas_1"
    },
    "_:id6": {
      "prov:generatedEntity": "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20170101-20171229.nc",
      "prov:usedEntity": "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20160101-20201229.nc",
      "prov:activity": "subset_tas_2"
    }
  }
}
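One way to read such a document is to follow the wasDerivedFrom relations backwards from an output file and to compute the run time of an activity from its timestamps. The sketch below does this with the standard library only, on an abbreviated copy of the document above. The helper names lineage and duration_seconds are hypothetical, and the traversal assumes each entity has at most one parent and that the derivation chain contains no cycles.

```python
from datetime import datetime

# Entity names copied from the PROV-JSON document above.
SOURCE = "c3s-cmip6.ScenarioMIP.INM.INM-CM5-0.ssp245.r1i1p1f1.day.tas.gr1.v20190619"
STEP_1 = "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20160101-20201229.nc"
STEP_2 = "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20170101-20171229.nc"

# Abbreviated copy of the document: only the fields used below.
doc = {
    "activity": {
        "orchestrate": [
            {"prov:startedAtTime": "2021-02-15T13:24:33"},
            {"prov:endedAtTime": "2021-02-15T13:24:57"},
        ],
    },
    "wasDerivedFrom": {
        "_:id4": {
            "prov:generatedEntity": STEP_1,
            "prov:usedEntity": SOURCE,
            "prov:activity": "subset_tas_1",
        },
        "_:id6": {
            "prov:generatedEntity": STEP_2,
            "prov:usedEntity": STEP_1,
            "prov:activity": "subset_tas_2",
        },
    },
}


def lineage(doc, entity):
    """Follow wasDerivedFrom links back from `entity` to its original source."""
    parents = {
        rel["prov:generatedEntity"]: (rel["prov:activity"], rel["prov:usedEntity"])
        for rel in doc["wasDerivedFrom"].values()
    }
    steps = []
    while entity in parents:
        activity, entity = parents[entity]
        steps.append((activity, entity))
    return steps


def duration_seconds(doc, activity):
    """Merge the attribute records of an activity and compute its run time."""
    records = doc["activity"][activity]
    if isinstance(records, dict):
        records = [records]  # a single record is a plain object, not a list
    attrs = {}
    for record in records:
        attrs.update(record)
    start = datetime.fromisoformat(attrs["prov:startedAtTime"])
    end = datetime.fromisoformat(attrs["prov:endedAtTime"])
    return (end - start).total_seconds()


for activity, used in lineage(doc, STEP_2):
    print(f"{activity} used {used}")
print(duration_seconds(doc, "orchestrate"))  # 24.0
```

Here the 2017 output file traces back through subset_tas_2 and subset_tas_1 to the original c3s-cmip6 dataset, matching the nested Subset calls in the workflow.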
This provenance document can also be displayed as an image: