Provenance

Introduction

The rook processes record provenance information about each process execution. This information includes:

  • the software used and its versions (rook, daops, …)

  • the operators applied, like subset and average

  • the input data and parameters used (CMIP6 dataset, time, area)

  • the outputs generated (NetCDF files)

  • the execution time (start-time and end-time)

This information is described with the W3C PROV standard, using the Python PROV library.

Overview of PROV

The W3C PROV Primer document gives an overview of the W3C PROV standard.

_images/prov-overview.png

A PROV document consists of agents, activities and entities. These can be connected via PROV relations like wasDerivedFrom.
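To make these building blocks concrete, here is a minimal sketch that assembles a PROV-JSON-style dictionary with the Python standard library only. It does not use the Python PROV library and is not how rook builds its documents. The helper name build_subset_provenance and the shortened identifiers "dataset" and "output.nc" are hypothetical, chosen for illustration; the layout follows the PROV-JSON document shown at the end of this page.

```python
import json


def build_subset_provenance(dataset_id, output_id):
    """Return a PROV-JSON-style dict for one subset run (illustrative only)."""
    return {
        # Agents: who is responsible for the activity.
        "agent": {"daops": {"prov:type": "prov:SoftwareAgent"}},
        # Activities: what was done.
        "activity": {"subset": {}},
        # Entities: the things that were used and produced.
        "entity": {dataset_id: {}, output_id: {}},
        # Relations connect the three kinds of records.
        "wasAssociatedWith": {
            "_:id1": {"prov:activity": "subset", "prov:agent": "daops"}
        },
        "wasDerivedFrom": {
            "_:id2": {
                "prov:generatedEntity": output_id,
                "prov:usedEntity": dataset_id,
                "prov:activity": "subset",
            }
        },
    }


# Hypothetical identifiers, not real rook dataset or file names.
doc = build_subset_provenance("dataset", "output.nc")
print(json.dumps(doc, indent=2))
```

Each relation refers to agents, activities and entities by their identifiers, so the three record types can be declared once and linked as often as needed.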

Entities

W3C PROV

In PROV, physical, digital, conceptual, or other kinds of thing are called entities.

In rook we use entities for:

  • workflow description,

  • input datasets and

  • resulting output NetCDF files.

Activities

W3C PROV

Activities are how entities come into existence and how their attributes change to become new entities, often making use of previously existing entities to achieve this.

In rook we use activities for:

  • operators like subset and average,

  • processes like orchestrate, which runs a workflow.

Agents

W3C PROV

An agent takes a role in an activity such that the agent can be assigned some degree of responsibility for the activity taking place. An agent can be a person, a piece of software or an organisation.

In rook we use agents for:

  • software like rook and daops,

  • organisations like the Copernicus Climate Data Store.

Namespaces

W3C PROV

Using URIs and namespaces, a provenance record can draw from multiple sources on the Web.

We use namespaces to reuse existing PROV vocabularies, for example prov:SoftwareAgent. The example document at the end of this page declares, among others, the prefixes provone and dcterms.
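How a prefixed name such as prov:SoftwareAgent resolves to a full URI can be sketched as follows. This is an illustration, not part of rook: the helper expand is hypothetical. The prov prefix maps to the W3C PROV namespace, and the other two mappings are copied from the "prefix" section of the example document on this page.

```python
# Prefix map: "prov" is the W3C PROV namespace; "provone" and "dcterms"
# are copied from the example PROV-JSON document on this page.
PREFIXES = {
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#",
    "dcterms": "http://purl.org/dc/terms/",
}


def expand(qualified_name, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full URI (hypothetical helper)."""
    prefix, sep, local = qualified_name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {qualified_name!r}")
    return prefixes[prefix] + local


print(expand("prov:SoftwareAgent"))
print(expand("dcterms:title"))
```

Names without a known prefix raise a ValueError here; a real PROV toolkit may instead fall back to a default namespace.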

Subset Example

_images/prov-subset.png

The subset activity is started by the software agent daops (a Python library), which was triggered by rook (a data-reduction service).

The NetCDF file entity tas_day_...nc was derived from the c3s-cmip6 dataset entity by the subset activity.

Workflow Example

_images/prov-workflow.png

W3C PROV Plans

Activities may follow pre-defined procedures, such as recipes, tutorials, instructions, or workflows. PROV refers to these, in general, as plans.

In W3C PROV, workflows are called plans.

The orchestrate activity is started by the agent rook. It uses a workflow document entity (the plan), which consists of a subset and an average activity. These activities are started by the software agent daops.
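The link between agents, activities and the plan is stored in wasAssociatedWith records. The sketch below groups such records by agent, using the records from the example PROV-JSON document on this page (a workflow with two subset activities rather than a subset and an average). The helper names activities_by_agent and plans_used are hypothetical and not part of rook or rooki.

```python
from collections import defaultdict

# "wasAssociatedWith" records copied from the example PROV-JSON document
# on this page.
WAS_ASSOCIATED_WITH = {
    "_:id2": {"prov:activity": "orchestrate", "prov:agent": "rook",
              "prov:plan": "workflow"},
    "_:id3": {"prov:activity": "subset_tas_1", "prov:agent": "daops",
              "prov:plan": "workflow"},
    "_:id5": {"prov:activity": "subset_tas_2", "prov:agent": "daops",
              "prov:plan": "workflow"},
}


def activities_by_agent(associations):
    """Map each agent to the sorted list of activities associated with it."""
    grouped = defaultdict(list)
    for record in associations.values():
        grouped[record["prov:agent"]].append(record["prov:activity"])
    return {agent: sorted(acts) for agent, acts in grouped.items()}


def plans_used(associations):
    """Return the set of plan entities referenced by the associations."""
    return {r["prov:plan"] for r in associations.values() if "prov:plan" in r}


by_agent = activities_by_agent(WAS_ASSOCIATED_WITH)
print(by_agent)
print(plans_used(WAS_ASSOCIATED_WITH))
```

In this document every association points at the same plan, the workflow entity, which matches the description above: rook runs orchestrate, and daops runs the individual operators.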

Example: Workflow with Subsetting Operators

The rooki client for rook provides example notebooks that execute processes and display the provenance information.

You can run the orchestrate process to execute a workflow with subsetting operators and then show the provenance document:

from rooki import operators as ops

wf = ops.Subset(
    ops.Subset(
        ops.Input(
            'tas', ['c3s-cmip6.ScenarioMIP.INM.INM-CM5-0.ssp245.r1i1p1f1.day.tas.gr1.v20190619']
        ),
        time="2016-01-01/2020-12-30",
    ),
    time="2017-01-01/2017-12-30",
)
resp = wf.orchestrate()
# show URLs of output files
resp.download_urls()
# show URL to provenance document
resp.provenance()
# show URL to provenance image
resp.provenance_image()

The response of the process includes a provenance document in PROV-JSON format:

{
  "prefix": {
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#",
    "dcterms": "http://purl.org/dc/terms/",
    "default": "http://purl.org/roocs/prov#"
  },
  "agent": {
    "copernicus_CDS": {
      "prov:type": "prov:Organization",
      "dcterms:title": "Copernicus Climate Data Store"
    },
    "rook": {
      "prov:type": "prov:SoftwareAgent",
      "dcterms:source": "https://github.com/roocs/rook/releases/tag/v0.2.0"
    },
    "daops": {
      "prov:type": "prov:SoftwareAgent",
      "dcterms:source": "https://github.com/roocs/daops/releases/tag/v0.3.0"
    }
  },
  "wasAttributedTo": {
    "_:id1": {
      "prov:entity": "rook",
      "prov:agent": "copernicus_CDS"
    }
  },
  "entity": {
    "workflow": {
      "prov:type": "provone:Workflow"
    },
    "c3s-cmip6.ScenarioMIP.INM.INM-CM5-0.ssp245.r1i1p1f1.day.tas.gr1.v20190619": {},
    "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20160101-20201229.nc": [{}, {}],
    "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20170101-20171229.nc": {}
  },
  "activity": {
    "orchestrate": [{
      "prov:startedAtTime": "2021-02-15T13:24:33"
    }, {
      "prov:endedAtTime": "2021-02-15T13:24:57"
    }],
    "subset_tas_1": {
      "time": "2016-01-01/2020-12-30",
      "apply_fixes": false
    },
    "subset_tas_2": {
      "time": "2017-01-01/2017-12-30",
      "apply_fixes": false
    }
  },
  "wasAssociatedWith": {
    "_:id2": {
      "prov:activity": "orchestrate",
      "prov:agent": "rook",
      "prov:plan": "workflow"
    },
    "_:id3": {
      "prov:activity": "subset_tas_1",
      "prov:agent": "daops",
      "prov:plan": "workflow"
    },
    "_:id5": {
      "prov:activity": "subset_tas_2",
      "prov:agent": "daops",
      "prov:plan": "workflow"
    }
  },
  "wasDerivedFrom": {
    "_:id4": {
      "prov:generatedEntity": "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20160101-20201229.nc",
      "prov:usedEntity": "c3s-cmip6.ScenarioMIP.INM.INM-CM5-0.ssp245.r1i1p1f1.day.tas.gr1.v20190619",
      "prov:activity": "subset_tas_1"
    },
    "_:id6": {
      "prov:generatedEntity": "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20170101-20171229.nc",
      "prov:usedEntity": "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20160101-20201229.nc",
      "prov:activity": "subset_tas_2"
    }
  }
}
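Two things can be read directly from a document like this: how long the workflow ran, and which inputs an output file goes back to. The following sketch does both with the Python standard library, on an excerpt of the document above. The helpers merged, duration_seconds and lineage are hypothetical and not part of rook, rooki or the Python PROV library. Note that the document stores orchestrate as a list of two attribute dictionaries, so the sketch merges them first.

```python
from datetime import datetime

# Excerpt of the PROV-JSON document above (only the parts used here).
DATASET = "c3s-cmip6.ScenarioMIP.INM.INM-CM5-0.ssp245.r1i1p1f1.day.tas.gr1.v20190619"
FILE_2016_2020 = "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20160101-20201229.nc"
FILE_2017 = "tas_day_INM-CM5-0_ssp245_r1i1p1f1_gr1_20170101-20171229.nc"

doc = {
    "activity": {
        "orchestrate": [
            {"prov:startedAtTime": "2021-02-15T13:24:33"},
            {"prov:endedAtTime": "2021-02-15T13:24:57"},
        ],
    },
    "wasDerivedFrom": {
        "_:id4": {"prov:generatedEntity": FILE_2016_2020,
                  "prov:usedEntity": DATASET,
                  "prov:activity": "subset_tas_1"},
        "_:id6": {"prov:generatedEntity": FILE_2017,
                  "prov:usedEntity": FILE_2016_2020,
                  "prov:activity": "subset_tas_2"},
    },
}


def merged(record):
    """Merge a record stored as a list of attribute dicts into one dict."""
    if isinstance(record, dict):
        return dict(record)
    result = {}
    for part in record:
        result.update(part)
    return result


def duration_seconds(document, activity):
    """Seconds between prov:startedAtTime and prov:endedAtTime of an activity."""
    attrs = merged(document["activity"][activity])
    start = datetime.fromisoformat(attrs["prov:startedAtTime"])
    end = datetime.fromisoformat(attrs["prov:endedAtTime"])
    return (end - start).total_seconds()


def lineage(document, entity):
    """Follow wasDerivedFrom links back from `entity`.

    Returns a list of (used_entity, activity) steps, nearest step first.
    """
    derived = {r["prov:generatedEntity"]: r
               for r in document["wasDerivedFrom"].values()}
    steps, seen = [], set()
    while entity in derived and entity not in seen:
        seen.add(entity)  # guard against cyclic records
        record = derived[entity]
        steps.append((record["prov:usedEntity"], record["prov:activity"]))
        entity = record["prov:usedEntity"]
    return steps


print(duration_seconds(doc, "orchestrate"))
for used, activity in lineage(doc, FILE_2017):
    print(f"{activity} used {used}")
```

Starting from the 2017 file, the trace first reaches the 2016-2020 file via subset_tas_2 and then the c3s-cmip6 dataset via subset_tas_1, mirroring the nested Subset calls in the workflow.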

This provenance document can also be displayed as an image:

Provenance Example