Dataset Processing Flow

This page documents the request decision tree that remains after the director cleanup. There is no longer a Director class; the name "director" now refers to the small wrapper around request planning and execution that WPS processes use.

The diagram is written as a blueprint for a future reimplementation: first decide what kind of input was received, then decide whether catalog metadata is needed, then choose between returning existing files and running an operation.

Director Decision Tree

    flowchart TD
    Start(["Request arrives from a WPS process<br/>or workflow stage"])

    subgraph Input["1. Understand the input"]
        Start --> HasFiles{"Do we already have concrete files<br/>from an earlier workflow step?"}
        HasFiles -- yes --> DirectFiles["Normalize those files<br/>and run the requested operation"]
        HasFiles -- no --> Project["Read the project from<br/>the requested collection id"]
    end

    subgraph Catalog["2. Resolve catalog data when needed"]
        Project --> UsesCatalog{"Does this project use<br/>a Rook catalog?"}
        UsesCatalog -- no --> PlainOperation["Keep the original collection<br/>and run the operation"]
        UsesCatalog -- yes --> Search["Search the catalog<br/>for collection and time"]
        Search --> Found{"Did the catalog find<br/>every requested collection?"}
        Found -- no --> Reject["Reject the request<br/>as an invalid collection"]
    end

    subgraph Choice["3. Choose response type"]
        Found -- yes --> WantsOriginal{"Can the request return<br/>existing catalog files?"}
        WantsOriginal -- "yes: original_files<br/>or atlas shortcut" --> CatalogOriginal["Return catalog download URLs"]
        WantsOriginal -- no --> ChangesData{"Does the operation need<br/>new data to be written?"}
        ChangesData -- "yes: average, regrid,<br/>or dimension change" --> CatalogOperation["Resolve catalog files<br/>for processing"]
        ChangesData -- no --> Aligned{"Does the subset match<br/>whole source files?"}
        Aligned -- yes --> AlignedOriginal["Return only the matching<br/>download URLs"]
        Aligned -- no --> CatalogOperation
    end

    subgraph Run["4. Execute or adapt the result"]
        DirectFiles --> BuildSources["Build dataset sources"]
        PlainOperation --> BuildSources
        CatalogOperation --> BuildSources
        BuildSources --> Open["Detect data format and transport<br/>NetCDF, Zarr, Kerchunk, file, HTTP, S3"]
        Open --> Fixes["Apply internal dataset fixes<br/>when a dataset id is known"]
        Fixes --> Operation["Run subset, average,<br/>regrid, concat, or weighted average"]
        CatalogOriginal --> OriginalResponse["Return original-file response"]
        AlignedOriginal --> OriginalResponse
        Operation --> OutputResponse["Return operation output files"]
    end
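
The branches of the flowchart can also be read as one pure function. The following is a minimal sketch, not the Rook implementation: `RequestFacts`, `Outcome`, and `decide` are hypothetical names, and each boolean field stands in for the answer to one question node in the diagram.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the leaves of the flowchart."""
    RUN_ON_WORKFLOW_FILES = "run operation on workflow files"
    RUN_ON_ORIGINAL_COLLECTION = "run operation on original collection"
    REJECT_INVALID_COLLECTION = "reject as invalid collection"
    RETURN_ORIGINAL_FILES = "return catalog download URLs"
    RETURN_ALIGNED_FILES = "return only the matching download URLs"
    RUN_ON_CATALOG_FILES = "run operation on catalog-resolved files"


@dataclass(frozen=True)
class RequestFacts:
    """Hypothetical answers to the questions asked in the flowchart."""
    has_workflow_files: bool = False
    project_uses_catalog: bool = False
    catalog_found_all: bool = False
    can_return_original: bool = False  # original_files or atlas shortcut
    writes_new_data: bool = False      # average, regrid, or dimension change
    subset_matches_whole_files: bool = False


def decide(facts: RequestFacts) -> Outcome:
    # 1. Understand the input.
    if facts.has_workflow_files:
        return Outcome.RUN_ON_WORKFLOW_FILES
    # 2. Resolve catalog data when needed.
    if not facts.project_uses_catalog:
        return Outcome.RUN_ON_ORIGINAL_COLLECTION
    if not facts.catalog_found_all:
        return Outcome.REJECT_INVALID_COLLECTION
    # 3. Choose the response type.
    if facts.can_return_original:
        return Outcome.RETURN_ORIGINAL_FILES
    if facts.writes_new_data:
        return Outcome.RUN_ON_CATALOG_FILES
    if facts.subset_matches_whole_files:
        return Outcome.RETURN_ALIGNED_FILES
    return Outcome.RUN_ON_CATALOG_FILES
```

Because the function takes only booleans, every path through the diagram can be covered by a small table-driven test without touching a catalog.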
    

Decision Ownership

rook.operations.execution.Operator.call decides whether a request is already a file list from a previous workflow step. Those requests bypass catalog planning and run the operation runner directly with a FileMapper.

rook.director.planning.plan_request handles catalog-backed requests. It resolves the project, validates catalog search results, and chooses between an original-file response and operation execution.

rook.director.execution.execute_plan adapts the plan into output URIs. It collects original-file URLs when processing is skipped; otherwise it prepares operation inputs and calls the operation runner.

rook.operations.consolidate converts operation collections into DatasetSource values. It keeps direct Zarr, Kerchunk, and S3 inputs out of catalog lookup, resolves catalog-backed NetCDF datasets to files, and preserves dataset IDs where they are needed for dataset fixes.

rook.io.datasets owns format and transport detection, storage options, and dataset opening. Catalog-specific fixes are applied only when a DatasetSource has a dataset ID.
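
To make the split between source construction and dataset opening concrete, here is a rough sketch of what a source value and its detection helpers might look like. Everything in it is an assumption made for illustration: the fields of `DatasetSource`, the helper names, and the suffix rules (`.zarr` for Zarr, `.json` for Kerchunk references, NetCDF otherwise) are not taken from rook.io.datasets.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class DatasetSource:
    """Hypothetical shape: a location plus an optional catalog dataset ID."""
    uri: str
    dataset_id: Optional[str] = None


def detect_transport(uri: str) -> str:
    """Classify how the data is reached, based only on the URI scheme."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in ("http", "https"):
        return "http"
    if scheme == "s3":
        return "s3"
    return "file"  # plain paths and file:// URIs


def detect_format(uri: str) -> str:
    """Guess the data format from the path suffix (an assumed convention)."""
    path = urlparse(uri).path.rstrip("/").lower()
    if path.endswith(".zarr"):
        return "zarr"
    if path.endswith(".json"):
        return "kerchunk"
    return "netcdf"


def needs_fixes(source: DatasetSource) -> bool:
    """Fixes apply only when the source carries a dataset ID."""
    return source.dataset_id is not None
```

Keeping the dataset ID optional on the source is what lets direct Zarr, Kerchunk, and S3 inputs skip the fix step without a separate code path.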

Blueprint for Reimplementation

The future director should be a planner, not an operation runner. It should return one explicit decision value that describes what the caller must do next:

  • reject the request with a known error;

  • return original files;

  • run an operation with the original collection;

  • run an operation with catalog-resolved dataset sources.

The planner should keep these responsibilities separate:

  • input classification: workflow files versus collection IDs;

  • project and catalog resolution;

  • original-file eligibility;

  • subset-to-file alignment;

  • construction of operation sources;

  • WPS response and exception adaptation.

The execution side should be boring on purpose. Given a plan, it should either collect original-file URLs or prepare operation inputs and call the supplied runner. It should not repeat catalog decisions.
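
A minimal sketch of such an execution step is shown below. The `Plan` fields, the `Runner` alias, and the `execute` function are hypothetical; the only point is that the function branches on what the plan already says and never consults the catalog itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Plan:
    """Hypothetical plan: original URLs to return, or inputs to process."""
    skip_processing: bool
    original_urls: List[str] = field(default_factory=list)
    operation_inputs: List[str] = field(default_factory=list)


# A runner takes prepared inputs and returns output URIs (assumed signature).
Runner = Callable[[Sequence[str]], List[str]]


def execute(plan: Plan, runner: Runner) -> List[str]:
    """Turn a plan into output URIs without repeating any planning logic."""
    if plan.skip_processing:
        # Original-file response: hand back the URLs chosen by the planner.
        return list(plan.original_urls)
    # Operation response: delegate to the supplied runner.
    return runner(plan.operation_inputs)
```

Passing the runner in as an argument keeps the function easy to test with a stub and keeps operation details out of the planning code.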

A future type model could make the decision tree easier to read in code:

RequestDecision = (
    InvalidRequest
    | ReturnOriginalFiles
    | RunWithOriginalCollection
    | RunWithResolvedSources
)
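
One possible way to flesh out that union is with small frozen dataclasses, one per bullet in the decision list. The payload fields (`reason`, `urls`, `collection`, `sources`) and the `next_step` helper are assumptions made for this sketch; `typing.Union` is used instead of `|` only so the alias also works on Python versions before 3.10.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class InvalidRequest:
    reason: str  # hypothetical: message for the known error


@dataclass(frozen=True)
class ReturnOriginalFiles:
    urls: Tuple[str, ...]  # hypothetical: download URLs to return


@dataclass(frozen=True)
class RunWithOriginalCollection:
    collection: Tuple[str, ...]  # hypothetical: collection IDs as requested


@dataclass(frozen=True)
class RunWithResolvedSources:
    sources: Tuple[str, ...]  # hypothetical: catalog-resolved source URIs


RequestDecision = Union[
    InvalidRequest,
    ReturnOriginalFiles,
    RunWithOriginalCollection,
    RunWithResolvedSources,
]


def next_step(decision: RequestDecision) -> str:
    """Describe what the caller must do next for each decision variant."""
    if isinstance(decision, InvalidRequest):
        return "reject: " + decision.reason
    if isinstance(decision, ReturnOriginalFiles):
        return "return %d original file(s)" % len(decision.urls)
    if isinstance(decision, RunWithOriginalCollection):
        return "run operation on original collection"
    return "run operation on resolved sources"
```

With one class per outcome, a caller that forgets a variant is easy to spot in review, and a type checker can flag it if the final branch is made explicit.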

The important boundary is that catalog planning decides what should happen, while operation execution decides how to run the selected operation.