Skip to content

Configuration reference

Pipelines are configured in YAML. The file is parsed with yaml.safe_load (no Python code execution) and validated against Pydantic models before any data is loaded.

Top-level schema

source:          # single data source (mutually exclusive with sources:)
  ...

sources:         # placeholder — multi-source support (coming soon)
  name:
    ...

output:
  ...

cluster:         # optional Dask cluster settings
  ...

metadata:        # optional: extra CF attributes to merge into the output
  ...

calls:
  - func: <name>
    ...

Multi-source (upcoming)

The sources field is a placeholder for a future feature. When implemented, each named source will map to a separate Zarr group in the output, allowing datasets on different grids to coexist without interpolation:

# Not yet implemented — schema is validated but will raise NotImplementedError at runtime
sources:
  waves:
    type: xarray
    urlpath: gs://bucket/waves.zarr
  wind:
    type: xarray
    urlpath: gs://bucket/wind.zarr

output:
  outfile: ./combined.zarr   # will write combined.zarr/waves and combined.zarr/wind

calls:
  - source: waves
    func: mean
    data_vars: [hs]
  - source: wind
    func: mean
    data_vars: [wspd]