Configuration Management#

zarrio provides flexible configuration management using Pydantic models for type safety and validation.

Pydantic Models#

All configuration in zarrio is managed through Pydantic models, which provide:

  • Type safety with automatic validation

  • Clear error messages for invalid configurations

  • Support for nested configurations

  • Serialization to/from YAML and JSON

ZarrConverterConfig#

The main configuration class is ZarrConverterConfig:

from zarrio.models import ZarrConverterConfig

config = ZarrConverterConfig(
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression={"method": "blosc:zstd:3"},
    packing={"enabled": True, "bits": 16},
    time={"dim": "time", "append_dim": "time"},
    variables={"include": ["temperature"], "exclude": ["humidity"]},
    target_chunk_size_mb=100,  # Configurable target chunk size
    attrs={"title": "Demo dataset", "source": "zarrio"}
)

The ZarrConverterConfig supports the following fields:

  • chunking: Chunking configuration (time, lat, lon, depth dimensions)

  • compression: Compression settings (method, cname, clevel, shuffle)

  • packing: Data packing configuration (enabled, bits)

  • time: Time dimension configuration (dim, append_dim, global_start, global_end, freq)

  • variables: Variable selection (include, exclude)

  • missing_data: Missing data handling (missing_check_vars, retries_on_missing)

  • datamesh: Datamesh integration configuration

  • attrs: Global attributes to add to the dataset

  • target_chunk_size_mb: Target chunk size in MB for intelligent chunking (default: 50)

  • access_pattern: Access pattern for chunking optimization (‘temporal’, ‘spatial’, ‘balanced’)

  • retries_on_missing: Number of retries for missing data (backward compatibility)

  • missing_check_vars: Variables to check for missing data (backward compatibility)

Chunking Configuration#

Chunking can be configured through the ChunkingConfig model:

from zarrio.models import ChunkingConfig

chunking = ChunkingConfig(
    time=100,
    lat=50,
    lon=100
)
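
To judge whether a chunk shape is reasonable, it helps to estimate its uncompressed size. The sketch below is not part of zarrio: `chunk_size_mib` is a hypothetical helper, and the 4-byte item size assumes float32 data.

```python
import math


def chunk_size_mib(chunks, itemsize=4):
    """Return the uncompressed size of one chunk in MiB.

    chunks: mapping of dimension name to chunk length.
    itemsize: bytes per element (4 assumes float32 data).
    """
    n_elements = math.prod(chunks.values())
    return n_elements * itemsize / 2**20


size = chunk_size_mib({"time": 100, "lat": 50, "lon": 100})
print(f"{size:.2f} MiB")  # 500,000 elements * 4 bytes -> 1.91 MiB
```

At roughly 1.9 MiB, this chunk shape sits well below the default target_chunk_size_mb of 50.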

Packing Configuration#

Packing can be configured through the PackingConfig model:

from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16
)

Enhanced Packing Features#

The enhanced packing functionality provides several improvements over basic packing:

Priority-Based Range Determination:

The enhanced packing system uses a clear priority order for determining the min/max values used for packing:

  1. Manual ranges (if provided)

  2. Variable attributes (valid_min/valid_max)

  3. Automatic calculation from data (with warnings)
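
The priority order above can be sketched as a small lookup function. This is an illustration of the described behaviour rather than zarrio's implementation: `resolve_range` and its arguments are hypothetical names.

```python
import warnings


def resolve_range(name, values, attrs=None, manual_ranges=None):
    """Return (min, max, source) for a variable using the documented priority."""
    manual_ranges = manual_ranges or {}
    attrs = attrs or {}
    # 1. A manually specified range wins.
    if name in manual_ranges:
        manual = manual_ranges[name]
        return manual["min"], manual["max"], "manual"
    # 2. Otherwise fall back to the variable's valid_min/valid_max attributes.
    if "valid_min" in attrs and "valid_max" in attrs:
        return attrs["valid_min"], attrs["valid_max"], "attrs"
    # 3. Last resort: compute the range from the data and warn about it.
    warnings.warn(f"No range given for {name!r}; calculating it from the data")
    return min(values), max(values), "data"


print(resolve_range("temperature", [3.0, 21.5],
                    manual_ranges={"temperature": {"min": -50, "max": 50}}))
```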

Manual Range Specification:

Users can explicitly specify min/max ranges for variables:

from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16,
    manual_ranges={
        "temperature": {"min": -50, "max": 50},
        "pressure": {"min": 90000, "max": 110000}
    }
)

Automatic Range Calculation with Buffer:

When no ranges are provided, the system automatically calculates them from the data:

from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16,
    auto_buffer_factor=0.05  # 5% buffer
)
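
To make the arithmetic concrete, here is a sketch of how a buffered range could translate into packing parameters. It assumes the buffer widens the data range by `auto_buffer_factor` times the span on each side, and it uses one common scale/offset convention for unsigned integers; zarrio's exact formulas may differ, and all function names are hypothetical.

```python
def buffered_range(vmin, vmax, buffer_factor=0.05):
    """Widen [vmin, vmax] by buffer_factor of the span on each side (assumed rule)."""
    pad = (vmax - vmin) * buffer_factor
    return vmin - pad, vmax + pad


def packing_params(vmin, vmax, bits=16):
    """Return (scale_factor, add_offset) for unsigned integer packing."""
    max_code = 2**bits - 2  # codes 0..max_code hold data; the top code is left for a fill value
    return (vmax - vmin) / max_code, vmin


def pack(value, scale_factor, add_offset):
    """Map a float onto the nearest integer code."""
    return round((value - add_offset) / scale_factor)


def unpack(code, scale_factor, add_offset):
    """Map an integer code back to an approximate float."""
    return code * scale_factor + add_offset


low, high = buffered_range(-40.0, 40.0)  # data range widened by 5% per side
scale, offset = packing_params(low, high)
restored = unpack(pack(12.3456, scale, offset), scale, offset)
print(low, high, scale, abs(restored - 12.3456))
```

The round-trip error is bounded by half of `scale`, which is why a tighter range gives better precision for the same number of bits.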

Range Exceeded Validation:

Optional checking to ensure data doesn’t exceed specified ranges:

from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16,
    manual_ranges={"temperature": {"min": -50, "max": 50}},
    check_range_exceeded=True,
    range_exceeded_action="error"  # or "warn" or "ignore"
)
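
The three `range_exceeded_action` values could behave roughly as in the sketch below. This is an assumption about the semantics based on the option names; `check_range` is a hypothetical helper, not a zarrio function.

```python
import warnings


def check_range(name, values, vmin, vmax, action="error"):
    """Count values outside [vmin, vmax] and react according to action."""
    if action not in ("error", "warn", "ignore"):
        raise ValueError(f"unknown action {action!r}")
    n_exceeded = sum(1 for v in values if v < vmin or v > vmax)
    if n_exceeded and action != "ignore":
        message = f"{name}: {n_exceeded} value(s) outside [{vmin}, {vmax}]"
        if action == "error":
            raise ValueError(message)
        warnings.warn(message)  # action == "warn"
    return n_exceeded


print(check_range("temperature", [-10.0, 20.0, 61.0], -50, 50, action="ignore"))
```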

Compression Configuration#

Compression can be configured through the CompressionConfig model:

from zarrio.models import CompressionConfig

compression = CompressionConfig(
    method="blosc:zstd:3",
    cname="zstd",
    clevel=3,
    shuffle="shuffle"
)

Time Configuration#

Time handling can be configured through the TimeConfig model:

from zarrio.models import TimeConfig

time_config = TimeConfig(
    dim="time",
    append_dim="time",
    global_start="2020-01-01",
    global_end="2023-12-31",
    freq="1D"
)

Variable Configuration#

Variable handling can be configured through the VariableConfig model:

from zarrio.models import VariableConfig

variables = VariableConfig(
    include=["temperature", "pressure"],
    exclude=["humidity"]
)
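
The effect of `include` and `exclude` can be sketched as a simple filter. The sketch assumes that `exclude` takes precedence over `include` and that `include=None` means "all variables"; `select_variables` is a hypothetical helper.

```python
def select_variables(available, include=None, exclude=None):
    """Return the variables to convert, preserving their original order."""
    excluded = set(exclude or [])
    if include is None:
        candidates = list(available)  # no include list: start from everything
    else:
        candidates = [name for name in available if name in include]
    # Exclusion is applied last, so it overrides inclusion.
    return [name for name in candidates if name not in excluded]


available = ["temperature", "pressure", "humidity", "wind_speed"]
print(select_variables(available, include=["temperature", "pressure"], exclude=["humidity"]))
```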

Loading from Files#

Configuration can be loaded from YAML or JSON files:

YAML Example:#

# config.yaml
chunking:
  time: 150
  lat: 60
  lon: 120
compression:
  method: blosc:zstd:2
  clevel: 2
packing:
  enabled: true
  bits: 16
time:
  dim: time
  append_dim: time
  global_start: "2020-01-01"
  global_end: "2023-12-31"
  freq: "1D"
variables:
  include:
    - temperature
    - pressure
  exclude:
    - humidity
attrs:
  title: YAML Config Demo
  version: "1.0"

Loading:#

from zarrio.models import ZarrConverterConfig

config = ZarrConverterConfig.from_yaml_file("config.yaml")

JSON Example:#

{
  "chunking": {
    "time": 125,
    "lat": 55,
    "lon": 110
  },
  "compression": {
    "method": "blosc:lz4:1",
    "clevel": 1
  },
  "packing": {
    "enabled": true,
    "bits": 8
  },
  "time": {
    "dim": "time",
    "append_dim": "time",
    "global_start": "2020-01-01",
    "global_end": "2023-12-31",
    "freq": "1D"
  },
  "variables": {
    "include": ["temperature", "pressure"],
    "exclude": ["humidity"]
  },
  "attrs": {
    "title": "JSON Config Demo",
    "version": "1.0"
  }
}

Loading:#

from zarrio.models import ZarrConverterConfig

config = ZarrConverterConfig.from_json_file("config.json")
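
For readers curious what a file loader of this kind involves, here is a minimal standard-library sketch. `SketchConfig` is a hypothetical stand-in built on dataclasses, not zarrio's Pydantic model, and it only checks top-level keys.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a converter configuration."""

    chunking: dict = field(default_factory=dict)
    compression: dict = field(default_factory=dict)
    packing: dict = field(default_factory=dict)
    attrs: dict = field(default_factory=dict)

    @classmethod
    def from_json_file(cls, path):
        """Read a JSON file and reject keys the configuration does not know."""
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        unknown = set(data) - set(cls.__dataclass_fields__)
        if unknown:
            raise ValueError(f"unknown configuration keys: {sorted(unknown)}")
        return cls(**data)
```

Rejecting unknown keys early is the design choice worth copying: a misspelled section name fails loudly instead of being silently ignored.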

Programmatic Configuration#

Configuration can also be created programmatically:

from zarrio.models import (
    ZarrConverterConfig,
    ChunkingConfig,
    PackingConfig,
    CompressionConfig,
    TimeConfig,
    VariableConfig
)

config = ZarrConverterConfig(
    chunking=ChunkingConfig(time=100, lat=50, lon=100),
    compression=CompressionConfig(method="blosc:zstd:3"),
    packing=PackingConfig(enabled=True, bits=16),
    time=TimeConfig(dim="time", append_dim="time"),
    variables=VariableConfig(include=["temperature"], exclude=["humidity"]),
    attrs={"title": "Programmatic Config Demo", "source": "zarrio"}
)

Validation#

Pydantic models automatically validate configurations:

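from pydantic import ValidationError
from zarrio.models import ZarrConverterConfig

try:
    config = ZarrConverterConfig(
        # Highly unbalanced chunks like these will generate warnings;
        # values that fail validation raise ValidationError instead.
        chunking={"time": 1000, "lat": 1, "lon": 1}
    )
except ValidationError as e:
    print(f"Configuration error: {e}")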

Missing Data Handling Configuration#

zarrio provides robust missing data handling through the MissingDataConfig model:

from zarrio.models import ZarrConverterConfig, MissingDataConfig

config = ZarrConverterConfig(
    missing_data=MissingDataConfig(
        missing_check_vars="all",      # Check all variables for missing data
        retries_on_missing=3           # Retry up to 3 times if missing data is detected
    )
)

The MissingDataConfig supports the following fields:

  • missing_check_vars: Variables to check for missing values (“all”, None, or list of variable names) (default: “all”)

  • retries_on_missing: Number of retries if missing values are encountered (default: 0, range: 0-10)

The missing data detection system automatically checks for missing data after writing operations by comparing the source dataset with the written Zarr archive. This is particularly useful for detecting transient failures during parallel processing.

When missing data is detected and retries are enabled, zarrio automatically retries the operation, waiting longer before each successive attempt:

  1. First retry: 0.1 second delay

  2. Second retry: 0.2 second delay

  3. Third retry: 0.3 second delay

  4. And so on…

This approach allows the system to recover from transient issues such as network instability, file system contention, or memory pressure.
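
The delays listed above grow by 0.1 seconds per attempt. A retry loop with that schedule might look like the sketch below; `run_with_retries` and `write_and_check` are hypothetical names, and zarrio's internal loop may be structured differently.

```python
import time


def run_with_retries(write_and_check, retries_on_missing=3, base_delay=0.1, sleep=time.sleep):
    """Call write_and_check() until it reports complete data or retries run out.

    write_and_check: callable returning True when no data is missing.
    Returns the number of retries that were needed.
    """
    for attempt in range(retries_on_missing + 1):
        if attempt > 0:
            sleep(base_delay * attempt)  # wait grows with each retry
        if write_and_check():
            return attempt
    raise RuntimeError(f"data still missing after {retries_on_missing} retries")


# Demo: the first two attempts report missing data, the third succeeds.
outcomes = iter([False, False, True])
delays = []
retries = run_with_retries(lambda: next(outcomes), sleep=delays.append)
print(retries, delays)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.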

Check all variables with 2 retries:

config = ZarrConverterConfig(
    missing_data=MissingDataConfig(
        missing_check_vars="all",
        retries_on_missing=2
    )
)

Check specific variables with 3 retries:

config = ZarrConverterConfig(
    missing_data=MissingDataConfig(
        missing_check_vars=["temperature", "pressure"],
        retries_on_missing=3
    )
)

Disable missing data checking:

config = ZarrConverterConfig(
    missing_data=MissingDataConfig(
        missing_check_vars=None,
        retries_on_missing=0
    )
)

Missing data handling can also be configured in YAML files:

# config.yaml
missing_data:
  missing_check_vars: "all"
  retries_on_missing: 3
chunking:
  time: 100
  lat: 50
  lon: 100

Missing data handling options are available through the CLI:

# Convert with retry logic
zarrio convert input.nc output.zarr --retries-on-missing 3

# Append with retry logic
zarrio append new_data.nc existing.zarr --retries-on-missing 2

# Create template with retry logic
zarrio create-template template.nc archive.zarr --retries-on-missing 1

# Write region with retry logic
zarrio write-region data.nc archive.zarr --retries-on-missing 2

Access Pattern Configuration#

The access_pattern field controls how intelligent chunking optimizes the data layout for different access patterns:

  • temporal: Optimized for time series analysis (large time chunks, smaller spatial chunks)

  • spatial: Optimized for spatial analysis (smaller time chunks, larger spatial chunks)

  • balanced: Good performance for mixed workloads (moderate chunking across all dimensions)

When intelligent_chunking is enabled in the create-template command, the access pattern determines how the chunking is calculated based on the full archive dimensions.
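
To illustrate how an access pattern can shape chunks under a size budget, here is a deliberately simple heuristic. It is not zarrio's algorithm: the weights in `TIME_SHARE`, the function `suggest_chunks`, and the float32 item size are all assumptions made for this sketch.

```python
import math

# Share of the element budget (on a log scale) given to the time dimension.
# Illustrative values only.
TIME_SHARE = {"temporal": 0.6, "balanced": 1 / 3, "spatial": 0.15}


def suggest_chunks(dims, access_pattern="balanced", target_chunk_size_mb=50, itemsize=4):
    """Pick time/lat/lon chunk lengths that stay within the target size."""
    budget = target_chunk_size_mb * 2**20 // itemsize  # elements per chunk
    time_chunk = max(1, min(dims["time"], int(budget ** TIME_SHARE[access_pattern])))
    side = math.isqrt(budget // time_chunk)  # square spatial tile from the remainder
    return {
        "time": time_chunk,
        "lat": max(1, min(dims["lat"], side)),
        "lon": max(1, min(dims["lon"], side)),
    }


archive_dims = {"time": 1461, "lat": 180, "lon": 360}  # 4 years of daily 1-degree data
for pattern in ("temporal", "balanced", "spatial"):
    print(pattern, suggest_chunks(archive_dims, pattern))
```

The temporal pattern ends up with the longest time chunks and the smallest spatial tiles, while the spatial pattern does the opposite.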

Programmatic Configuration:

from zarrio.models import ZarrConverterConfig

config = ZarrConverterConfig(
    access_pattern="temporal"  # Optimize for time series analysis
)

YAML Configuration:

# config.yaml
access_pattern: temporal
chunking:
  time: 100
  lat: 50
  lon: 100

JSON Configuration:

{
  "access_pattern": "temporal",
  "chunking": {
    "time": 100,
    "lat": 50,
    "lon": 100
  }
}

CLI Usage:

# Create template with intelligent chunking for temporal analysis
zarrio create-template template.nc archive.zarr \
    --global-start 2020-01-01 \
    --global-end 2023-12-31 \
    --intelligent-chunking \
    --access-pattern temporal

# Create template with intelligent chunking for spatial analysis
zarrio create-template template.nc archive.zarr \
    --global-start 2020-01-01 \
    --global-end 2023-12-31 \
    --intelligent-chunking \
    --access-pattern spatial

# Create template with intelligent chunking for balanced access
zarrio create-template template.nc archive.zarr \
    --global-start 2020-01-01 \
    --global-end 2023-12-31 \
    --intelligent-chunking \
    --access-pattern balanced

Recommendations for the retries_on_missing and missing_check_vars settings described under Missing Data Handling Configuration:

  1. Moderate Retries: Use 2-3 retries for most use cases

  2. Aggressive Retries: Increase retries_on_missing for very large datasets or unstable environments

  3. Variable Selection: Specify missing_check_vars to focus on critical variables

  4. Monitoring: Watch logs for frequent retries, which might indicate underlying issues

  5. Disabled Retries: Set retries_on_missing to 0 for deterministic environments where failures should be immediate errors

Configuration with ZarrConverter#

Once you have a configuration, you can use it with ZarrConverter:

from zarrio import ZarrConverter
from zarrio.models import ZarrConverterConfig

config = ZarrConverterConfig(
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression={"method": "blosc:zstd:3"},
    packing={"enabled": True, "bits": 16}
)

converter = ZarrConverter(config=config)
converter.convert("input.nc", "output.zarr")

CLI Configuration#

Configuration files can also be used with the CLI:

zarrio convert input.nc output.zarr --config config.yaml

Environment Variables#

Some configuration can also be set through environment variables:

export ONZARR2_LOG_LEVEL=DEBUG
export ONZARR2_CHUNK_SIZE_TIME=100
export ONZARR2_COMPRESSION=blosc:zstd:3
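
How these variables map onto configuration fields is not spelled out here. The sketch below shows one plausible mapping into a nested override dictionary; the mapping and the `overrides_from_env` helper are assumptions for illustration only.

```python
import os


def overrides_from_env(environ=None, prefix="ONZARR2_"):
    """Translate prefixed environment variables into a nested override dict."""
    environ = os.environ if environ is None else environ
    overrides = {}
    if (value := environ.get(f"{prefix}CHUNK_SIZE_TIME")) is not None:
        overrides.setdefault("chunking", {})["time"] = int(value)
    if (value := environ.get(f"{prefix}COMPRESSION")) is not None:
        overrides["compression"] = {"method": value}
    if (value := environ.get(f"{prefix}LOG_LEVEL")) is not None:
        overrides["log_level"] = value.upper()
    return overrides


example_env = {
    "ONZARR2_LOG_LEVEL": "DEBUG",
    "ONZARR2_CHUNK_SIZE_TIME": "100",
    "ONZARR2_COMPRESSION": "blosc:zstd:3",
}
print(overrides_from_env(example_env))
```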

Default Configuration#

zarrio provides sensible defaults for all configuration options:

from zarrio.models import ZarrConverterConfig

# Default config
config = ZarrConverterConfig()

# Shows default values:
# - chunking: {}
# - compression: None
# - packing: {"enabled": False, "bits": 16}
# - time: {"dim": "time", "append_dim": "time"}
# - variables: {"include": None, "exclude": None}
# - attrs: {}

Best Practices#

  1. Use Configuration Files: For complex setups, use YAML or JSON configuration files

  2. Validate Early: Let Pydantic validate your configurations before using them

  3. Type Safety: Take advantage of type hints for better development experience

  4. Environment Overrides: Use environment variables for deployment-specific settings

  5. Documentation: Document your configuration files with comments

Example Configuration File with Comments:#

# config.yaml - Production configuration for zarrio
#
# This configuration is optimized for:
# - Large climate datasets (1° global daily data)
# - Balanced access patterns (time series and spatial analysis)
# - High compression ratio
# - Efficient storage

# Chunking strategy optimized for balanced access patterns
chunking:
  time: 100    # 100 time steps per chunk (about 3-4 months)
  lat: 50      # 50 latitude points per chunk (about 50°)
  lon: 100     # 100 longitude points per chunk (about 100°)

# Compression settings (Blosc with Zstd level 3)
compression:
  method: blosc:zstd:3
  cname: zstd
  clevel: 3
  shuffle: shuffle

# Data packing (16-bit packing for float32 data)
packing:
  enabled: true
  bits: 16

# Time dimension configuration
time:
  dim: time
  append_dim: time
  # global_start: "2020-01-01"  # Set at runtime
  # global_end: "2023-12-31"    # Set at runtime
  freq: "1D"

# Variable selection
variables:
  include:       # Only include these variables
    - temperature
    - pressure
    - humidity
    - wind_speed
  exclude:       # Exclude these variables (takes precedence)
    - unused_var

# Global attributes to add
attrs:
  title: "Production Climate Dataset"
  institution: "Oceanum"
  source: "zarrio"
  processing_date: "2023-01-01"
  version: "1.0"

# Retry configuration for missing data
retries_on_missing: 3
missing_check_vars: all