Configuration Management#
zarrio provides flexible configuration management using Pydantic models for type safety and validation.
Pydantic Models#
All configuration in zarrio is managed through Pydantic models, which provide:
Type safety with automatic validation
Clear error messages for invalid configurations
Support for nested configurations
Serialization to/from YAML and JSON
ZarrConverterConfig#
The main configuration class is ZarrConverterConfig:
from zarrio.models import ZarrConverterConfig

config = ZarrConverterConfig(
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression={"method": "blosc:zstd:3"},
    packing={"enabled": True, "bits": 16},
    time={"dim": "time", "append_dim": "time"},
    variables={"include": ["temperature"], "exclude": ["humidity"]},
    target_chunk_size_mb=100,  # Configurable target chunk size
    attrs={"title": "Demo dataset", "source": "zarrio"}
)
The ZarrConverterConfig supports the following fields:
chunking: Chunking configuration (time, lat, lon, depth dimensions)
compression: Compression settings (method, cname, clevel, shuffle)
packing: Data packing configuration (enabled, bits)
time: Time dimension configuration (dim, append_dim, global_start, global_end, freq)
variables: Variable selection (include, exclude)
missing_data: Missing data handling (check_vars, retries_on_missing, missing_check_vars)
datamesh: Datamesh integration configuration
attrs: Global attributes to add to the dataset
target_chunk_size_mb: Target chunk size in MB for intelligent chunking (default: 50)
access_pattern: Access pattern for chunking optimization ("temporal", "spatial", "balanced")
retries_on_missing: Number of retries for missing data (backward compatibility)
missing_check_vars: Variables to check for missing data (backward compatibility)
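To give an intuition for how target_chunk_size_mb maps to a chunk's element count, the arithmetic below shows the conversion for uncompressed float32 data. This is an illustrative helper, not part of the zarrio API:

```python
def elements_for_target(target_mb: float, itemsize: int = 4) -> int:
    """Number of array elements that fit in a target chunk size.

    Assumes uncompressed data with `itemsize` bytes per element
    (4 for float32). Illustrates the idea behind target_chunk_size_mb;
    zarrio's intelligent chunking may account for more factors.
    """
    return int(target_mb * 1024 * 1024 / itemsize)

# With the default 50 MB target and float32 data:
print(elements_for_target(50))  # 13107200 elements per chunk
```

The intelligent chunker then distributes this element budget across the time, lat, and lon dimensions according to the configured access pattern.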
Chunking Configuration#
Chunking can be configured through the ChunkingConfig model:
from zarrio.models import ChunkingConfig

chunking = ChunkingConfig(
    time=100,
    lat=50,
    lon=100
)
Packing Configuration#
Packing can be configured through the PackingConfig model:
from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16
)
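For intuition, 16-bit packing stores each float as a small integer via a scale and offset, in the style of CF packing conventions. The helpers below are illustrative arithmetic only, not zarrio's implementation:

```python
def pack_params(vmin: float, vmax: float, bits: int = 16):
    """Compute a scale_factor/add_offset pair that maps floats in
    [vmin, vmax] onto unsigned integers of `bits` bits.
    Illustrative helper, not part of the zarrio API."""
    n = 2 ** bits - 1
    scale = (vmax - vmin) / n
    return scale, vmin

def pack(value, scale, offset):
    return round((value - offset) / scale)

def unpack(packed, scale, offset):
    return packed * scale + offset

scale, offset = pack_params(-50.0, 50.0, bits=16)
p = pack(21.37, scale, offset)
# Round-trip error is bounded by one quantization step (the scale):
print(abs(unpack(p, scale, offset) - 21.37) <= scale)  # True
```

Fewer bits shrink storage but widen the quantization step, which is why 16 bits is a common default for float32 data.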
Enhanced Packing Features#
The enhanced packing functionality provides several improvements over basic packing:
- Priority-Based Range Determination:
The enhanced packing system uses a clear priority order for determining the min/max values used for packing:
1. Manual ranges (if provided)
2. Variable attributes (valid_min/valid_max)
3. Automatic calculation from the data (with warnings)
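The priority order above can be sketched as a small resolver. This is a hypothetical illustration of the documented precedence, not zarrio's internal code:

```python
import warnings

def resolve_range(var_name, manual_ranges, attrs, data_min, data_max,
                  buffer=0.05):
    """Resolve a packing min/max using the documented priority order:
    1. manual ranges, 2. valid_min/valid_max attributes,
    3. automatic calculation from the data (with a warning and buffer).
    Hypothetical sketch for illustration only.
    """
    if manual_ranges and var_name in manual_ranges:
        r = manual_ranges[var_name]
        return r["min"], r["max"]
    if "valid_min" in attrs and "valid_max" in attrs:
        return attrs["valid_min"], attrs["valid_max"]
    warnings.warn(f"Computing packing range for {var_name} from data")
    span = data_max - data_min
    return data_min - buffer * span, data_max + buffer * span

# A manual range takes precedence over everything else:
print(resolve_range("t", {"t": {"min": -50, "max": 50}}, {}, 0, 1))
```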
- Manual Range Specification:
Users can explicitly specify min/max ranges for variables:
from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16,
    manual_ranges={
        "temperature": {"min": -50, "max": 50},
        "pressure": {"min": 90000, "max": 110000}
    }
)
- Automatic Range Calculation with Buffer:
When no ranges are provided, the system automatically calculates them from the data:
from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16,
    auto_buffer_factor=0.05  # 5% buffer
)
- Range Exceeded Validation:
Optional checking to ensure data doesn’t exceed specified ranges:
from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16,
    manual_ranges={"temperature": {"min": -50, "max": 50}},
    check_range_exceeded=True,
    range_exceeded_action="error"  # or "warn" or "ignore"
)
Compression Configuration#
Compression can be configured through the CompressionConfig model:
from zarrio.models import CompressionConfig

compression = CompressionConfig(
    method="blosc:zstd:3",
    cname="zstd",
    clevel=3,
    shuffle="shuffle"
)
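The method string bundles the codec family, codec name, and compression level into one value. A parse along these lines shows the convention (hypothetical helper mirroring the "blosc:zstd:3" format used throughout these docs):

```python
def parse_method(method: str):
    """Split a "family:codec:level" compression string,
    e.g. "blosc:zstd:3" -> ("blosc", "zstd", 3).
    Hypothetical helper illustrating the convention; zarrio's
    own parsing may accept additional forms.
    """
    family, cname, clevel = method.split(":")
    return family, cname, int(clevel)

print(parse_method("blosc:zstd:3"))  # ('blosc', 'zstd', 3)
```

Note that cname and clevel in CompressionConfig are redundant with the method string here; when both are given they should agree.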
Time Configuration#
Time handling can be configured through the TimeConfig model:
from zarrio.models import TimeConfig

time_config = TimeConfig(
    dim="time",
    append_dim="time",
    global_start="2020-01-01",
    global_end="2023-12-31",
    freq="1D"
)
Access Pattern Configuration#
The access_pattern field controls how intelligent chunking optimizes the data layout:
temporal: Optimized for time series analysis (large time chunks, smaller spatial chunks)
spatial: Optimized for spatial analysis (smaller time chunks, larger spatial chunks)
balanced: Good performance for mixed workloads (moderate chunking across all dimensions)
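To make the trade-off concrete, the table below shows hypothetical chunk shapes for a (time=10000, lat=180, lon=360) archive. These numbers illustrate the pattern, they are not values computed by zarrio's intelligent chunking:

```python
# Illustrative chunk shapes only; actual values depend on
# target_chunk_size_mb and the archive's dimensions.
example_chunks = {
    "temporal": {"time": 1000, "lat": 45,  "lon": 90},   # long time runs
    "spatial":  {"time": 50,   "lat": 180, "lon": 360},  # whole spatial slabs
    "balanced": {"time": 200,  "lat": 90,  "lon": 180},  # middle ground
}

for pattern, chunks in example_chunks.items():
    print(pattern, chunks)
```

Reading a long time series at one point touches few chunks under "temporal" but many under "spatial"; reading a full spatial field at one time step behaves the opposite way.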
Programmatic Configuration:
from zarrio.models import ZarrConverterConfig

config = ZarrConverterConfig(
    access_pattern="temporal"  # Optimize for time series analysis
)
YAML Configuration:
# config.yaml
access_pattern: temporal
access_pattern: temporal
chunking:
  time: 100
  lat: 50
  lon: 100
JSON Configuration:
{
  "access_pattern": "temporal",
  "chunking": {
    "time": 100,
    "lat": 50,
    "lon": 100
  }
}
Variable Configuration#
Variable handling can be configured through the VariableConfig model:
from zarrio.models import VariableConfig

variables = VariableConfig(
    include=["temperature", "pressure"],
    exclude=["humidity"]
)
Loading from Files#
Configuration can be loaded from YAML or JSON files:
YAML Example:#
# config.yaml
chunking:
  time: 150
  lat: 60
  lon: 120
compression:
  method: blosc:zstd:2
  clevel: 2
packing:
  enabled: true
  bits: 16
time:
  dim: time
  append_dim: time
  global_start: "2020-01-01"
  global_end: "2023-12-31"
  freq: "1D"
variables:
  include:
    - temperature
    - pressure
  exclude:
    - humidity
attrs:
  title: YAML Config Demo
  version: 1.0
Loading:#
from zarrio.models import ZarrConverterConfig
config = ZarrConverterConfig.from_yaml_file("config.yaml")
JSON Example:#
{
  "chunking": {
    "time": 125,
    "lat": 55,
    "lon": 110
  },
  "compression": {
    "method": "blosc:lz4:1",
    "clevel": 1
  },
  "packing": {
    "enabled": true,
    "bits": 8
  },
  "time": {
    "dim": "time",
    "append_dim": "time",
    "global_start": "2020-01-01",
    "global_end": "2023-12-31",
    "freq": "1D"
  },
  "variables": {
    "include": ["temperature", "pressure"],
    "exclude": ["humidity"]
  },
  "attrs": {
    "title": "JSON Config Demo",
    "version": "1.0"
  }
}
Loading:#
from zarrio.models import ZarrConverterConfig
config = ZarrConverterConfig.from_json_file("config.json")
Programmatic Configuration#
Configuration can also be created programmatically:
from zarrio.models import (
    ZarrConverterConfig,
    ChunkingConfig,
    PackingConfig,
    CompressionConfig,
    TimeConfig,
    VariableConfig,
)

config = ZarrConverterConfig(
    chunking=ChunkingConfig(time=100, lat=50, lon=100),
    compression=CompressionConfig(method="blosc:zstd:3"),
    packing=PackingConfig(enabled=True, bits=16),
    time=TimeConfig(dim="time", append_dim="time"),
    variables=VariableConfig(include=["temperature"], exclude=["humidity"]),
    attrs={"title": "Programmatic Config Demo", "source": "zarrio"}
)
Validation#
Pydantic models automatically validate configurations:
from pydantic import ValidationError

from zarrio.models import ZarrConverterConfig

try:
    config = ZarrConverterConfig(
        chunking={"time": 1000, "lat": 1, "lon": 1}  # Will generate warnings
    )
except ValidationError as e:
    print(f"Configuration error: {e}")
Missing Data Handling Configuration#
zarrio provides robust missing data handling through the MissingDataConfig model:
from zarrio.models import ZarrConverterConfig, MissingDataConfig

config = ZarrConverterConfig(
    missing_data=MissingDataConfig(
        missing_check_vars="all",  # Check all variables for missing data
        retries_on_missing=3       # Retry up to 3 times if missing data is detected
    )
)
The MissingDataConfig supports the following fields:
missing_check_vars: Variables to check for missing values ("all", None, or a list of variable names) (default: "all")
retries_on_missing: Number of retries if missing values are encountered (default: 0, range: 0-10)
The missing data detection system automatically checks for missing data after writing operations by comparing the source dataset with the written Zarr archive. This is particularly useful for detecting transient failures during parallel processing.
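The core idea of that comparison can be sketched in a few lines: flag any position where the written data is missing but the source was not. This is a simplified stand-in using plain Python, not zarrio's actual implementation, which operates on whole datasets:

```python
import math

def has_unexpected_missing(source, written):
    """True if `written` has NaNs at positions where `source` does not.

    Simplified stand-in for zarrio's post-write missing-data check,
    which compares the source dataset with the written Zarr archive.
    """
    return any(
        math.isnan(w) and not math.isnan(s)
        for s, w in zip(source, written)
    )

nan = float("nan")
src = [1.0, 2.0, nan]
print(has_unexpected_missing(src, [1.0, 2.0, nan]))   # False: same NaN pattern
print(has_unexpected_missing(src, [1.0, nan, nan]))   # True: new NaN from the write
```

NaNs already present in the source are legitimate missing values and are not flagged; only NaNs introduced by the write count as failures.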
When missing data is detected and retries are enabled, zarrio automatically retries the operation with a linearly increasing backoff delay:
First retry: 0.1 second delay
Second retry: 0.2 second delay
Third retry: 0.3 second delay
And so on…
This approach allows the system to recover from transient issues such as network instability, file system contention, or memory pressure.
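The schedule above corresponds to a delay of 0.1 s multiplied by the attempt number. A retry loop along these lines illustrates the behavior (hypothetical sketch, not zarrio's internal code):

```python
import time

def retry_with_backoff(operation, retries=3, base_delay=0.1):
    """Call `operation`, retrying up to `retries` times with a
    linearly increasing delay (0.1 s, 0.2 s, 0.3 s, ...), mirroring
    the schedule described above. Illustrative sketch only.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except RuntimeError:
            if attempt == retries:
                raise  # out of retries: surface the failure
            time.sleep(base_delay * (attempt + 1))

# Delays used before retries 1, 2, and 3:
print([round(0.1 * n, 1) for n in (1, 2, 3)])  # [0.1, 0.2, 0.3]
```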
Check all variables with 2 retries:
config = ZarrConverterConfig(
    missing_data=MissingDataConfig(
        missing_check_vars="all",
        retries_on_missing=2
    )
)
Check specific variables with 3 retries:
config = ZarrConverterConfig(
    missing_data=MissingDataConfig(
        missing_check_vars=["temperature", "pressure"],
        retries_on_missing=3
    )
)
Disable missing data checking:
config = ZarrConverterConfig(
    missing_data=MissingDataConfig(
        missing_check_vars=None,
        retries_on_missing=0
    )
)
Missing data handling can also be configured in YAML files:
# config.yaml
missing_data:
  missing_check_vars: "all"
  retries_on_missing: 3
chunking:
  time: 100
  lat: 50
  lon: 100
Missing data handling options are available through the CLI:
# Convert with retry logic
zarrio convert input.nc output.zarr --retries-on-missing 3
# Append with retry logic
zarrio append new_data.nc existing.zarr --retries-on-missing 2
# Create template with retry logic
zarrio create-template template.nc archive.zarr --retries-on-missing 1
# Write region with retry logic
zarrio write-region data.nc archive.zarr --retries-on-missing 2
Intelligent Chunking with create-template#
When intelligent_chunking is enabled in the create-template command, the configured access_pattern (see Access Pattern Configuration above) determines how the chunking is calculated from the full archive dimensions.
CLI Usage:
# Create template with intelligent chunking for temporal analysis
zarrio create-template template.nc archive.zarr \\
--global-start 2020-01-01 \\
--global-end 2023-12-31 \\
--intelligent-chunking \\
--access-pattern temporal
# Create template with intelligent chunking for spatial analysis
zarrio create-template template.nc archive.zarr \\
--global-start 2020-01-01 \\
--global-end 2023-12-31 \\
--intelligent-chunking \\
--access-pattern spatial
# Create template with intelligent chunking for balanced access
zarrio create-template template.nc archive.zarr \\
--global-start 2020-01-01 \\
--global-end 2023-12-31 \\
--intelligent-chunking \\
--access-pattern balanced
Retry Recommendations#
Moderate Retries: Use 2-3 retries for most use cases
Aggressive Retries: Increase the retry count for very large datasets or unstable environments
Variable Selection: Specify missing_check_vars to focus on critical variables
Monitoring: Watch logs for frequent retries, which might indicate underlying issues
Disabled Retries: Set retries to 0 for deterministic environments where failures should be treated as immediate errors
Configuration with ZarrConverter#
Once you have a configuration, you can use it with ZarrConverter:
from zarrio import ZarrConverter
from zarrio.models import ZarrConverterConfig

config = ZarrConverterConfig(
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression={"method": "blosc:zstd:3"},
    packing={"enabled": True, "bits": 16}
)

converter = ZarrConverter(config=config)
converter.convert("input.nc", "output.zarr")
CLI Configuration#
Configuration files can also be used with the CLI:
zarrio convert input.nc output.zarr --config config.yaml
Environment Variables#
Some configuration can also be set through environment variables:
export ONZARR2_LOG_LEVEL=DEBUG
export ONZARR2_CHUNK_SIZE_TIME=100
export ONZARR2_COMPRESSION=blosc:zstd:3
Default Configuration#
zarrio provides sensible defaults for all configuration options:
from zarrio.models import ZarrConverterConfig
# Default config
config = ZarrConverterConfig()
# Shows default values:
# - chunking: {}
# - compression: None
# - packing: {"enabled": False, "bits": 16}
# - time: {"dim": "time", "append_dim": "time"}
# - variables: {"include": None, "exclude": None}
# - attrs: {}
Best Practices#
Use Configuration Files: For complex setups, use YAML or JSON configuration files
Validate Early: Let Pydantic validate your configurations before using them
Type Safety: Take advantage of type hints for better development experience
Environment Overrides: Use environment variables for deployment-specific settings
Documentation: Document your configuration files with comments
Example Configuration File with Comments:#
# config.yaml - Production configuration for zarrio
#
# This configuration is optimized for:
# - Large climate datasets (1° global daily data)
# - Balanced access patterns (time series and spatial analysis)
# - High compression ratio
# - Efficient storage
# Chunking strategy optimized for balanced access patterns
chunking:
  time: 100  # 100 time steps per chunk (about 3-4 months)
  lat: 50    # 50 latitude points per chunk (about 50°)
  lon: 100   # 100 longitude points per chunk (about 100°)

# Compression settings (Blosc with Zstd level 3)
compression:
  method: blosc:zstd:3
  cname: zstd
  clevel: 3
  shuffle: shuffle

# Data packing (16-bit packing for float32 data)
packing:
  enabled: true
  bits: 16

# Time dimension configuration
time:
  dim: time
  append_dim: time
  # global_start: "2020-01-01"  # Set at runtime
  # global_end: "2023-12-31"    # Set at runtime
  freq: "1D"

# Variable selection
variables:
  include:  # Only include these variables
    - temperature
    - pressure
    - humidity
    - wind_speed
  exclude:  # Exclude these variables (takes precedence)
    - unused_var

# Global attributes to add
attrs:
  title: "Production Climate Dataset"
  institution: "Oceanum"
  source: "zarrio"
  processing_date: "2023-01-01"
  version: "1.0"

# Retry configuration for missing data
retries_on_missing: 3
missing_check_vars: all