Usage Examples#

This section provides practical examples of using zarrio for various scenarios.

Basic Conversion#

Simple conversion of a NetCDF file to Zarr format:

from zarrio import convert_to_zarr

# Convert a single NetCDF file to Zarr
convert_to_zarr("input.nc", "output.zarr")

# Convert with basic options
convert_to_zarr(
    "input.nc",
    "output.zarr",
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression="blosc:zstd:3",
    packing=True,
    packing_bits=16
)
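
After a conversion you can open the resulting store with plain xarray to verify it (standard xarray API, independent of zarrio):

import xarray as xr

# Lazily open the converted store and inspect its contents
ds = xr.open_zarr("output.zarr")
print(ds)          # dimensions, coordinates and variables
print(ds.chunks)   # chunk layout actually written to the store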

Advanced Conversion with Class-Based API#

For more control, use the class-based API:

from zarrio import ZarrConverter
from zarrio.models import (
    ZarrConverterConfig,
    ChunkingConfig,
    CompressionConfig,
    PackingConfig,
)

# Create configuration
config = ZarrConverterConfig(
    chunking=ChunkingConfig(time=100, lat=50, lon=100),
    compression=CompressionConfig(method="blosc:zstd:3"),
    packing=PackingConfig(enabled=True, bits=16),
    attrs={"title": "Demo dataset", "source": "zarrio"}
)

# Create converter
converter = ZarrConverter(config=config)

# Convert data
converter.convert("input.nc", "output.zarr")

Command-Line Interface#

zarrio provides a powerful command-line interface:

# Convert NetCDF to Zarr
zarrio convert input.nc output.zarr

# Convert with chunking
zarrio convert input.nc output.zarr --chunking "time:100,lat:50,lon:100"

# Convert with compression
zarrio convert input.nc output.zarr --compression "blosc:zstd:3"

# Convert with data packing
zarrio convert input.nc output.zarr --packing --packing-bits 16

# Convert with variable selection
zarrio convert input.nc output.zarr --variables "temperature,pressure"

# Convert with variable exclusion
zarrio convert input.nc output.zarr --drop-variables "humidity"

# Convert with additional attributes
zarrio convert input.nc output.zarr --attrs '{"title": "Demo dataset", "source": "zarrio"}'

Parallel Processing#

One of the key features of zarrio is its support for parallel processing: create a template covering the full output, then write independent regions from separate processes.

Template Creation#

First, create a template Zarr archive covering the full time range:

import xarray as xr

from zarrio import ZarrConverter
from zarrio.models import ChunkingConfig, CompressionConfig, PackingConfig

# Create converter
converter = ZarrConverter(
    chunking=ChunkingConfig(time=100, lat=50, lon=100),
    compression=CompressionConfig(method="blosc:zstd:3"),
    packing=PackingConfig(enabled=True, bits=16)
)

# Open a dataset that defines the grid and variables for the template
template_ds = xr.open_dataset("template.nc")

# Create template covering full time range
converter.create_template(
    template_dataset=template_ds,
    output_path="archive.zarr",
    global_start="2020-01-01",
    global_end="2023-12-31",
    compute=False  # Metadata only, no data computation
)

Region Writing#

Then write regions in parallel processes:

# Process 1: Write first region
converter.write_region("data_2020.nc", "archive.zarr")

# Process 2: Write second region
converter.write_region("data_2021.nc", "archive.zarr")

# Process 3: Write third region
converter.write_region("data_2022.nc", "archive.zarr")

# Process 4: Write fourth region
converter.write_region("data_2023.nc", "archive.zarr")
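
In practice each write_region call would run in a separate process. A minimal sketch using the standard library's concurrent.futures, assuming the converter can be constructed inside each worker with the same configuration used to create the template (the filenames and worker count are illustrative):

from concurrent.futures import ProcessPoolExecutor

from zarrio import ZarrConverter
from zarrio.models import ChunkingConfig

FILES = ["data_2020.nc", "data_2021.nc", "data_2022.nc", "data_2023.nc"]

def write_one(ncfile: str) -> str:
    # Build the converter inside the worker so nothing has to be pickled
    converter = ZarrConverter(chunking=ChunkingConfig(time=100, lat=50, lon=100))
    converter.write_region(ncfile, "archive.zarr")
    return ncfile

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        for done in pool.map(write_one, FILES):
            print(f"Wrote {done}")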

CLI Parallel Processing#

You can also use the CLI for parallel processing:

# Create template for parallel writing
zarrio create-template template.nc archive.zarr \
    --global-start 2020-01-01 \
    --global-end 2023-12-31

# Write regions in parallel processes
zarrio write-region data1.nc archive.zarr  # Process 1
zarrio write-region data2.nc archive.zarr  # Process 2
zarrio write-region data3.nc archive.zarr  # Process 3
zarrio write-region data4.nc archive.zarr  # Process 4

Data Appending#

Append new data to existing Zarr stores:

from zarrio import append_to_zarr

# Append data to existing Zarr store
append_to_zarr("new_data.nc", "existing.zarr")

# Append with options
append_to_zarr(
    "new_data.nc",
    "existing.zarr",
    chunking={"time": 50, "lat": 25, "lon": 50},
    variables=["temperature", "pressure"],
    drop_variables=["humidity"]
)
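
You can confirm the append extended the time axis with plain xarray:

import xarray as xr

# The store now covers the combined time range
ds = xr.open_zarr("existing.zarr")
print(ds.sizes["time"], "time steps, ending at", ds.time[-1].values)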

Class-Based Appending#

from zarrio import ZarrConverter
from zarrio.models import ChunkingConfig

# Create converter
converter = ZarrConverter(
    chunking=ChunkingConfig(time=50, lat=25, lon=50)
)

# Append data
converter.append("new_data.nc", "existing.zarr")

CLI Appending#

# Append data to existing Zarr store
zarrio append new_data.nc existing.zarr

# Append with options
zarrio append new_data.nc existing.zarr \
    --chunking "time:50,lat:25,lon:50" \
    --variables "temperature,pressure" \
    --drop-variables "humidity"

Configuration Files#

Use YAML or JSON configuration files:

YAML Configuration#

# config.yaml
chunking:
  time: 150
  lat: 60
  lon: 120
compression:
  method: blosc:zstd:2
  clevel: 2
packing:
  enabled: true
  bits: 16
variables:
  include:
    - temperature
    - pressure
  exclude:
    - humidity
attrs:
  title: YAML Config Demo
  version: 1.0

Usage#

from zarrio import ZarrConverter

# Load from YAML file
converter = ZarrConverter.from_config_file("config.yaml")
converter.convert("input.nc", "output.zarr")

# Or pass the config file to the CLI
zarrio convert input.nc output.zarr --config config.yaml

JSON Configuration#

{
  "chunking": {
    "time": 125,
    "lat": 55,
    "lon": 110
  },
  "compression": {
    "method": "blosc:lz4:1",
    "clevel": 1
  },
  "packing": {
    "enabled": true,
    "bits": 8
  },
  "variables": {
    "include": ["temperature", "pressure"],
    "exclude": ["humidity"]
  },
  "attrs": {
    "title": "JSON Config Demo",
    "version": "1.0"
  }
}

Usage#

from zarrio import ZarrConverter

# Load from JSON file
converter = ZarrConverter.from_config_file("config.json")
converter.convert("input.nc", "output.zarr")

# Or pass the config file to the CLI
zarrio convert input.nc output.zarr --config config.json

Intelligent Chunking#

zarrio provides automatic chunking analysis:

from zarrio import convert_to_zarr

# No chunking specified - automatic analysis
convert_to_zarr(
    "climate_data.nc",
    "climate_data.zarr",
    access_pattern="balanced"  # Optimize for mixed workloads
)

# Temporal analysis optimized
convert_to_zarr(
    "climate_data.nc",
    "climate_data.zarr",
    access_pattern="temporal"  # Optimize for time series analysis
)

# Spatial analysis optimized
convert_to_zarr(
    "climate_data.nc",
    "climate_data.zarr",
    access_pattern="spatial"  # Optimize for spatial analysis
)
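
To see what the automatic analysis chose, open the store and inspect its chunk layout (plain xarray/dask, no zarrio API involved):

import xarray as xr

ds = xr.open_zarr("climate_data.zarr")

# Chunk sizes per dimension as selected by the automatic analysis
for dim, chunks in ds.chunks.items():
    print(f"{dim}: {chunks[0]} ({len(chunks)} chunks)")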

Advanced Features#

Retry Logic for Missing Data#

Handle transient issues with automatic retries:

from zarrio import ZarrConverter
from zarrio.models import ZarrConverterConfig

# Configure retries for missing data
config = ZarrConverterConfig(
    retries_on_missing=3,  # Retry up to 3 times
    missing_check_vars="all"
)

converter = ZarrConverter(config=config)
converter.write_region("data.nc", "archive.zarr")

Data Packing with Validation#

Pack data with automatic validation and warnings:

import xarray as xr

from zarrio import ZarrConverter
from zarrio.models import ZarrConverterConfig, PackingConfig

# Enable data packing with validation
config = ZarrConverterConfig(
    packing=PackingConfig(enabled=True, bits=16)
)

converter = ZarrConverter(config=config)

# Add valid range attributes for packing
ds = xr.open_dataset("input.nc")
ds["temperature"].attrs["valid_min"] = 0.0
ds["temperature"].attrs["valid_max"] = 100.0
ds["pressure"].attrs["valid_min"] = 900.0
ds["pressure"].attrs["valid_max"] = 1100.0
ds.to_netcdf("input_with_valid_range.nc")

# Convert with packing
converter.convert("input_with_valid_range.nc", "output.zarr")
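
For intuition, CF-style packing stores floats as small integers using a linear scale and offset derived from the valid range. A rough sketch of the arithmetic for 16-bit packing (illustrative only; zarrio derives the actual factors internally, and conventions differ on signedness and offset choice):

import numpy as np

valid_min, valid_max, bits = 0.0, 100.0, 16

# Map the valid range linearly onto the unsigned integer range
scale_factor = (valid_max - valid_min) / (2**bits - 1)
add_offset = valid_min

value = 21.7  # an unpacked temperature in degC
packed = np.uint16(round((value - add_offset) / scale_factor))
unpacked = packed * scale_factor + add_offset

# Quantization error is bounded by scale_factor / 2
print(packed, unpacked)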

Complete Workflow Example#

Here’s a complete example showing a typical workflow:

import os
import tempfile

import xarray as xr
import numpy as np
import pandas as pd
from zarrio import ZarrConverter
from zarrio.models import (
    ZarrConverterConfig,
    ChunkingConfig,
    PackingConfig,
    CompressionConfig
)

# 1. Create sample data (in practice, this would come from NetCDF files)
def create_sample_data(filename: str, start_date: str, periods: int) -> str:
    """Create sample climate data."""
    times = pd.date_range(start_date, periods=periods)
    lats = np.linspace(-90, 90, 180)
    lons = np.linspace(-180, 180, 360)

    # Create realistic climate data
    np.random.seed(42)  # For reproducible results
    temperature = 20 + 15 * np.sin(2 * np.pi * np.arange(periods) / 365)  # Seasonal cycle
    temperature = temperature[:, np.newaxis, np.newaxis] + 10 * np.random.random([periods, 180, 360])

    pressure = 1013 + 50 * np.random.random([periods, 180, 360])

    # Create dataset
    ds = xr.Dataset(
        {
            "temperature": (("time", "lat", "lon"), temperature),
            "pressure": (("time", "lat", "lon"), pressure * 1000),
        },
        coords={
            "time": times,
            "lat": lats,
            "lon": lons,
        },
    )

    # Add metadata
    ds.attrs["title"] = "Sample Climate Dataset"
    ds.attrs["institution"] = "zarrio Demo"
    ds["temperature"].attrs["units"] = "degC"
    ds["pressure"].attrs["units"] = "hPa"

    # Save as NetCDF
    ds.to_netcdf(filename)
    return filename

# 2. Create sample data files for demonstration
with tempfile.TemporaryDirectory() as tmpdir:
    # Create annual data files
    files = []
    for year in range(2020, 2024):
        days = 366 if year == 2020 else 365  # 2020 is a leap year
        ncfile = os.path.join(tmpdir, f"data_{year}.nc")
        create_sample_data(ncfile, f"{year}-01-01", days)
        files.append(ncfile)

    # 3. Create template for parallel writing
    config = ZarrConverterConfig(
        chunking=ChunkingConfig(time=100, lat=50, lon=100),
        compression=CompressionConfig(method="blosc:zstd:3"),
        packing=PackingConfig(enabled=True, bits=16),
        retries_on_missing=3,
        missing_check_vars="all"
    )

    converter = ZarrConverter(config=config)

    # Create template covering full time range
    zarr_archive = os.path.join(tmpdir, "climate_archive.zarr")
    template_ds = xr.open_dataset(files[0])
    converter.create_template(
        template_dataset=template_ds,
        output_path=zarr_archive,
        global_start="2020-01-01",
        global_end="2023-12-31",
        compute=False
    )

    # 4. Write regions in parallel (simulated)
    for ncfile in files:
        print(f"Writing {os.path.basename(ncfile)}...")
        converter.write_region(ncfile, zarr_archive)

    # 5. Verify the final archive
    final_ds = xr.open_zarr(zarr_archive)
    print(f"Final archive: {len(final_ds.time)} time steps")
    print(f"Variables: {list(final_ds.data_vars.keys())}")

    print("Complete workflow example finished successfully!")

Error Handling#

zarrio provides comprehensive error handling:

from zarrio import convert_to_zarr
from zarrio.exceptions import ConversionError, PackingError, ConfigurationError

try:
    convert_to_zarr("input.nc", "output.zarr")
except ConversionError as e:
    print(f"Conversion failed: {e}")
except PackingError as e:
    print(f"Packing failed: {e}")
except ConfigurationError as e:
    print(f"Configuration error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")

Performance Optimization#

Optimize for different scenarios:

from zarrio.models import (
    ZarrConverterConfig,
    ChunkingConfig,
    CompressionConfig,
    PackingConfig,
)

# For large datasets with limited memory
config = ZarrConverterConfig(
    chunking=ChunkingConfig(time=50, lat=25, lon=50),  # Smaller chunks
    compression=CompressionConfig(method="blosc:zstd:1")  # Lower compression for speed
)

# For high-compression scenarios
config = ZarrConverterConfig(
    chunking=ChunkingConfig(time=200, lat=100, lon=200),  # Larger chunks for better compression
    compression=CompressionConfig(method="blosc:zstd:9"),  # Higher compression
    packing=PackingConfig(enabled=True, bits=8)  # 8-bit packing for maximum compression
)

# For parallel processing
config = ZarrConverterConfig(
    chunking=ChunkingConfig(time=100, lat=50, lon=100),  # Balanced chunks
    compression=CompressionConfig(method="blosc:zstd:3"),  # Balanced compression
    packing=PackingConfig(enabled=True, bits=16),  # 16-bit packing for good balance
    retries_on_missing=3  # Enable retries for parallel reliability
)
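
The trade-offs are easiest to judge empirically. A small harness comparing conversion time and on-disk size for a given configuration (standard library only; store_size_mb is hypothetical glue, not part of zarrio):

import os
import time

from zarrio import convert_to_zarr

def store_size_mb(path: str) -> float:
    """Total size of all files under a Zarr store directory, in MB."""
    total = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, names in os.walk(path)
        for name in names
    )
    return total / 1e6

start = time.perf_counter()
convert_to_zarr("input.nc", "output.zarr", compression="blosc:zstd:9")
elapsed = time.perf_counter() - start
print(f"{elapsed:.1f} s, {store_size_mb('output.zarr'):.1f} MB")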

Logging and Debugging#

Enable detailed logging for debugging:

import logging

from zarrio import convert_to_zarr

# Setup logging
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

# Convert with verbose logging
convert_to_zarr("input.nc", "output.zarr")

# Or enable verbose logging from the CLI
zarrio convert input.nc output.zarr -vvv

The logs will show:

- Processing steps
- Configuration validation
- Chunking analysis
- Compression and packing
- I/O operations
- Performance metrics
- Error details
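
To keep third-party libraries quiet while still seeing zarrio's messages, configure the package logger directly (this assumes zarrio logs under its package name, which is the stdlib convention but not confirmed here):

import logging

logging.basicConfig(level=logging.WARNING)           # quiet by default
logging.getLogger("zarrio").setLevel(logging.DEBUG)  # verbose for zarrio only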

Datamesh Integration#

zarrio supports integration with Oceanum’s Datamesh platform:

from zarrio import ZarrConverter
from zarrio.models import ZarrConverterConfig

# Configure for datamesh
config = ZarrConverterConfig(
    datamesh={
        "datasource": {
            "id": "my_climate_data",
            "name": "My Climate Data",
            "description": "Climate data converted with zarrio",
            "coordinates": {"x": "lon", "y": "lat", "t": "time"},
            "details": "https://example.com",
            "tags": ["climate", "zarrio", "datamesh"],
        },
        "token": "your_datamesh_token",
        "service": "https://datamesh-v1.oceanum.io",
    },
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression={"method": "blosc:zstd:3"},
)

# Create converter
converter = ZarrConverter(config=config)

# Convert data directly to datamesh (no output_path needed)
converter.convert("input.nc")

CLI Datamesh Integration#

Use the CLI with datamesh:

# Convert to datamesh datasource
zarrio convert input.nc \
  --datamesh-datasource '{"id":"my_climate_data","name":"My Climate Data","coordinates":{"x":"lon","y":"lat","t":"time"}}' \
  --datamesh-token $DATAMESH_TOKEN

# Create template for parallel writing
zarrio create-template template.nc \
  --datamesh-datasource '{"id":"my_climate_data","name":"My Climate Data","coordinates":{"x":"lon","y":"lat","t":"time"}}' \
  --datamesh-token $DATAMESH_TOKEN \
  --global-start 2023-01-01 \
  --global-end 2023-12-31

# Write region to datamesh datasource
zarrio write-region data.nc \
  --datamesh-datasource '{"id":"my_climate_data","name":"My Climate Data","coordinates":{"x":"lon","y":"lat","t":"time"}}' \
  --datamesh-token $DATAMESH_TOKEN
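
Rather than hand-writing the inline JSON, you can build the datasource mapping in Python and dump it for the CLI flag (standard library json; the fields mirror the example above):

import json

datasource = {
    "id": "my_climate_data",
    "name": "My Climate Data",
    "coordinates": {"x": "lon", "y": "lat", "t": "time"},
}

# Paste the output into --datamesh-datasource
print(json.dumps(datasource))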

Configuration File with Datamesh#

Use YAML configuration with datamesh:

# config.yaml
chunking:
  time: 100
  lat: 50
  lon: 100
compression:
  method: blosc:zstd:3
datamesh:
  datasource:
    id: my_climate_data
    name: My Climate Data
    description: Climate data converted with zarrio
    coordinates:
      x: lon
      y: lat
      t: time
    details: https://example.com
    tags:
      - climate
      - zarrio
      - datamesh
  token: your_datamesh_token
  service: https://datamesh-v1.oceanum.io

Usage#

from zarrio import ZarrConverter

# Load from YAML file
converter = ZarrConverter.from_config_file("config.yaml")
converter.convert("input.nc")

# Or pass the config file to the CLI
zarrio convert input.nc --config config.yaml