Data Packing

zarrio provides data packing to reduce storage requirements and improve I/O performance by encoding floating-point data as integers with a fixed scale and offset.

Overview

Data packing converts floating-point data to integers using a fixed scale and offset. Packing 64-bit floats into 16-bit integers, for example, cuts the uncompressed size of a variable by a factor of four, at the cost of quantizing values to the chosen bit width. This is particularly useful for climate and weather data, where full 64-bit precision is often unnecessary.

The packing process uses the formula:

packed_value = (original_value - offset) / scale

Here scale and offset are computed from the value range so that the range fits within the specified bit width, and the result is rounded to the nearest integer for storage.
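As a minimal sketch of this formula, the NumPy snippet below packs and unpacks a small array using unsigned 16-bit integers. It assumes that offset is the range minimum and that scale is the range width divided by the largest representable integer (2**bits - 1); the exact expressions zarrio uses are not stated here, and compute_scale_offset, pack, and unpack are hypothetical helpers, not part of the zarrio API.

```python
import numpy as np


def compute_scale_offset(vmin, vmax, bits):
    """Return (scale, offset) mapping [vmin, vmax] onto 0 .. 2**bits - 1 (assumed scheme)."""
    offset = float(vmin)
    scale = (float(vmax) - float(vmin)) / (2**bits - 1)
    return scale, offset


def pack(values, scale, offset, bits):
    """Apply packed = round((value - offset) / scale) and store as unsigned integers."""
    dtype = {8: np.uint8, 16: np.uint16, 32: np.uint32}[bits]
    return np.round((values - offset) / scale).astype(dtype)


def unpack(packed, scale, offset):
    """Invert the packing: value = packed * scale + offset."""
    return packed.astype(np.float64) * scale + offset


temperature = np.array([-50.0, -12.3, 0.0, 21.7, 50.0])
scale, offset = compute_scale_offset(-50.0, 50.0, bits=16)
packed = pack(temperature, scale, offset, bits=16)
restored = unpack(packed, scale, offset)

# The round-trip error is bounded by half a quantization step (scale / 2).
max_error = np.abs(restored - temperature).max()
```

Under these assumptions the range endpoints map to the smallest and largest stored integers, and every value inside the range is recovered to within scale / 2.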

Configuration

Packing can be configured through the PackingConfig model:

from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16
)

The PackingConfig supports the following fields:

  • enabled: Whether to enable data packing (default: False)

  • bits: Number of bits for packing (8, 16, or 32) (default: 16)

  • manual_ranges: Manual min/max ranges for variables (default: None)

  • auto_buffer_factor: Buffer factor for automatically calculated ranges (default: 0.01)

  • check_range_exceeded: Whether to check if data exceeds specified ranges (default: True)

  • range_exceeded_action: Action when data exceeds range ("warn", "error", or "ignore") (default: "warn")

Enhanced Packing Features

zarrio’s enhanced packing functionality provides several improvements over basic packing:

Priority-Based Range Determination

The enhanced packing system uses a clear priority order for determining the min/max values used for packing:

  1. Manual ranges (if provided)

  2. Variable attributes (valid_min/valid_max)

  3. Automatic calculation from data (with warnings)
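The priority order above can be sketched as a small lookup function. This is an illustration only: resolve_range is a hypothetical helper, and the real zarrio implementation may differ in details such as how it handles a variable that has only one of valid_min and valid_max.

```python
import warnings

import numpy as np


def resolve_range(name, data, attrs, manual_ranges=None):
    """Pick (min, max, source) for one variable using the documented priority order."""
    # 1. Manual ranges take precedence when the variable is listed.
    if manual_ranges and name in manual_ranges:
        rng = manual_ranges[name]
        return rng["min"], rng["max"], "manual"
    # 2. Fall back to valid_min / valid_max attributes (assumed: both must exist).
    if "valid_min" in attrs and "valid_max" in attrs:
        return attrs["valid_min"], attrs["valid_max"], "attributes"
    # 3. Otherwise derive the range from the data and warn about it.
    warnings.warn(f"Range for {name!r} calculated from data")
    return float(np.nanmin(data)), float(np.nanmax(data)), "auto"


data = np.array([1.5, 4.0, 2.5])
attrs = {"valid_min": 0, "valid_max": 10}

manual = resolve_range("temperature", data, attrs, {"temperature": {"min": -50, "max": 50}})
from_attrs = resolve_range("temperature", data, attrs)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning for this demonstration
    auto = resolve_range("temperature", data, {})
```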

Manual Range Specification

Users can explicitly specify min/max ranges for variables:

from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16,
    manual_ranges={
        "temperature": {"min": -50, "max": 50},
        "pressure": {"min": 90000, "max": 110000}
    }
)

Automatic Range Calculation with Buffer

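When neither a manual range nor valid_min/valid_max attributes are available for a variable, the system calculates the range from the data and widens it according to auto_buffer_factor: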

from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16,
    auto_buffer_factor=0.05  # 5% buffer
)
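How the buffer is applied is not spelled out above. One plausible reading, sketched below, is that each end of the observed range is extended by the buffer factor times the range width. Both buffered_range and that interpretation are assumptions made for illustration.

```python
import numpy as np


def buffered_range(data, buffer_factor=0.01):
    """Widen the observed min/max by buffer_factor times the data range (assumed rule)."""
    vmin = float(np.nanmin(data))
    vmax = float(np.nanmax(data))
    margin = (vmax - vmin) * buffer_factor
    return vmin - margin, vmax + margin


data = np.array([10.0, 15.0, 30.0])
# The observed range is 10..30 (width 20), so a 5% buffer widens each side by 1.0.
low, high = buffered_range(data, buffer_factor=0.05)
```

A larger buffer leaves more headroom for values that arrive later, but it also spreads the available integer levels over a wider range and so coarsens the precision slightly.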

Range Exceeded Validation

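When check_range_exceeded is enabled, the data is checked against the specified ranges, and range_exceeded_action determines what happens if a value falls outside them: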

from zarrio.models import PackingConfig

packing = PackingConfig(
    enabled=True,
    bits=16,
    manual_ranges={"temperature": {"min": -50, "max": 50}},
    check_range_exceeded=True,
    range_exceeded_action="error"  # or "warn" or "ignore"
)
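The three actions can be illustrated with a small stand-alone check. This is a sketch of the behaviour described above, not zarrio's code: check_range is a hypothetical function, and the choice of ValueError for the "error" action is an assumption.

```python
import warnings

import numpy as np


def check_range(name, data, vmin, vmax, action="warn"):
    """Return True if all values lie in [vmin, vmax]; otherwise act according to `action`."""
    if action not in ("warn", "error", "ignore"):
        raise ValueError(f"Unknown action: {action!r}")
    exceeded = bool(np.any((data < vmin) | (data > vmax)))
    if not exceeded or action == "ignore":
        return not exceeded
    message = f"{name}: data outside [{vmin}, {vmax}]"
    if action == "error":
        raise ValueError(message)  # assumed exception type
    warnings.warn(message)
    return False


temperature = np.array([-10.0, 20.0, 75.0])  # 75.0 exceeds the upper bound of 50

try:
    check_range("temperature", temperature, -50, 50, action="error")
    raised = False
except ValueError:
    raised = True
```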

Usage Examples

Functional API

from zarrio import convert_to_zarr

# Basic packing
convert_to_zarr(
    "input.nc",
    "output.zarr",
    packing=True,
    packing_bits=16
)

# Packing with manual ranges
convert_to_zarr(
    "input.nc",
    "output.zarr",
    packing=True,
    packing_bits=16,
    packing_manual_ranges={
        "temperature": {"min": -50, "max": 50},
        "pressure": {"min": 90000, "max": 110000}
    }
)

# Packing with automatic range calculation
convert_to_zarr(
    "input.nc",
    "output.zarr",
    packing=True,
    packing_bits=16,
    packing_auto_buffer_factor=0.05
)

Class-Based API

from zarrio import ZarrConverter
from zarrio.models import PackingConfig

# Programmatic configuration
packing_config = PackingConfig(
    enabled=True,
    bits=16,
    manual_ranges={
        "temperature": {"min": -50, "max": 50}
    }
)

converter = ZarrConverter(packing=packing_config)
converter.convert("input.nc", "output.zarr")

Command Line Interface

# Basic packing
zarrio convert input.nc output.zarr --packing --packing-bits 16

# Packing with manual ranges
zarrio convert input.nc output.zarr --packing \
    --packing-manual-ranges '{"temperature": {"min": -50, "max": 50}}'

# Packing with automatic range calculation
zarrio convert input.nc output.zarr --packing \
    --packing-auto-buffer-factor 0.05

Configuration Files

YAML:

# config.yaml
packing:
  enabled: true
  bits: 16
  manual_ranges:
    temperature:
      min: -50
      max: 50
    pressure:
      min: 90000
      max: 110000
  auto_buffer_factor: 0.05
  check_range_exceeded: true
  range_exceeded_action: warn

JSON:

{
  "packing": {
    "enabled": true,
    "bits": 16,
    "manual_ranges": {
      "temperature": {
        "min": -50,
        "max": 50
      }
    },
    "auto_buffer_factor": 0.05,
    "check_range_exceeded": true,
    "range_exceeded_action": "warn"
  }
}
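Before handing a configuration file to zarrio, it can help to check the packing section against the field descriptions given earlier. The sketch below uses only the standard library. validate_packing_section is a hypothetical helper, and the rule that min must be smaller than max is an assumption rather than documented behaviour.

```python
import json

ALLOWED_BITS = {8, 16, 32}
ALLOWED_ACTIONS = {"warn", "error", "ignore"}


def validate_packing_section(config):
    """Check the `packing` section of a parsed config and return it."""
    packing = config.get("packing", {})
    if packing.get("bits", 16) not in ALLOWED_BITS:
        raise ValueError("bits must be 8, 16, or 32")
    if packing.get("range_exceeded_action", "warn") not in ALLOWED_ACTIONS:
        raise ValueError("range_exceeded_action must be warn, error, or ignore")
    for name, rng in (packing.get("manual_ranges") or {}).items():
        # Assumed sanity rule: a usable range needs min < max.
        if rng["min"] >= rng["max"]:
            raise ValueError(f"{name}: min must be smaller than max")
    return packing


text = """
{"packing": {"enabled": true, "bits": 16,
             "manual_ranges": {"temperature": {"min": -50, "max": 50}}}}
"""
packing = validate_packing_section(json.loads(text))
```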

Best Practices

  1. Use Manual Ranges When Possible: If you know the valid range of your data, specify it manually for optimal packing.

  2. Consider Data Distribution: For data with non-uniform distributions, manual ranges may provide better precision.

  3. Monitor Range Exceeded Warnings: Pay attention to warnings about data exceeding specified ranges.

  4. Choose Appropriate Bit Width:

       • 8 bits: High compression, lower precision

       • 16 bits: Good balance of compression and precision

       • 32 bits: Higher precision, lower compression

  5. Use Buffer for Automatic Ranges: When using automatic range calculation, add a buffer to account for future data.

  6. Validate Your Data: Use the range exceeded checking feature to catch data anomalies.
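To make the bit-width trade-off in item 4 concrete, the snippet below computes the quantization step (the smallest difference that survives packing) for a temperature range of -50 to 50. It assumes the full range is spread evenly over 2**bits - 1 integer levels; quantization_step is a hypothetical helper.

```python
def quantization_step(vmin, vmax, bits):
    """Smallest representable difference when [vmin, vmax] spans 2**bits - 1 levels."""
    return (vmax - vmin) / (2**bits - 1)


# Temperature range of -50 to 50 at each supported bit width.
steps = {bits: quantization_step(-50.0, 50.0, bits) for bits in (8, 16, 32)}
for bits, step in steps.items():
    print(f"{bits:>2} bits: step = {step:.3g}")
```

For this range the step is roughly 0.39 at 8 bits, 0.0015 at 16 bits, and 2.3e-8 at 32 bits, which is why 16 bits is usually sufficient for measured quantities such as temperature.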

Warning System

The enhanced packing system provides informative warnings in various scenarios:

  • When manual ranges override existing attributes

  • When ranges are calculated automatically (with a note about potential inaccuracy for region-based archives)

  • When data exceeds specified ranges

These warnings help ensure data integrity and inform users about potential issues.

Technical Details

The packing implementation uses the FixedScaleOffset codec (provided by numcodecs and used with zarr as a filter) to perform the actual encoding. The Packer class computes the scale and offset parameters from the configured ranges.

For region-based archives written over time, automatically calculated ranges may be inaccurate since they’re based only on the current region’s data. Manual ranges or attribute-based ranges are recommended for these scenarios.
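The problem with region-based archives can be seen with a toy example. The values below are made up, and fits is a hypothetical helper: a range derived from the first region alone does not cover data written later.

```python
import numpy as np


def fits(values, vmin, vmax):
    """True if every value lies inside the packing range [vmin, vmax]."""
    return bool(np.all((values >= vmin) & (values <= vmax)))


# Two regions of the same variable written at different times (made-up values).
region_1 = np.array([2.0, 5.0, 9.0])
region_2 = np.array([4.0, 12.0, 20.0])

# Range derived only from the first region, as automatic calculation would do.
auto_min, auto_max = region_1.min(), region_1.max()

first_ok = fits(region_1, auto_min, auto_max)
second_ok = fits(region_2, auto_min, auto_max)  # False: 12.0 and 20.0 exceed 9.0

# A manual range chosen for the whole archive covers both regions.
manual_ok = fits(np.concatenate([region_1, region_2]), 0.0, 25.0)
```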