Data Packing#
zarrio provides advanced data packing functionality to reduce storage requirements and improve I/O performance by compressing floating-point data using fixed-scale offset encoding.
Overview#
Data packing converts floating-point data to integer representations with a fixed scale and offset, significantly reducing storage size while maintaining reasonable precision. This is particularly useful for climate and weather data where full 64-bit precision is often unnecessary.
The packing process uses the formula:
packed_value = (original_value - offset) / scale
Where scale and offset are computed based on the data range to optimally fit within the specified bit width.
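To make the formula concrete, here is a minimal pure-Python sketch of the round trip. It assumes that the offset is the range minimum and that the scale spreads the range over the 2**bits - 1 available integer steps; zarrio's exact parameter computation may differ. The helpers compute_scale_offset, pack, and unpack are hypothetical names used only for illustration and are not part of zarrio.

```python
def compute_scale_offset(vmin, vmax, bits):
    """Assumed scheme: map [vmin, vmax] onto the integers 0 .. 2**bits - 1."""
    levels = 2**bits - 1
    offset = vmin
    scale = (vmax - vmin) / levels
    return scale, offset


def pack(values, scale, offset):
    """Apply packed_value = (original_value - offset) / scale, rounded to an integer."""
    return [round((v - offset) / scale) for v in values]


def unpack(packed, scale, offset):
    """Invert the packing: original_value is approximately packed_value * scale + offset."""
    return [p * scale + offset for p in packed]


temperatures = [-50.0, -12.34, 3.2, 21.57, 50.0]
scale, offset = compute_scale_offset(-50.0, 50.0, bits=16)
packed = pack(temperatures, scale, offset)
restored = unpack(packed, scale, offset)

# The round-trip error is bounded by half a quantization step (scale / 2).
max_error = max(abs(a - b) for a, b in zip(temperatures, restored))
print(packed)
print(max_error <= scale / 2)
```

Under these assumptions the range minimum packs to 0 and the range maximum to 2**bits - 1, and every restored value lies within half a step of the original.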
Configuration#
Packing can be configured through the PackingConfig model:
from zarrio.models import PackingConfig
packing = PackingConfig(
    enabled=True,
    bits=16
)
The PackingConfig model supports the following fields:
enabled: Whether to enable data packing (default: False)
bits: Number of bits for packing (8, 16, or 32) (default: 16)
manual_ranges: Manual min/max ranges for variables (default: None)
auto_buffer_factor: Buffer factor for automatically calculated ranges (default: 0.01)
check_range_exceeded: Whether to check if data exceeds specified ranges (default: True)
range_exceeded_action: Action when data exceeds range ('warn', 'error', 'ignore') (default: 'warn')
Enhanced Packing Features#
zarrio’s enhanced packing functionality provides several improvements over basic packing:
Priority-Based Range Determination#
The enhanced packing system uses a clear priority order for determining the min/max values used for packing:
1. Manual ranges (if provided)
2. Variable attributes (valid_min/valid_max)
3. Automatic calculation from data (with warnings)
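The priority order above can be sketched as a small fallback chain. The function resolve_range and its signature are hypothetical and only illustrate the described behaviour; they are not zarrio's actual implementation.

```python
import warnings


def resolve_range(name, values, attrs=None, manual_ranges=None):
    """Return (vmin, vmax, source) following the documented priority order.

    Hypothetical helper for illustration only.
    """
    attrs = attrs or {}
    manual_ranges = manual_ranges or {}

    # Highest priority: a manual range given for this variable.
    if name in manual_ranges:
        r = manual_ranges[name]
        return r["min"], r["max"], "manual"

    # Next: valid_min / valid_max variable attributes.
    if "valid_min" in attrs and "valid_max" in attrs:
        return attrs["valid_min"], attrs["valid_max"], "attributes"

    # Last resort: derive the range from the data and warn about it.
    warnings.warn(f"Range for '{name}' calculated automatically from data")
    return min(values), max(values), "auto"


data = [271.3, 284.9, 293.1]
print(resolve_range("temperature", data,
                    manual_ranges={"temperature": {"min": -50, "max": 50}}))
print(resolve_range("temperature", data,
                    attrs={"valid_min": 200.0, "valid_max": 330.0}))
```

A manual range wins even when the attributes are also present, which matches the warning listed for manual ranges overriding existing attributes.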
Manual Range Specification#
Users can explicitly specify min/max ranges for variables:
from zarrio.models import PackingConfig
packing = PackingConfig(
    enabled=True,
    bits=16,
    manual_ranges={
        "temperature": {"min": -50, "max": 50},
        "pressure": {"min": 90000, "max": 110000}
    }
)
Automatic Range Calculation with Buffer#
When neither manual ranges nor valid_min/valid_max attributes are available, the system calculates the range from the data and widens it by auto_buffer_factor:
from zarrio.models import PackingConfig
packing = PackingConfig(
    enabled=True,
    bits=16,
    auto_buffer_factor=0.05  # 5% buffer
)
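How the buffer is applied is not spelled out here. One plausible reading, assumed in the sketch below, is that each end of the observed range is extended by auto_buffer_factor times the range width. The helper buffered_range is hypothetical and not part of zarrio.

```python
def buffered_range(values, buffer_factor=0.01):
    """Widen the observed [min, max] by buffer_factor * width on each side.

    Assumed behaviour for illustration; the library may apply the buffer differently.
    """
    vmin, vmax = min(values), max(values)
    pad = (vmax - vmin) * buffer_factor
    return vmin - pad, vmax + pad


pressure = [98000.0, 101325.0, 103000.0]
low, high = buffered_range(pressure, buffer_factor=0.05)
print(low, high)
```

With an observed width of 5000 and a factor of 0.05, this adds a margin of about 250 at each end, so later values slightly outside the observed extremes still fit the packing range.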
Range Exceeded Validation#
Range checking, enabled by default, verifies that the data does not exceed the specified ranges; range_exceeded_action controls what happens when it does:
from zarrio.models import PackingConfig
packing = PackingConfig(
    enabled=True,
    bits=16,
    manual_ranges={"temperature": {"min": -50, "max": 50}},
    check_range_exceeded=True,
    range_exceeded_action="error"  # or "warn" or "ignore"
)
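The three actions can be pictured with the following sketch. The function check_range, its return value, and its message text are assumptions made for illustration; zarrio's internal check may raise a different exception type or word its warnings differently.

```python
import warnings


def check_range(name, values, vmin, vmax, action="warn"):
    """Return True if all values lie in [vmin, vmax]; otherwise react per action.

    Hypothetical helper mirroring the 'warn', 'error', and 'ignore' options.
    """
    if action not in ("warn", "error", "ignore"):
        raise ValueError(f"unknown action: {action!r}")
    lo, hi = min(values), max(values)
    if lo >= vmin and hi <= vmax:
        return True
    message = f"{name}: data range [{lo}, {hi}] exceeds packing range [{vmin}, {vmax}]"
    if action == "error":
        raise ValueError(message)
    if action == "warn":
        warnings.warn(message)
    return False  # "warn" and "ignore" both let processing continue


try:
    check_range("temperature", [-12.0, 63.5], -50, 50, action="error")
except ValueError as exc:
    print(f"rejected: {exc}")
```

Choosing "error" stops the conversion before out-of-range values are packed, while "warn" and "ignore" let it continue.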
Usage Examples#
Functional API#
from zarrio import convert_to_zarr
# Basic packing
convert_to_zarr(
    "input.nc",
    "output.zarr",
    packing=True,
    packing_bits=16
)
# Packing with manual ranges
convert_to_zarr(
    "input.nc",
    "output.zarr",
    packing=True,
    packing_bits=16,
    packing_manual_ranges={
        "temperature": {"min": -50, "max": 50},
        "pressure": {"min": 90000, "max": 110000}
    }
)
# Packing with automatic range calculation
convert_to_zarr(
    "input.nc",
    "output.zarr",
    packing=True,
    packing_bits=16,
    packing_auto_buffer_factor=0.05
)
Class-Based API#
from zarrio import ZarrConverter
from zarrio.models import PackingConfig
# Programmatic configuration
packing_config = PackingConfig(
    enabled=True,
    bits=16,
    manual_ranges={
        "temperature": {"min": -50, "max": 50}
    }
)
converter = ZarrConverter(packing=packing_config)
converter.convert("input.nc", "output.zarr")
Command Line Interface#
# Basic packing
zarrio convert input.nc output.zarr --packing --packing-bits 16
# Packing with manual ranges
zarrio convert input.nc output.zarr --packing \
    --packing-manual-ranges '{"temperature": {"min": -50, "max": 50}}'
# Packing with automatic range calculation
zarrio convert input.nc output.zarr --packing \
    --packing-auto-buffer-factor 0.05
Configuration Files#
YAML:
# config.yaml
packing:
  enabled: true
  bits: 16
  manual_ranges:
    temperature:
      min: -50
      max: 50
    pressure:
      min: 90000
      max: 110000
  auto_buffer_factor: 0.05
  check_range_exceeded: true
  range_exceeded_action: warn
JSON:
{
  "packing": {
    "enabled": true,
    "bits": 16,
    "manual_ranges": {
      "temperature": {
        "min": -50,
        "max": 50
      }
    },
    "auto_buffer_factor": 0.05,
    "check_range_exceeded": true,
    "range_exceeded_action": "warn"
  }
}
Best Practices#
Use Manual Ranges When Possible: If you know the valid range of your data, specify it manually for optimal packing.
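Consider Data Distribution: Packed precision is uniform across the whole range, so for data with non-uniform distributions (for example, a few extreme outliers that stretch the automatic range), a manual range fitted to the values that matter may provide better precision.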
Monitor Range Exceeded Warnings: Pay attention to warnings about data exceeding specified ranges.
Choose Appropriate Bit Width:
- 8 bits: High compression, lower precision
- 16 bits: Good balance of compression and precision
- 32 bits: Higher precision, lower compression
Use Buffer for Automatic Ranges: When using automatic range calculation, add a buffer to account for future data.
Validate Your Data: Use the range exceeded checking feature to catch data anomalies.
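To put numbers on the bit-width trade-off, the sketch below computes the smallest representable step for a given range. It assumes the range is spread evenly over 2**bits - 1 integer steps; packing_resolution is a hypothetical helper, not a zarrio function.

```python
def packing_resolution(vmin, vmax, bits):
    """Smallest representable step when [vmin, vmax] spans 2**bits - 1 integer steps."""
    return (vmax - vmin) / (2**bits - 1)


# Compare the three supported bit widths for a temperature range of -50 to 50.
for bits in (8, 16, 32):
    step = packing_resolution(-50.0, 50.0, bits)
    print(f"{bits:>2} bits: step of about {step:.3g} across a 100-unit range")
```

For this 100-unit range the step is roughly 0.39 at 8 bits, 0.0015 at 16 bits, and 2.3e-8 at 32 bits. A tighter manual range shrinks the step proportionally, which is why accurate ranges matter as much as the bit width.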
Warning System#
The enhanced packing system provides informative warnings in various scenarios:
When manual ranges override existing attributes
When automatically calculating ranges (with note about potential inaccuracy for region-based archives)
When data exceeds specified ranges
These warnings help ensure data integrity and inform users about potential issues.
Technical Details#
The packing implementation uses zarr’s FixedScaleOffset codec to perform the actual compression. The Packer class handles the computation of scale and offset parameters based on the configured ranges.
For region-based archives written over time, automatically calculated ranges may be inaccurate since they’re based only on the current region’s data. Manual ranges or attribute-based ranges are recommended for these scenarios.
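The following sketch illustrates the problem with made-up numbers. A range derived from the first region is reused when a later region with larger values is written. Out-of-range values are clipped here to keep the effect visible; this is an assumption, and the real codec may handle out-of-range values differently (for example, by overflowing). Both helper functions are hypothetical.

```python
def pack_clipped(values, vmin, vmax, bits=16):
    """Pack values into 0 .. 2**bits - 1, clipping anything outside [vmin, vmax]."""
    levels = 2**bits - 1
    scale = (vmax - vmin) / levels
    packed = []
    for v in values:
        clipped = min(max(v, vmin), vmax)
        packed.append(round((clipped - vmin) / scale))
    return packed, scale


def unpack(packed, vmin, scale):
    """Restore approximate original values from packed integers."""
    return [p * scale + vmin for p in packed]


first_region = [10.0, 14.0, 20.0]
later_region = [16.0, 24.0, 30.0]

# Range derived from the first region only, then reused for a later write.
vmin, vmax = min(first_region), max(first_region)
packed, scale = pack_clipped(later_region, vmin, vmax)
restored = unpack(packed, vmin, scale)
print(restored)
```

In this sketch the later values 24.0 and 30.0 both come back as roughly 20.0, the maximum seen in the first region. Supplying a manual range or valid_min/valid_max attributes that cover the whole archive avoids this.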