Analysis Tool#
The analyze command in zarrio provides comprehensive analysis of NetCDF files to help users optimize their Zarr conversion process. It examines the dataset and provides recommendations for chunking, packing, and compression options.
Usage#
Basic analysis:
zarrio analyze input.nc
Analysis with performance testing:
zarrio analyze input.nc --test-performance
Analysis with custom target chunk size:
zarrio analyze input.nc --target-chunk-size-mb 100
Interactive configuration setup:
zarrio analyze input.nc --interactive
Features#
Dataset Information#
The analysis tool provides detailed information about the dataset:
Dimensions and their sizes
Variables and their data types
Coordinates
Size estimates for each variable and the total dataset
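The size estimate is the uncompressed in-memory size: the number of elements multiplied by the bytes per element. The sketch below is not zarrio's code; the helper name `estimate_size_mb` is hypothetical, and it assumes 1 MB means 1024 × 1024 bytes.

```python
from math import prod

BYTES_PER_MB = 1024 * 1024  # assumption: sizes are reported in binary megabytes


def estimate_size_mb(shape, itemsize):
    """Uncompressed size of an array: element count times bytes per element."""
    return prod(shape) * itemsize / BYTES_PER_MB


# A float64 variable (8 bytes/element) with time=100, lat=180, lon=360
size = estimate_size_mb((100, 180, 360), itemsize=8)
print(f"{size:.2f} MB")  # 49.44 MB
```

Summing this value over all variables gives the total dataset estimate.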
Chunking Analysis#
The tool analyzes the dataset and provides recommendations for three access patterns:
Temporal Access Pattern: Optimized for time series analysis
Spatial Access Pattern: Optimized for spatial analysis
Balanced Access Pattern: Good for mixed workloads
For each pattern, it recommends:
Chunk sizes for each dimension
Estimated chunk size in MB
Notes about the optimization strategy
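To see how a set of chunk sizes translates into an estimated chunk size in MB, the arithmetic can be sketched as follows. The function names are hypothetical, the element size of 4 bytes is an assumption chosen for illustration, and zarrio's own calculation may differ.

```python
from math import prod

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def chunk_size_mb(chunks, itemsize):
    """Uncompressed size of one chunk given per-dimension chunk lengths."""
    return prod(chunks.values()) * itemsize / BYTES_PER_MB


def chunk_count(dims, chunks):
    """Number of chunks needed to cover the array (ceiling division per dimension)."""
    return prod(-(-dims[name] // chunks[name]) for name in dims)


dims = {"time": 100, "lat": 180, "lon": 360}
chunks = {"time": 10, "lat": 180, "lon": 360}

print(round(chunk_size_mb(chunks, itemsize=4), 2))  # 2.47
print(chunk_count(dims, chunks))                    # 10
```

Smaller chunks produce more chunk objects (and more metadata), while larger chunks need more memory per read, so both numbers are worth checking together.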
Packing Analysis#
The analysis identifies:
Variables that already have valid_min/valid_max attributes
Variables that are missing valid range attributes
Recommendations for adding attributes for optimal packing
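The valid range matters because packing maps floating-point values onto a fixed number of integer levels, and the mapping cannot be computed without known bounds. The following is a minimal sketch of linear scale/offset packing onto unsigned integers. It illustrates the general technique only; the function names are hypothetical and zarrio's actual packing scheme may differ (for example, in how it reserves a fill value).

```python
def packing_params(valid_min, valid_max, bits=16):
    """Scale and offset that map [valid_min, valid_max] onto 0 .. 2**bits - 1."""
    levels = 2**bits - 1
    scale_factor = (valid_max - valid_min) / levels
    add_offset = valid_min
    return scale_factor, add_offset


def pack(value, scale_factor, add_offset):
    """Quantize a float to the nearest integer level."""
    return round((value - add_offset) / scale_factor)


def unpack(packed, scale_factor, add_offset):
    """Recover an approximation of the original float."""
    return packed * scale_factor + add_offset


# Hypothetical temperature range in degrees Celsius
scale, offset = packing_params(-50.0, 50.0, bits=16)
packed = pack(21.37, scale, offset)
restored = unpack(packed, scale, offset)

# The round-trip error is at most half a quantization step.
print(abs(restored - 21.37) <= scale / 2)  # True
```

A tighter valid range gives a smaller quantization step, which is why accurate valid_min/valid_max attributes improve packed precision.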
Compression Analysis#
The tool lists common compression options with their characteristics:
blosc:zstd:1 - Fast compression, good balance
blosc:zstd:3 - Higher compression, slower
blosc:lz4:1 - Very fast compression
blosc:lz4:3 - Higher compression, slower
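Each option string above appears to follow a codec:algorithm:level layout. As an illustration of how such a string breaks down, here is a small hypothetical parser; it is not part of zarrio, and the field names are assumptions.

```python
from typing import NamedTuple


class CompressionSpec(NamedTuple):
    codec: str      # e.g. "blosc"
    algorithm: str  # e.g. "zstd" or "lz4"
    level: int      # higher usually means smaller output but more CPU time


def parse_compression(spec):
    """Split a 'codec:algorithm:level' string into its three fields."""
    codec, algorithm, level = spec.split(":")
    return CompressionSpec(codec, algorithm, int(level))


print(parse_compression("blosc:zstd:3"))
# CompressionSpec(codec='blosc', algorithm='zstd', level=3)
```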
Performance Testing#
When using the --test-performance flag, the tool analyzes the data characteristics and reports the theoretical benefits of different compression and packing options:
Theoretical Benefits: Calculates potential size reductions based on data types
Typical Compression Ratios: Shows empirical compression ratios for common options
Performance Considerations: Explains trade-offs between compression and performance
Recommendations: Advises using --run-tests for real-world measurements
When using the --run-tests flag, the tool runs actual conversion tests on a subset of the data to measure real-world benefits:
No compression, no packing: Baseline for comparison
Packing 16-bit: 16-bit packing for floating-point data
Packing 8-bit: 8-bit packing for maximum compression
Blosc Zstd Level 1: Fast compression with good ratio
Blosc Zstd Level 3: Higher compression, slower
Blosc LZ4 Level 1: Very fast compression
Packing + Blosc Zstd: Combined packing and compression
For each configuration, the tool measures:
Output size: Actual disk space used
Processing time: Time taken to perform the conversion
Compression ratio: Size reduction compared to baseline
Performance impact: Time increase compared to baseline
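The compression ratio and performance impact are both ratios against the baseline run. A minimal sketch of that arithmetic, using a hypothetical `compare` helper and the sample figures from the example output in this document:

```python
def compare(baseline, result):
    """Return (size ratio, time ratio) of a test result against the baseline.

    Each argument is a (size_mb, seconds) pair.
    """
    base_size, base_time = baseline
    size, seconds = result
    return base_size / size, seconds / base_time


baseline = (2.45, 1.23)      # no compression, no packing
zstd_level_1 = (0.85, 2.34)  # Blosc Zstd Level 1

smaller, slower = compare(baseline, zstd_level_1)
print(f"{smaller:.1f}x smaller, {slower:.1f}x slower")  # 2.9x smaller, 1.9x slower
```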
Example output (Theoretical):
Performance Analysis (Theoretical):
-----------------------------------
Analyzing compression and packing for variable: temperature
Original data: 2.45 MB (float64)
Theoretical Benefits:
-------------------
16-bit packing: 1.23 MB (2.0x smaller)
8-bit packing: 0.62 MB (4.0x smaller)
Typical Compression Ratios:
-------------------------
Blosc Zstd Level 1: 2-3x smaller (fast)
Blosc Zstd Level 3: 3-5x smaller (slower)
Blosc LZ4 Level 1: 2-2.5x smaller (very fast)
Blosc LZ4 Level 3: 2.5-3.5x smaller (fast)
Packing + Blosc Zstd: 5-10x smaller (combined benefits)
Performance Considerations:
-------------------------
Packing adds CPU overhead during conversion
Compression adds CPU overhead during read/write
Higher compression levels = more CPU overhead
Smaller chunks = more metadata overhead
Larger chunks = more memory usage during processing
Recommendation: Use --run-tests to measure real-world performance
for your specific data and use case.
Example output (Actual):
Performance Testing (Actual):
----------------------------
Testing compression and packing on variable: temperature
No compression, no packing: 2.45 MB in 1.23s
Packing 16-bit: 1.23 MB in 1.45s
Packing 8-bit: 0.62 MB in 1.67s
Blosc Zstd Level 1: 0.85 MB in 2.34s
Blosc Zstd Level 3: 0.65 MB in 3.45s
Blosc LZ4 Level 1: 0.92 MB in 1.78s
Packing + Blosc Zstd: 0.45 MB in 2.67s
Performance Comparison:
----------------------
No compression, no packing:
Size: 2.45 MB (1.0x smaller)
Time: 1.23s (1.0x slower)
Packing 16-bit:
Size: 1.23 MB (2.0x smaller)
Time: 1.45s (1.2x slower)
Packing 8-bit:
Size: 0.62 MB (4.0x smaller)
Time: 1.67s (1.4x slower)
Blosc Zstd Level 1:
Size: 0.85 MB (2.9x smaller)
Time: 2.34s (1.9x slower)
Blosc Zstd Level 3:
Size: 0.65 MB (3.8x smaller)
Time: 3.45s (2.8x slower)
Blosc LZ4 Level 1:
Size: 0.92 MB (2.7x smaller)
Time: 1.78s (1.4x slower)
Packing + Blosc Zstd:
Size: 0.45 MB (5.4x smaller)
Time: 2.67s (2.2x slower)
Interactive Mode#
When using the --interactive flag, the tool guides users through setting up a configuration:
Chunking Configuration: Select from recommended access patterns or specify custom chunks
Packing Configuration: Enable packing, choose bit width, and handle variables without valid ranges
Compression Configuration: Select from common compression options
Configuration Export: Save the configuration to a YAML file
Example Output#
zarrio Analysis Tool
==================================================
Analyzing file: sample.nc
Loading dataset...
Dataset loaded successfully!
Dataset Information:
--------------------
Dimensions: {'time': 100, 'lat': 180, 'lon': 360}
Variables: ['temperature', 'pressure']
Coordinates: ['time', 'lat', 'lon']
Data Type Information:
--------------------
temperature: float64 (8 bytes/element)
Shape: (100, 180, 360)
Size estimate: 49.44 MB
pressure: float64 (8 bytes/element)
Shape: (100, 180, 360)
Size estimate: 49.44 MB
Total dataset size estimate: 98.88 MB
Chunking Analysis:
----------------
Temporal Access Pattern:
Recommended chunks: {'time': 10, 'lat': 180, 'lon': 360}
Estimated chunk size: 2.47 MB
Notes: Optimized for time series analysis
Spatial Access Pattern:
Recommended chunks: {'time': 5, 'lat': 180, 'lon': 360}
Estimated chunk size: 1.24 MB
Notes: Optimized for spatial analysis
Balanced Access Pattern:
Recommended chunks: {'time': 10, 'lat': 180, 'lon': 360}
Estimated chunk size: 2.47 MB
Notes: Balanced for mixed access patterns
Packing Analysis:
----------------
Variables with valid range attributes: ['temperature']
Variables without valid range attributes: ['pressure']
Recommendation: Consider adding valid_min/valid_max attributes to variables
for optimal packing.
Compression Analysis:
--------------------
Common compression options:
1. blosc:zstd:1 - Fast compression, good balance
2. blosc:zstd:3 - Higher compression, slower
3. blosc:lz4:1 - Very fast compression
4. blosc:lz4:3 - Higher compression, slower
Performance Testing:
------------------
Testing compression and packing on variable: temperature
No compression, no packing: 2.45 MB in 1.23s
Packing 16-bit: 1.23 MB in 1.45s
Packing 8-bit: 0.62 MB in 1.67s
Blosc Zstd Level 1: 0.85 MB in 2.34s
Blosc Zstd Level 3: 0.65 MB in 3.45s
Blosc LZ4 Level 1: 0.92 MB in 1.78s
Packing + Blosc Zstd: 0.45 MB in 2.67s
Performance Comparison:
----------------------
No compression, no packing:
Size: 2.45 MB (1.0x smaller)
Time: 1.23s (1.0x slower)
Packing 16-bit:
Size: 1.23 MB (2.0x smaller)
Time: 1.45s (1.2x slower)
Packing 8-bit:
Size: 0.62 MB (4.0x smaller)
Time: 1.67s (1.4x slower)
Blosc Zstd Level 1:
Size: 0.85 MB (2.9x smaller)
Time: 2.34s (1.9x slower)
Blosc Zstd Level 3:
Size: 0.65 MB (3.8x smaller)
Time: 3.45s (2.8x slower)
Blosc LZ4 Level 1:
Size: 0.92 MB (2.7x smaller)
Time: 1.78s (1.4x slower)
Packing + Blosc Zstd:
Size: 0.45 MB (5.4x smaller)
Time: 2.67s (2.2x slower)
Best Practices#
Use Interactive Mode: For new users, the interactive mode provides guided setup
Consider Access Patterns: Choose the access pattern that matches your primary use case
Add Valid Ranges: Add valid_min/valid_max attributes to variables for optimal packing
Test Compression: Experiment with different compression options for your data
Use Performance Testing: Run with --run-tests to measure real-world size and timing benefits on your own data
Save Configurations: Save recommended configurations for reuse
Environment-Specific Recommendations#
Target chunk sizes can be optimized for different environments:
Local Development: 10-25 MB chunks
Production Servers: 50-100 MB chunks
Cloud Environments: 100-200 MB chunks
Use the --target-chunk-size-mb option to customize recommendations for your environment.
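To get a feel for what a target chunk size implies, the sketch below derives how many time steps fit in one chunk when each chunk keeps the full lat/lon grid. This is an illustrative calculation, not zarrio's algorithm; the function name, the 4-byte element size, and the choice to chunk only along time are all assumptions.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def time_chunk_length(target_mb, time_len, lat, lon, itemsize):
    """Largest number of time steps per chunk that stays within target_mb,
    keeping the full lat/lon extent in every chunk."""
    bytes_per_step = lat * lon * itemsize
    steps = int(target_mb * BYTES_PER_MB // bytes_per_step)
    # Use at least one step, and never more than the dimension holds.
    return max(1, min(steps, time_len))


# A 180 x 360 grid of 4-byte values is 259,200 bytes per time step.
print(time_chunk_length(10, time_len=1000, lat=180, lon=360, itemsize=4))  # 40
print(time_chunk_length(100, time_len=100, lat=180, lon=360, itemsize=4))  # 100
```

In the second call the target is larger than the whole variable, so the chunk length is capped at the length of the time dimension.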