Chunking Strategies
===================

zarrio provides intelligent chunking analysis and recommendations to optimize
performance for different access patterns. This document explains both the
conceptual approach and the detailed calculations used for each access pattern.

Understanding Chunking
----------------------

Chunking divides large datasets into smaller, more manageable pieces for
efficient storage and retrieval. In the Zarr format, chunks are the fundamental
unit of storage and access.

Why Chunking Matters:
~~~~~~~~~~~~~~~~~~~~~

1. **I/O Performance**: Chunks that align with your access patterns improve performance
2. **Memory Usage**: Smaller chunks reduce memory requirements
3. **Compression**: Larger chunks often compress better
4. **Parallel Processing**: Chunks are the unit of parallelization
5. **Network Transfer**: Appropriate chunk sizes optimize network I/O

Chunk Size Considerations:
~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Too Small**: Increases metadata overhead and reduces compression efficiency
- **Too Large**: Increases memory usage and reduces parallelization benefits
- **Just Right**: Balances all factors for optimal performance

Recommended Chunk Sizes:
~~~~~~~~~~~~~~~~~~~~~~~~

- **Minimum**: 1 MB per chunk
- **Target**: 10-100 MB per chunk
- **Maximum**: 100 MB per chunk

Configurable Target Chunk Size
------------------------------

zarrio allows you to configure the target chunk size for different environments:

.. code-block:: python

    from zarrio.chunking import get_chunk_recommendation

    # Configure target chunk size (default is 50 MB)
    recommendation = get_chunk_recommendation(
        dimensions={"time": 1000, "lat": 500, "lon": 1000},
        dtype_size_bytes=4,
        access_pattern="balanced",
        target_chunk_size_mb=100,  # 100 MB target chunks
    )

Environment-specific recommendations:

- **Local development**: 10-25 MB chunks
- **Production servers**: 50-100 MB chunks
- **Cloud environments**: 100-200 MB chunks

Configuration Methods:
~~~~~~~~~~~~~~~~~~~~~~

1. **Function Arguments**: ``get_chunk_recommendation(..., target_chunk_size_mb=100)``
2. **Environment Variables**: ``ZARRIFY_TARGET_CHUNK_SIZE_MB=200``
3. **ZarrConverter Configuration**:

   .. code-block:: python

       from zarrio.models import ZarrConverterConfig

       config = ZarrConverterConfig(target_chunk_size_mb=100)

Intelligent Chunking Analysis
-----------------------------

zarrio automatically analyzes your dataset and provides chunking
recommendations based on expected access patterns. The system performs
detailed calculations to optimize chunk sizes for your specific data
characteristics.

When creating templates for parallel processing with global start and end
times, zarrio can perform intelligent chunking based on the full archive
dimensions rather than just the template dataset. This ensures optimal
chunking for the entire archive.

The system analyzes:

- Dataset dimensions and sizes
- Data type and element size
- Expected access patterns
- Storage characteristics

Detailed calculations for each access pattern are explained in the
Access Pattern Optimization section below.
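Because the recommendation depends on the expected access pattern, a simple
workflow is to request one recommendation per candidate pattern and compare
them before converting. A minimal sketch, assuming the pattern identifiers
match the ``access_pattern`` values used in the examples below (check your
zarrio version for the exact accepted names):

.. code-block:: python

    from zarrio.chunking import get_chunk_recommendation

    # 10 years of daily 1° global data
    dimensions = {"time": 3650, "lat": 180, "lon": 360}

    # Request a recommendation per access pattern and compare the results
    for pattern in ("temporal_analysis", "spatial_analysis", "balanced"):
        recommendation = get_chunk_recommendation(
            dimensions=dimensions,
            dtype_size_bytes=4,        # float32
            access_pattern=pattern,
            target_chunk_size_mb=50,   # default target
        )
        print(pattern, recommendation)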
Access Pattern Optimization
---------------------------

Different access patterns require different chunking strategies. The
calculations for each pattern are detailed below.

Temporal Analysis
~~~~~~~~~~~~~~~~~

Optimized for time series extraction (specific locations over long time
periods):

.. code-block:: python

    # Temporal analysis optimized
    temporal_config = ZarrConverterConfig(
        chunking=ChunkingConfig(
            time=100,  # Large time chunks (e.g., 100 time steps)
            lat=30,    # Smaller spatial chunks
            lon=60,    # Smaller spatial chunks
        ),
        attrs={"access_pattern": "temporal_analysis"},
    )

    converter = ZarrConverter(config=temporal_config)
    converter.convert("climate_data.nc", "temporal_archive.zarr")

Benefits:

- Fewer I/O operations for long time series
- Efficient access to temporal data
- Good compression for time-aligned data

How it works:

- Calculates large time chunks (~10% of total time steps, capped at 100)
- Distributes the remaining space across spatial dimensions with a minimum of
  10 elements per dimension
- Optimizes for extracting long time series at specific locations

Detailed Calculation:

The temporal focus algorithm uses the following approach:

1. **Time Chunk Calculation**:
   ``time_chunk = min(100, max(10, time_dimension_size // 10))``

2. **Spatial Chunk Calculation**:
   ``target_elements = (target_chunk_size_mb * 1024²) / dtype_size_bytes``
   ``spatial_elements = target_elements / time_chunk``
   ``spatial_chunk_per_dim = spatial_elements^(1/num_spatial_dims)``
   ``spatial_chunk = max(10, min(spatial_chunk_per_dim, spatial_dimension_size))``

Example: For a dataset with 1000 time steps, 180 lat points and 360 lon points:

- time_chunk = min(100, max(10, 1000 // 10)) = 100
- spatial_elements = (50 * 1024² / 4) / 100 = 131,072
- spatial_chunk_per_dim = √131,072 ≈ 362
- lat_chunk = min(362, 180) = 180
- lon_chunk = min(362, 360) = 360

Result: time=100, lat=180, lon=360 chunks

Spatial Analysis
~~~~~~~~~~~~~~~~

Optimized for spatial analysis (maps at specific times):

.. code-block:: python

    # Spatial analysis optimized
    spatial_config = ZarrConverterConfig(
        chunking=ChunkingConfig(
            time=20,   # Smaller time chunks
            lat=100,   # Large spatial chunks
            lon=100,   # Large spatial chunks
        ),
        attrs={"access_pattern": "spatial_analysis"},
    )

    converter = ZarrConverter(config=spatial_config)
    converter.convert("climate_data.nc", "spatial_archive.zarr")

Benefits:

- Efficient spatial subsetting
- Better cache locality for spatial operations
- Optimized for map generation

How it works:

- Calculates small time chunks (~2% of total time steps, capped at 20)
- Distributes the remaining space across spatial dimensions with a minimum of
  50 elements per dimension
- Optimizes for extracting spatial maps at specific time steps

Detailed Calculation:

The spatial focus algorithm uses the following approach:

1. **Time Chunk Calculation**:
   ``time_chunk = min(20, max(5, time_dimension_size // 50))``

2. **Spatial Chunk Calculation**:
   ``target_elements = (target_chunk_size_mb * 1024²) / dtype_size_bytes``
   ``spatial_elements = target_elements / time_chunk``
   ``spatial_chunk_per_dim = spatial_elements^(1/num_spatial_dims)``
   ``spatial_chunk = max(50, min(spatial_chunk_per_dim, spatial_dimension_size))``

Example: For a dataset with 365 time steps, 720 lat points and 1440 lon points:

- time_chunk = min(20, max(5, 365 // 50)) = min(20, 7) = 7
- spatial_elements = (50 * 1024² / 4) / 7 ≈ 1,872,457
- spatial_chunk_per_dim = √1,872,457 ≈ 1,368
- lat_chunk = min(1,368, 720) = 720
- lon_chunk = min(1,368, 1440) = 1,368

Result: time=7, lat=720, lon=1368 chunks
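Both worked examples land below the 50 MB target because the spatial chunks
are clipped to the dataset extent. A quick way to sanity-check any
recommendation against the size guidelines is to compute the chunk size
directly (plain Python, 4-byte values assumed):

.. code-block:: python

    def chunk_size_mb(chunks, dtype_size_bytes=4):
        """Size of one chunk in MB: product of chunk lengths times bytes per element."""
        elements = 1
        for length in chunks.values():
            elements *= length
        return elements * dtype_size_bytes / 1024**2

    # Temporal-analysis worked example (time=100, lat=180, lon=360): ~24.7 MB
    print(round(chunk_size_mb({"time": 100, "lat": 180, "lon": 360}), 1))

    # Spatial-analysis worked example (time=7, lat=720, lon=1368): ~26.3 MB
    print(round(chunk_size_mb({"time": 7, "lat": 720, "lon": 1368}), 1))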
Balanced Approach
~~~~~~~~~~~~~~~~~

Good performance for mixed workloads:

.. code-block:: python

    # Balanced approach
    balanced_config = ZarrConverterConfig(
        chunking=ChunkingConfig(
            time=50,  # Moderate time chunks
            lat=50,   # Moderate spatial chunks
            lon=50,   # Moderate spatial chunks
        ),
        attrs={"access_pattern": "balanced"},
    )

    converter = ZarrConverter(config=balanced_config)
    converter.convert("climate_data.nc", "balanced_archive.zarr")

Benefits:

- Reasonable performance for diverse access patterns
- Good compromise between temporal and spatial access
- Suitable for exploratory analysis

How it works:

- Calculates moderate time chunks (~5% of total time steps, capped at 50)
- Distributes the remaining space across spatial dimensions with a minimum of
  30 elements per dimension
- Provides a balanced approach for mixed access patterns

Detailed Calculation:

The balanced algorithm uses the following approach:

1. **Time Chunk Calculation**:
   ``time_chunk = min(50, max(10, time_dimension_size // 20))``

2. **Spatial Chunk Calculation**:
   ``target_elements = (target_chunk_size_mb * 1024²) / dtype_size_bytes``
   ``spatial_elements = target_elements / time_chunk``
   ``spatial_chunk_per_dim = spatial_elements^(1/num_spatial_dims)``
   ``spatial_chunk = max(30, min(spatial_chunk_per_dim, spatial_dimension_size))``

Example: For a dataset with 1825 time steps, 360 lat points and 720 lon points:

- time_chunk = min(50, max(10, 1825 // 20)) = min(50, 91) = 50
- spatial_elements = (50 * 1024² / 4) / 50 = 262,144
- spatial_chunk_per_dim = √262,144 = 512
- lat_chunk = min(512, 360) = 360
- lon_chunk = min(512, 720) = 512

Result: time=50, lat=360, lon=512 chunks
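The three patterns above share the same structure and differ only in the
time-chunk bounds and the spatial minimum. The sketch below re-implements the
documented formulas in plain Python for experimentation; it mirrors the
calculations described in this section, not the internal zarrio code:

.. code-block:: python

    # Pattern parameters taken from the formulas above:
    # (time divisor, time cap, time minimum, spatial minimum)
    PATTERNS = {
        "temporal_analysis": (10, 100, 10, 10),
        "spatial_analysis": (50, 20, 5, 50),
        "balanced": (20, 50, 10, 30),
    }

    def recommend_chunks(dimensions, dtype_size_bytes=4, pattern="balanced",
                         target_chunk_size_mb=50, time_dim="time"):
        """Reproduce the documented calculation for a time + N spatial dims dataset."""
        divisor, cap, time_min, spatial_min = PATTERNS[pattern]
        time_chunk = min(cap, max(time_min, dimensions[time_dim] // divisor))

        target_elements = target_chunk_size_mb * 1024**2 / dtype_size_bytes
        spatial_dims = [d for d in dimensions if d != time_dim]
        per_dim = (target_elements / time_chunk) ** (1 / len(spatial_dims))

        chunks = {time_dim: time_chunk}
        for dim in spatial_dims:
            chunks[dim] = max(spatial_min, min(int(per_dim), dimensions[dim]))
        return chunks

    # Reproduces the spatial-analysis worked example:
    # {'time': 7, 'lat': 720, 'lon': 1368}
    print(recommend_chunks({"time": 365, "lat": 720, "lon": 1440},
                           pattern="spatial_analysis"))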
Special Cases
~~~~~~~~~~~~~

No Time Dimension:

When no time dimension is detected, the system distributes chunks evenly
across all dimensions:

- ``elements_per_dim = target_elements^(1/num_dimensions)``
- ``chunk_size = min(50, max(10, elements_per_dim))`` for balanced/normal access
- ``chunk_size = min(100, max(30, elements_per_dim))`` for spatial focus

Single Dimension:

For single-dimensional datasets, the system creates chunks of approximately
the target size:

- ``chunk_size = min(target_elements, dimension_size)``

Validation Rules
~~~~~~~~~~~~~~~~

The system validates all chunking recommendations and flags issues:

1. **Small Chunk Warning**: Chunks < 1 MB may cause metadata overhead
2. **Large Chunk Warning**: Chunks > 100 MB may cause memory issues
3. **Dimension Mismatch**: Chunk sizes larger than dimensions are clipped
4. **Inefficient Chunking**: Very small chunks in large dimensions trigger recommendations

Configuration Options
~~~~~~~~~~~~~~~~~~~~~

The target chunk size can be set per call
(``get_chunk_recommendation(..., target_chunk_size_mb=100)``), via the
``ZARRIFY_TARGET_CHUNK_SIZE_MB`` environment variable, or through
``ZarrConverterConfig(target_chunk_size_mb=100)``. See the Configurable Target
Chunk Size section above for details and environment-specific recommendations.

Chunking Recommendations by Resolution
--------------------------------------

Low Resolution (1° or coarser)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # For low-resolution global data (e.g., 1° global daily)
    low_res_config = ZarrConverterConfig(
        chunking=ChunkingConfig(
            time=200,  # Larger time chunks acceptable
            lat=90,    # Latitude chunks (~90° at 1° resolution)
            lon=180,   # Longitude chunks (~180° at 1° resolution)
        )
    )

Medium Resolution (0.25° to 1°)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # For medium-resolution regional data (e.g., 0.25° regional daily)
    med_res_config = ZarrConverterConfig(
        chunking=ChunkingConfig(
            time=100,  # Moderate time chunks
            lat=50,    # Latitude chunks (~12.5° at 0.25° resolution)
            lon=100,   # Longitude chunks (~25° at 0.25° resolution)
        )
    )

High Resolution (0.1° or finer)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # For high-resolution local data (e.g., 0.1° local hourly)
    high_res_config = ZarrConverterConfig(
        chunking=ChunkingConfig(
            time=50,  # Smaller time chunks to limit size
            lat=25,   # Latitude chunks (~2.5° at 0.1° resolution)
            lon=50,   # Longitude chunks (~5° at 0.1° resolution)
        )
    )

Chunking Validation
-------------------

zarrio validates user-provided chunking and warns about suboptimal
configurations:

.. code-block:: python

    # Suboptimal chunking that will generate warnings
    bad_config = ZarrConverterConfig(
        chunking=ChunkingConfig(
            time=1000,  # Too large
            lat=1,      # Too small
            lon=1,      # Too small
        )
    )

    converter = ZarrConverter(config=bad_config)
    # Logs warnings about:
    # - chunks (~0.004 MB for 4-byte data) below the minimum recommended size (1 MB)
    # - very small chunks (lat=1, lon=1) in large dimensions (inefficient chunking)

Best Practices
--------------

1. **Match Access Patterns**: Align chunks with your typical data access
2. **Consider Compression**: Larger chunks often compress better
3. **Balance Chunk Count**: Too many chunks increase metadata overhead
4. **Memory Constraints**: Ensure chunks fit comfortably in memory
5. **Storage Backend**: Consider the characteristics of your storage system

Example Best Practices:
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # Good chunking for climate data (daily data at 1° resolution)
    climate_config = ZarrConverterConfig(
        chunking=ChunkingConfig(
            time=100,  # About 3-4 months of daily data
            lat=50,    # About 50° of latitude at 1° resolution
            lon=100,   # About 100° of longitude at 1° resolution
        )
    )

    # Good chunking for high-resolution data (hourly data at 0.1° resolution)
    high_res_config = ZarrConverterConfig(
        chunking=ChunkingConfig(
            time=24,  # About 1 day of hourly data
            lat=25,   # About 2.5° of latitude at 0.1° resolution
            lon=50,   # About 5° of longitude at 0.1° resolution
        )
    )

    # Good chunking for very large dimensions
    large_dim_config = ZarrConverterConfig(
        chunking=ChunkingConfig(
            time=50,   # Balance I/O and memory
            lat=100,   # Larger chunks for better compression
            lon=100,   # Larger chunks for better compression
        )
    )

Chunking with Parallel Processing
---------------------------------

When using parallel processing, choose chunking that aligns with your
parallelization strategy:

.. code-block:: python

    # Template creation with chunking
    converter = ZarrConverter(
        config=ZarrConverterConfig(
            chunking=ChunkingConfig(
                time=100,  # Large time chunks for temporal analysis
                lat=50,    # Moderate spatial chunks
                lon=100,   # Moderate spatial chunks
            )
        )
    )

    # Create template for parallel writing
    converter.create_template(
        template_dataset=template_ds,
        output_path="parallel_archive.zarr",
        global_start="2020-01-01",
        global_end="2023-12-31",
        compute=False,
    )

    # Each parallel process writes to a different time region,
    # but with the same chunking strategy for consistency.
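When partitioning the work across processes, it helps to give each worker a
time range that starts and ends on chunk boundaries, so that no two workers
write to the same chunk. A minimal sketch of that bookkeeping, independent of
zarrio (the daily 2020-2023 axis matches the template above):

.. code-block:: python

    import pandas as pd

    # Full archive time axis matching the template above (daily, 2020-2023)
    times = pd.date_range("2020-01-01", "2023-12-31", freq="1D")
    time_chunk = 100  # must match ChunkingConfig(time=100)

    # Split the archive into chunk-aligned regions, one per worker
    regions = [
        (start, min(start + time_chunk, len(times)))
        for start in range(0, len(times), time_chunk)
    ]

    for start, stop in regions[:3]:
        print(f"time steps {start}-{stop - 1}: "
              f"{times[start].date()} to {times[stop - 1].date()}")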
Configuration File Example
--------------------------

YAML configuration with chunking recommendations:

.. code-block:: yaml

    # config.yaml
    chunking:
      time: 150   # About 5 months of daily data
      lat: 60     # About 60° of latitude at 1° resolution
      lon: 120    # About 120° of longitude at 1° resolution
    compression:
      method: blosc:zstd:2
      clevel: 2
    packing:
      enabled: true
      bits: 16
    variables:
      include:
        - temperature
        - pressure
        - humidity
      exclude:
        - unused_var
    attrs:
      title: YAML Config with Chunking
      version: 1.0
      access_pattern: balanced
    time:
      dim: time
      append_dim: time
      global_start: "2020-01-01"
      global_end: "2023-12-31"
      freq: "1D"

CLI Usage
---------

Command-line interface with chunking:

.. code-block:: bash

    # Convert with explicit chunking
    zarrio convert input.nc output.zarr \
        --chunking "time:100,lat:50,lon:100"

    # Convert with a configuration file
    zarrio convert input.nc output.zarr \
        --config config.yaml

    # Convert with automatic analysis
    zarrio convert input.nc output.zarr \
        --access-pattern balanced

Performance Monitoring
----------------------

Monitor chunking performance:

.. code-block:: python

    import logging

    # Enable performance logging
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )

    # Convert with verbose logging
    converter = ZarrConverter(
        config=ZarrConverterConfig(
            chunking=ChunkingConfig(time=100, lat=50, lon=100)
        )
    )
    converter.convert("input.nc", "output.zarr")

The logs will show:

- Chunk size information
- Compression ratios
- I/O performance metrics
- Memory usage statistics

Troubleshooting
---------------

Common chunking issues and solutions:

1. **Poor Performance**: Check whether chunks align with access patterns
2. **Memory Issues**: Reduce chunk sizes
3. **Metadata Overhead**: Increase chunk sizes
4. **Compression Problems**: Adjust chunk sizes for better ratios

Example Troubleshooting:
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    # Enable debug logging for chunking analysis
    import logging
    logging.basicConfig(level=logging.DEBUG)

    # Convert with verbose chunking analysis
    convert_to_zarr(
        "input.nc",
        "output.zarr",
        access_pattern="balanced",
        chunking={"time": 100, "lat": 50, "lon": 100},
    )

This will provide detailed information about:

- Chunk size calculations
- Memory usage estimates
- Compression effectiveness
- Performance recommendations

Practical Examples by Resolution
--------------------------------

The following examples show the results of the chunking calculations described
above for datasets of different resolutions, assuming 4-byte values and the
default 50 MB target:

Low Resolution (1° or coarser)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a global daily dataset with 10 years of data at 1° resolution:

- Dimensions: time=3650, lat=180, lon=360
- Temporal focus: time=100, lat=180, lon=360 (spatial chunks clipped to the dataset extent)
- Spatial focus: time=20, lat=180, lon=360
- Balanced: time=50, lat=180, lon=360

Medium Resolution (0.25° to 1°)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a regional daily dataset with 5 years of data at 0.25° resolution:

- Dimensions: time=1825, lat=720, lon=1440
- Temporal focus: time=100, lat=362, lon=362
- Spatial focus: time=20, lat=720, lon=809
- Balanced: time=50, lat=512, lon=512

High Resolution (0.1° or finer)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For a local hourly dataset with 1 year of data at 0.1° resolution:

- Dimensions: time=8760, lat=1800, lon=3600
- Temporal focus: time=100, lat=362, lon=362
- Spatial focus: time=20, lat=809, lon=809
- Balanced: time=50, lat=512, lon=512
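After a conversion, you can confirm the chunking that actually landed on disk
by opening the archive with xarray, independently of zarrio. The store path
and variable name below are placeholders:

.. code-block:: python

    import xarray as xr

    # Open a converted archive (placeholder path)
    ds = xr.open_zarr("balanced_archive.zarr")

    # Mapping of dimension name -> block sizes used by the store
    print(ds.chunks)

    # Per-variable chunk shape recorded by the zarr backend (placeholder variable name)
    print(ds["temperature"].encoding.get("chunks"))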