Datamesh Integration#
The zarrio library supports integration with Oceanum’s Datamesh platform, allowing you to write Zarr data directly to Datamesh datasources.
Configuration#
To use Datamesh integration, you need to configure the datamesh section in your ZarrConverterConfig:
from zarrio import ZarrConverterConfig
config = ZarrConverterConfig(
    datamesh={
        "datasource": {
            "id": "my_datasource",
            "name": "My Data",
            "description": "My dataset",
            "coordinates": {"x": "longitude", "y": "latitude", "t": "time"},
            "details": "https://example.com",
            "tags": ["zarrio", "datamesh"]
        },
        "token": "your_datamesh_token",
        "service": "https://datamesh-v1.oceanum.io"  # Optional, defaults to this value
    }
)
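The token is hard-coded here for brevity; in practice, read it from an environment variable, as shown in the Example Workflow below.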
Using with the API#
from zarrio import ZarrConverter
# Create converter with datamesh configuration
converter = ZarrConverter(config=config)
# Convert data to datamesh (output_path is optional when using datamesh)
converter.convert("input.nc")
Using with the CLI#
You can also use the CLI with datamesh:
# Convert to datamesh datasource
zarrio convert input.nc \
  --datamesh-datasource '{"id":"my_datasource","name":"My Data","coordinates":{"x":"longitude","y":"latitude","t":"time"}}' \
  --datamesh-token $DATAMESH_TOKEN

# Create template for parallel writing
zarrio create-template template.nc \
  --datamesh-datasource '{"id":"my_datasource","name":"My Data","coordinates":{"x":"longitude","y":"latitude","t":"time"}}' \
  --datamesh-token $DATAMESH_TOKEN \
  --global-start 2023-01-01 \
  --global-end 2023-12-31

# Write region to datamesh datasource
zarrio write-region data.nc \
  --datamesh-datasource '{"id":"my_datasource","name":"My Data","coordinates":{"x":"longitude","y":"latitude","t":"time"}}' \
  --datamesh-token $DATAMESH_TOKEN
Datamesh Datasource Configuration#
The DatameshDatasource model supports the following fields:
id (required): The unique identifier for the datasource
name: Human-readable name for the datasource
description: Description of the datasource
coordinates: Coordinate mapping (e.g., {"x": "longitude", "y": "latitude", "t": "time"})
details: URL with more details about the datasource
tags: Tags associated with the datasource
driver: Driver to use for the datamesh datasource (defaults to "vzarr")
dataschema: Explicit schema for the datasource
geometry: Explicit geometry for the datasource
tstart: Explicit start time for the datasource
tend: Explicit end time for the datasource
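Putting the fields together, a fully specified datasource might look like the following (values are illustrative, drawn from the examples on this page; only id is required):
datasource = {
    "id": "my_datasource",  # required: unique identifier
    "name": "My Data",
    "description": "My dataset",
    "coordinates": {"x": "longitude", "y": "latitude", "t": "time"},
    "details": "https://example.com",
    "tags": ["zarrio", "datamesh"],
    "driver": "vzarr",  # the default
    "dataschema": {"custom": "schema"},  # overrides the auto-generated schema
    "geometry": {"type": "Point", "coordinates": [0, 0]},  # overrides the auto-generated geometry
    "tstart": "2023-01-01T00:00:00",
    "tend": "2023-12-31T23:59:59"
}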
Installation#
To use datamesh functionality, you need to install the optional datamesh dependencies:
pip install zarrio[datamesh]
Advanced Usage#
Configuration File Support#
You can also configure datamesh integration using YAML or JSON configuration files:
# config.yaml
chunking:
  time: 100
  lat: 50
  lon: 100
compression:
  method: blosc:zstd:3
datamesh:
  datasource:
    id: my_climate_data
    name: My Climate Data
    description: Climate data converted with zarrio
    coordinates:
      x: lon
      y: lat
      t: time
    details: https://example.com
    tags:
      - climate
      - zarrio
      - datamesh
  token: your_datamesh_token
  service: https://datamesh-v1.oceanum.io
# Load from YAML file
converter = ZarrConverter.from_config_file("config.yaml")
converter.convert("input.nc")
# Use with CLI
zarrio convert input.nc --config config.yaml
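JSON works the same way; the config.yaml above translates directly to a config.json:
{
  "chunking": {"time": 100, "lat": 50, "lon": 100},
  "compression": {"method": "blosc:zstd:3"},
  "datamesh": {
    "datasource": {
      "id": "my_climate_data",
      "name": "My Climate Data",
      "description": "Climate data converted with zarrio",
      "coordinates": {"x": "lon", "y": "lat", "t": "time"},
      "details": "https://example.com",
      "tags": ["climate", "zarrio", "datamesh"]
    },
    "token": "your_datamesh_token",
    "service": "https://datamesh-v1.oceanum.io"
  }
}
# Load it exactly like the YAML version
converter = ZarrConverter.from_config_file("config.json")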
Metadata Management#
When writing to datamesh, zarrio automatically manages metadata:
Schema: Automatically generated from the dataset structure
Geometry: Calculated from coordinate variables
Time Range: Extracted from the time dimension
You can override any of these by explicitly setting them in your configuration:
config = ZarrConverterConfig(
    datamesh={
        "datasource": {
            "id": "my_datasource",
            "name": "My Data",
            "dataschema": {"custom": "schema"},  # Override auto-generated schema
            "geometry": {"type": "Point", "coordinates": [0, 0]},  # Override auto-generated geometry
            "tstart": "2023-01-01T00:00:00",  # Override auto-detected start time
            "tend": "2023-12-31T23:59:59"  # Override auto-detected end time
        },
        "token": "your_datamesh_token"
    }
)
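The point override above is written in GeoJSON notation; assuming the geometry field accepts other GeoJSON types, a bounding-box polygon is often a better fit for gridded data (illustrative):
"geometry": {
    "type": "Polygon",
    "coordinates": [[[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]]
}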
Parallel Writing with Datamesh#
All parallel writing features work with datamesh:
from zarrio import ZarrConverter
import xarray as xr

# Create converter with datamesh configuration
converter = ZarrConverter(
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression={"method": "blosc:zstd:3"},
    datamesh={
        "datasource": {
            "id": "my_parallel_data",
            "name": "Parallel Climate Data",
            "coordinates": {"x": "lon", "y": "lat", "t": "time"}
        },
        "token": "your_datamesh_token"
    }
)

# Create template covering full time range
template_ds = xr.open_dataset("template.nc")
converter.create_template(
    template_dataset=template_ds,
    global_start="2020-01-01",
    global_end="2023-12-31",
    compute=False  # Metadata only
)
# Write regions in parallel processes
converter.write_region("data_2020.nc") # Process 1
converter.write_region("data_2021.nc") # Process 2
converter.write_region("data_2022.nc") # Process 3
converter.write_region("data_2023.nc") # Process 4
# CLI equivalent
zarrio create-template template.nc \
  --datamesh-datasource '{"id":"my_parallel_data","name":"Parallel Climate Data","coordinates":{"x":"lon","y":"lat","t":"time"}}' \
  --datamesh-token $DATAMESH_TOKEN \
  --global-start 2020-01-01 \
  --global-end 2023-12-31
# In parallel processes:
zarrio write-region data_2020.nc --datamesh-token $DATAMESH_TOKEN # Process 1
zarrio write-region data_2021.nc --datamesh-token $DATAMESH_TOKEN # Process 2
zarrio write-region data_2022.nc --datamesh-token $DATAMESH_TOKEN # Process 3
zarrio write-region data_2023.nc --datamesh-token $DATAMESH_TOKEN # Process 4
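The per-year writes above are independent of each other, so a single driver script can fan them out. Below is a minimal sketch using Python's standard library; it assumes a ZarrConverter can be constructed independently in each worker process, which matches the per-process usage shown above:
from concurrent.futures import ProcessPoolExecutor

from zarrio import ZarrConverter

def write_one(path):
    # Each worker process builds its own converter with the same settings
    # as the parallel example above, then writes its region independently.
    converter = ZarrConverter(
        chunking={"time": 100, "lat": 50, "lon": 100},
        compression={"method": "blosc:zstd:3"},
        datamesh={
            "datasource": {
                "id": "my_parallel_data",
                "name": "Parallel Climate Data",
                "coordinates": {"x": "lon", "y": "lat", "t": "time"},
            },
            "token": "your_datamesh_token",
        },
    )
    converter.write_region(path)

if __name__ == "__main__":
    files = ["data_2020.nc", "data_2021.nc", "data_2022.nc", "data_2023.nc"]
    with ProcessPoolExecutor(max_workers=4) as pool:
        # map() is lazy; wrapping in list() surfaces worker exceptions here.
        list(pool.map(write_one, files))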
Error Handling#
Datamesh integration raises zarrio's standard exceptions, so errors can be handled the same way as any other conversion:
from zarrio.exceptions import ConversionError
try:
    converter.convert("input.nc")
except ConversionError as e:
    print(f"Conversion failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Best Practices#
Token Management: Store your datamesh token securely, preferably as an environment variable (see the sketch after this list):
export DATAMESH_TOKEN="your_actual_token_here"
zarrio convert input.nc --datamesh-token $DATAMESH_TOKEN ...
Datasource IDs: Use descriptive, unique IDs for your datasources.
Coordinate Mapping: Ensure your coordinate mapping matches your dataset’s variable names.
Testing: Test your configuration with small datasets before processing large amounts of data.
Metadata: Provide meaningful names, descriptions, and tags to make your datasources discoverable.
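For the token-management point above, a minimal sketch of reading the token from the environment and failing fast when it is missing (using only the ZarrConverterConfig API shown on this page):
import os

from zarrio import ZarrConverterConfig

# Fail fast if the token is not set, rather than sending an empty credential.
token = os.getenv("DATAMESH_TOKEN")
if not token:
    raise RuntimeError("DATAMESH_TOKEN is not set; export it before running.")

config = ZarrConverterConfig(
    datamesh={
        "datasource": {"id": "my_datasource", "name": "My Data"},
        "token": token,  # never hard-code tokens in source or config files
    }
)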
Example Workflow#
Here’s a complete example workflow:
import os
from zarrio import ZarrConverter, ZarrConverterConfig
# 1. Configure for datamesh
config = ZarrConverterConfig(
    chunking={"time": 100, "lat": 50, "lon": 100},
    compression={"method": "blosc:zstd:3"},
    packing={"enabled": True, "bits": 16},
    datamesh={
        "datasource": {
            "id": "climate_data_2023",
            "name": "Climate Data 2023",
            "description": "Global climate data for 2023",
            "coordinates": {"x": "longitude", "y": "latitude", "t": "time"},
            "details": "https://example.com/climate-data-2023",
            "tags": ["climate", "2023", "global", "temperature", "pressure"]
        },
        "token": os.getenv("DATAMESH_TOKEN"),  # Use environment variable
        "service": "https://datamesh-v1.oceanum.io"
    }
)
# 2. Create converter
converter = ZarrConverter(config=config)
# 3. Convert data
converter.convert("climate_data_2023.nc")
print("Data successfully written to datamesh!")