icechunk.xarray#

Xarray integration for writing datasets to Icechunk.

icechunk.xarray #

Functions:

Name	Description
`to_icechunk`	Write an Xarray object to a group of an Icechunk store.

to_icechunk #

to_icechunk(
    obj,
    session,
    *,
    group=None,
    mode=None,
    safe_chunks=True,
    align_chunks=False,
    append_dim=None,
    region=None,
    encoding=None,
    chunkmanager_store_kwargs=None,
    split_every=None,
)

Write an Xarray object to a group of an Icechunk store.

Parameters:

Name	Type	Description	Default
`obj`	`DataArray \| Dataset`	Xarray object to write	required
`session`	`Session`	Writable Icechunk Session	required
`mode`	`"w", "w-", "a", "a-", r+", None`	Persistence mode: "w" means create (remove old if exists and write new); "w-" means create (fail if exists); "a" means override all existing variables including dimension coordinates (create if does not exist); "a-" means only append those variables that have `append_dim`. "r+" means modify existing array values only (raise an error if any metadata or shapes would change). The default mode is "a" if `append_dim` is set. Otherwise, it is "r+" if `region` is set and `w-` otherwise. .. note:: When modifying an existing Zarr array that is lazily opened, the "w" behavior can be surprising since the underlying file that is being lazily read from might get deleted before the data is computed.	`"w"`
`group`	`str`	Group path. (a.k.a. `path` in zarr terminology.)	`None`
`encoding`	`dict`	Nested dictionary with variable names as keys and dictionaries of variable specific encodings as values, e.g., `{"my_variable": {"dtype": "int16", "scale_factor": 0.1,}, ...}`	`None`
`append_dim`	`hashable`	If set, the dimension along which the data will be appended. All other dimensions on overridden variables must remain the same size.	`None`
`region`	`dict or auto`	Optional mapping from dimension names to either a) `"auto"`, or b) integer slices, indicating the region of existing zarr array(s) in which to write this dataset's data. If `"auto"` is provided the existing store will be opened and the region inferred by matching indexes. `"auto"` can be used as a single string, which will automatically infer the region for all dimensions, or as dictionary values for specific dimensions mixed together with explicit slices for other dimensions. Alternatively integer slices can be provided; for example, `{'x': slice(0, 1000), 'y': slice(10000, 11000)}` would indicate that values should be written to the region `0:1000` along `x` and `10000:11000` along `y`. Users are expected to ensure that the specified region aligns with Zarr chunk boundaries, and that dask chunks are also aligned. Xarray makes limited checks that these multiple chunk boundaries line up. It is possible to write incomplete chunks and corrupt the data with this option if you are not careful.	`None`
`safe_chunks`	`bool`	If True, only allow writes to when there is a many-to-one relationship between Zarr chunks (specified in encoding) and Dask chunks. Set False to override this restriction; however, data may become corrupted if Zarr arrays are written in parallel. In addition to the many-to-one relationship validation, it also detects partial chunks writes when using the region parameter, these partial chunks are considered unsafe in the mode "r+" but safe in the mode "a". Note: Even with these validations it can still be unsafe to write two or more chunked arrays in the same location in parallel if they are not writing in independent regions.	`True`
`align_chunks`	`bool`	If True, rechunks the Dask array to align with Zarr chunks before writing. This ensures each Dask chunk maps to one or more contiguous Zarr chunks, which avoids race conditions. Internally, the process sets safe_chunks=False and tries to preserve the original Dask chunking as much as possible. Note: While this alignment avoids write conflicts stemming from chunk boundary misalignment, it does not protect against race conditions if multiple uncoordinated processes write to the same Zarr array concurrently.	`False`
`chunkmanager_store_kwargs`	`dict`	Additional keyword arguments passed on to the `ChunkManager.store` method used to store chunked arrays. For example for a dask array additional kwargs will be passed eventually to `dask.array.store()`. Experimental API that should not be relied upon.	`None`
`split_every`	`int \| None`	Number of tasks to merge at every level of the tree reduction.	`None`

Returns:

Type	Description
`None`

Notes

Two restrictions apply to the use of region:

If region is set, all variables in a dataset must have at least one dimension in common with the region. Other variables should be written in a separate single call to to_icechunk().
Dimensions cannot be included in both region and append_dim at the same time. To create empty arrays to fill in with region, use the _XarrayDatasetWriter directly.

Source code in icechunk-python/python/icechunk/xarray.py

def to_icechunk(
    obj: DataArray | Dataset,
    session: Session,
    *,
    group: str | None = None,
    mode: ZarrWriteModes | None = None,
    safe_chunks: bool = True,
    align_chunks: bool = False,
    append_dim: Hashable | None = None,
    region: Region = None,
    encoding: Mapping[Any, Any] | None = None,
    chunkmanager_store_kwargs: MutableMapping[Any, Any] | None = None,
    split_every: int | None = None,
) -> None:
    """
    Write an Xarray object to a group of an Icechunk store.

    Parameters
    ----------
    obj: DataArray or Dataset
        Xarray object to write
    session : icechunk.Session
        Writable Icechunk Session
    mode : {"w", "w-", "a", "a-", r+", None}, optional
        Persistence mode:

        - "w" means create (remove old if exists and write new);
        - "w-" means create (fail if exists);
        - "a" means override all existing variables including dimension coordinates (create if does not exist);
        - "a-" means only append those variables that have ``append_dim``.
        - "r+" means modify existing array *values* only (raise an error if
          any metadata or shapes would change).

        The default mode is "a" if ``append_dim`` is set. Otherwise, it is
        "r+" if ``region`` is set and ``w-`` otherwise.

        .. note::
            When modifying an existing Zarr array that is lazily opened, the "w"
            behavior can be surprising since the underlying file that is being
            lazily read from might get deleted before the data is computed.
    group : str, optional
        Group path. (a.k.a. `path` in zarr terminology.)
    encoding : dict, optional
        Nested dictionary with variable names as keys and dictionaries of
        variable specific encodings as values, e.g.,
        ``{"my_variable": {"dtype": "int16", "scale_factor": 0.1,}, ...}``
    append_dim : hashable, optional
        If set, the dimension along which the data will be appended. All
        other dimensions on overridden variables must remain the same size.
    region : dict or "auto", optional
        Optional mapping from dimension names to either a) ``"auto"``, or b) integer
        slices, indicating the region of existing zarr array(s) in which to write
        this dataset's data.

        If ``"auto"`` is provided the existing store will be opened and the region
        inferred by matching indexes. ``"auto"`` can be used as a single string,
        which will automatically infer the region for all dimensions, or as
        dictionary values for specific dimensions mixed together with explicit
        slices for other dimensions.

        Alternatively integer slices can be provided; for example, ``{'x': slice(0,
        1000), 'y': slice(10000, 11000)}`` would indicate that values should be
        written to the region ``0:1000`` along ``x`` and ``10000:11000`` along
        ``y``.

        Users are expected to ensure that the specified region aligns with
        Zarr chunk boundaries, and that dask chunks are also aligned.
        Xarray makes limited checks that these multiple chunk boundaries line up.
        It is possible to write incomplete chunks and corrupt the data with this
        option if you are not careful.
    safe_chunks : bool, default: True
        If True, only allow writes to when there is a many-to-one relationship
        between Zarr chunks (specified in encoding) and Dask chunks.
        Set False to override this restriction; however, data may become corrupted
        if Zarr arrays are written in parallel.
        In addition to the many-to-one relationship validation, it also detects partial
        chunks writes when using the region parameter,
        these partial chunks are considered unsafe in the mode "r+" but safe in
        the mode "a".
        Note: Even with these validations it can still be unsafe to write
        two or more chunked arrays in the same location in parallel if they are
        not writing in independent regions.
    align_chunks: bool, default False
        If True, rechunks the Dask array to align with Zarr chunks before writing.
        This ensures each Dask chunk maps to one or more contiguous Zarr chunks,
        which avoids race conditions.
        Internally, the process sets safe_chunks=False and tries to preserve
        the original Dask chunking as much as possible.
        Note: While this alignment avoids write conflicts stemming from chunk
        boundary misalignment, it does not protect against race conditions
        if multiple uncoordinated processes write to the same
        Zarr array concurrently.
    chunkmanager_store_kwargs : dict, optional
        Additional keyword arguments passed on to the `ChunkManager.store` method used to store
        chunked arrays. For example for a dask array additional kwargs will be passed eventually to
        `dask.array.store()`. Experimental API that should not be relied upon.
    split_every: int, optional
        Number of tasks to merge at every level of the tree reduction.

    Returns
    -------
    None

    Notes
    -----
    Two restrictions apply to the use of ``region``:

      - If ``region`` is set, _all_ variables in a dataset must have at
        least one dimension in common with the region. Other variables
        should be written in a separate single call to ``to_icechunk()``.
      - Dimensions cannot be included in both ``region`` and
        ``append_dim`` at the same time. To create empty arrays to fill
        in with ``region``, use the `_XarrayDatasetWriter` directly.
    """

    as_dataset = _make_dataset(obj)

    writer = _XarrayDatasetWriter(
        as_dataset,
        store=session.store,
        safe_chunks=safe_chunks,
        align_chunks=align_chunks,
    )

    writer._open_group(group=group, mode=mode, append_dim=append_dim, region=region)

    # write metadata
    writer.write_metadata(encoding)
    # write in-memory arrays
    writer.write_eager()

    if is_dask_collection(obj):
        # to offer maximum flexibility to the user; we fork here purely for distributed writes.
        # For example, this allows multiple resizes and distributed writes in the same session
        # where we previously forced a user to commit.
        # See test_distributed_append.py for an example.
        fork: ForkSession = session.fork()
        writer.writer.targets = [
            # Any new arrays have already been created in the `write_metadata` step above
            # so now we just use `mode="a"`.
            zarr.open_array(fork.store, path=target.path, mode="a")
            for target in writer.writer.targets
        ]

        # eagerly write dask arrays
        forked_session = writer.write_lazy(
            chunkmanager_store_kwargs=chunkmanager_store_kwargs,
            split_every=split_every,
        )
        session.merge(forked_session)