Icechunk + Xarray#
Icechunk was designed to work seamlessly with Xarray. Xarray users can read and write data to Icechunk using the xarray.open_zarr and icechunk.xarray.to_icechunk functions.
Warning
Using Xarray and Icechunk together currently requires installing Xarray >= 2025.1.1.
to_icechunk vs to_zarr#
xarray.Dataset.to_zarr and to_icechunk are nearly functionally identical.
In a distributed context, e.g. writes orchestrated with multiprocessing or a dask.distributed.Client and dask.array, you must use to_icechunk. This ensures that you can execute a commit that successfully records all remote writes. See these docs on orchestrating parallel writes and these docs on dask.array with distributed for more.
If using to_zarr, remember to set zarr_format=3, consolidated=False. Consolidated metadata is unnecessary (and unsupported) in Icechunk; Icechunk already organizes the dataset metadata in a way that makes it very fast to fetch from storage.
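For example, a to_zarr call against an Icechunk session might look like this (ds and session here are placeholders for your dataset and writable session):
# hypothetical example: writing with to_zarr instead of to_icechunk
ds.to_zarr(session.store, zarr_format=3, consolidated=False)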
In this example, we'll explain how to create a new Icechunk repo, write some sample data to it, and append a second block of data using Icechunk's version control features.
Create a new repo#
Similar to the example in quickstart, we'll create an Icechunk repo in S3 or a local file system. You will need to replace the storage configuration with a bucket or file path that you have access to.
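For example, a minimal sketch assuming the local-filesystem and S3 storage helpers available in recent icechunk-python releases (the path, bucket, prefix, and region are placeholders):
import icechunk

# local file system storage (path is a placeholder)
storage = icechunk.local_filesystem_storage("/tmp/icechunk-xarray-demo")
# or, for S3 (bucket, prefix, and region are placeholders):
# storage = icechunk.s3_storage(bucket="my-bucket", prefix="xarray-demo", region="us-east-1")

repo = icechunk.Repository.create(storage)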
Open tutorial dataset from Xarray#
For this demo, we'll open Xarray's RASM tutorial dataset and split it into two blocks. We'll write the two blocks to Icechunk in separate transactions later in this example.
Note
Downloading xarray tutorial data requires pooch and netCDF4. These can be installed with pip install pooch netCDF4.
import xarray as xr

ds = xr.tutorial.open_dataset('rasm')
ds1 = ds.isel(time=slice(None, 18))  # part 1
ds2 = ds.isel(time=slice(18, None))  # part 2
Write Xarray data to Icechunk#
Create a new writable session on the main branch to get the IcechunkStore:
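Using the repo created above:
# open a writable session on the main branch
session = repo.writable_session("main")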
Writing Xarray data to Icechunk is as easy as calling to_icechunk:
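For example, writing the first block (ds1):
from icechunk.xarray import to_icechunk

# write ds1 to the store attached to the writable session
to_icechunk(ds1, session)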
After writing, we commit the changes using the session:
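For example (the commit message is illustrative; commit returns the new snapshot ID, which we keep so we can revisit this version later):
# commit the first write and keep the resulting snapshot ID
first_snapshot = session.commit("wrote the first block of RASM data")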
Append to an existing store#
Next, we want to add a second block of data to our store. Above, we created ds2 for just this reason. Again, we'll use to_icechunk, this time with append_dim='time'.
# we have to get a new session after committing
session = repo.writable_session("main")
to_icechunk(ds2, session, append_dim='time')
And then we'll commit the changes:
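Again, with an illustrative commit message:
# commit the appended data as a new snapshot
session.commit("appended the second block of RASM data")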
Reading data with Xarray#
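To read the latest version, open a read-only session on the main branch and pass its store to xr.open_zarr (a sketch, assuming readonly_session accepts a branch argument as in the snapshot example below):
# open a read-only session at the tip of the main branch
session = repo.readonly_session(branch="main")
print(xr.open_zarr(session.store, consolidated=False))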
<xarray.Dataset> Size: 25MB
Dimensions: (time: 54, y: 205, x: 275)
Coordinates:
* time (time) object 432B 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (y, x) float64 451kB dask.array<chunksize=(103, 275), meta=np.ndarray>
yc (y, x) float64 451kB dask.array<chunksize=(103, 275), meta=np.ndarray>
Dimensions without coordinates: y, x
Data variables:
Tair (time, y, x) float64 24MB dask.array<chunksize=(9, 52, 138), meta=np.ndarray>
Attributes:
title: /workspace/jhamman/processed/R1002RBRxaaa01a/l...
institution: U.W.
source: RACM R1002RBRxaaa01a
output_frequency: daily
output_mode: averaged
convention: CF-1.4
references: Based on the initial model of Liang et al., 19...
comment: Output from the Variable Infiltration Capacity...
nco_openmp_thread_number: 1
NCO: netCDF Operators version 4.7.9 (Homepage = htt...
history: Fri Aug 7 17:57:38 2020: ncatted -a bounds,,d...
We can also read data from previous snapshots by checking out prior versions:
# read the store as of the first snapshot (before the append)
session = repo.readonly_session(snapshot_id=first_snapshot)
print(xr.open_zarr(session.store, consolidated=False))
<xarray.Dataset> Size: 17MB
Dimensions: (time: 36, y: 205, x: 275)
Coordinates:
yc (y, x) float64 451kB dask.array<chunksize=(103, 275), meta=np.ndarray>
* time (time) object 288B 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
xc (y, x) float64 451kB dask.array<chunksize=(103, 275), meta=np.ndarray>
Dimensions without coordinates: y, x
Data variables:
Tair (time, y, x) float64 16MB dask.array<chunksize=(9, 52, 138), meta=np.ndarray>
Attributes:
title: /workspace/jhamman/processed/R1002RBRxaaa01a/l...
institution: U.W.
source: RACM R1002RBRxaaa01a
output_frequency: daily
output_mode: averaged
convention: CF-1.4
references: Based on the initial model of Liang et al., 19...
comment: Output from the Variable Infiltration Capacity...
nco_openmp_thread_number: 1
NCO: netCDF Operators version 4.7.9 (Homepage = htt...
history: Fri Aug 7 17:57:38 2020: ncatted -a bounds,,d...
Notice that this second xarray.Dataset has a time dimension of length 36 whereas the first has a time dimension of length 54.
Next steps#
For more details on how to use Xarray's Zarr integration, check out Xarray's documentation.