Icechunk + Xarray
Icechunk was designed to work seamlessly with Xarray. Xarray users can read and write data to Icechunk using xarray.open_zarr and xarray.Dataset.to_zarr.
Warning
Using Xarray and Icechunk together currently requires installing Xarray >= 2024.11.0.
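A quick way to confirm that an environment meets this requirement is to compare the installed version against the minimum. The sketch below assumes the packaging library is importable (Xarray lists it as a dependency); the helper name xarray_is_new_enough is hypothetical and not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version

# Minimum Xarray release named in the warning above.
MIN_XARRAY = Version("2024.11.0")


def xarray_is_new_enough(installed: str) -> bool:
    """Return True if the given version string is at least MIN_XARRAY."""
    return Version(installed) >= MIN_XARRAY


try:
    installed = version("xarray")
except PackageNotFoundError:
    print("xarray is not installed")
else:
    status = "OK" if xarray_is_new_enough(installed) else "too old"
    print(f"xarray {installed}: {status}")
```

Comparing parsed Version objects rather than raw strings avoids mistakes such as treating "2024.9.0" as newer than "2024.11.0".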
In this example, we'll explain how to create a new Icechunk repo, write some sample data to it, and append a second block of data using Icechunk's version control features.
Create a new repo
Similar to the example in the quickstart, we'll create an Icechunk repo in S3 or on a local file system. You will need to replace the StorageConfig with a bucket or file path that you have access to.
Open tutorial dataset from Xarray
For this demo, we'll open Xarray's RASM tutorial dataset and split it into two blocks. We'll write the two blocks to Icechunk in separate transactions later in this example.
Note
Downloading Xarray tutorial data requires pooch and netCDF4. These can be installed with pip install pooch netCDF4.
import xarray as xr
ds = xr.tutorial.open_dataset('rasm')
ds1 = ds.isel(time=slice(None, 18)) # part 1: first 18 time steps
ds2 = ds.isel(time=slice(18, None)) # part 2: remaining time steps
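To check that the two slices really partition the time axis, the same split can be tried on a small synthetic dataset, which avoids the download. This is a minimal sketch: the array shape and values are made up, and only the time length of 36 mirrors the tutorial data.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the tutorial dataset (shape and values are made up).
ds = xr.Dataset(
    {"Tair": (("time", "y", "x"), np.random.default_rng(0).random((36, 4, 5)))},
    coords={"time": np.arange(36)},
)

ds1 = ds.isel(time=slice(None, 18))  # first 18 time steps
ds2 = ds.isel(time=slice(18, None))  # remaining time steps

# Concatenating the blocks along time should reproduce the original dataset,
# which shows that they neither overlap nor leave a gap.
roundtrip = xr.concat([ds1, ds2], dim="time")
print(ds1.sizes["time"], ds2.sizes["time"], roundtrip.equals(ds))
```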
Write Xarray data to Icechunk
Create a new writable session on the main branch to get the IcechunkStore:
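Writing Xarray data to Icechunk is as easy as calling Dataset.to_zarr with the session's store, for example ds1.to_zarr(session.store(), zarr_format=3, consolidated=False).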
Note
- Consolidated metadata is unnecessary (and unsupported) in Icechunk. Icechunk already organizes the dataset metadata in a way that makes it very fast to fetch from storage.
- zarr_format=3 is required until the default Zarr format changes in Xarray.
After writing, we commit the changes using the session:
Append to an existing store
Next, we want to add a second block of data to our store. Above, we created ds2 for just this reason. Again, we'll use Dataset.to_zarr, this time with append_dim='time'.
# we have to get a new session after committing
session = repo.writable_session("main")
ds2.to_zarr(session.store(), append_dim='time')
And then we'll commit the changes:
Reading data with Xarray
To read data stored in Icechunk with Xarray, we'll use xarray.open_zarr:
xr.open_zarr(store, consolidated=False)
# output: <xarray.Dataset> Size: 17MB
# Dimensions: (time: 36, y: 205, x: 275)
# Coordinates:
# * time (time) object 288B 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
# xc (y, x) float64 451kB dask.array<chunksize=(103, 275), meta=np.ndarray>
# yc (y, x) float64 451kB dask.array<chunksize=(103, 275), meta=np.ndarray>
# Dimensions without coordinates: y, x
# Data variables:
# Tair (time, y, x) float64 16MB dask.array<chunksize=(5, 103, 138), meta=np.ndarray>
# Attributes:
# NCO: netCDF Operators version 4.7.9 (Homepage = htt...
# comment: Output from the Variable Infiltration Capacity...
# convention: CF-1.4
# history: Fri Aug 7 17:57:38 2020: ncatted -a bounds,,d...
# institution: U.W.
# nco_openmp_thread_number: 1
# output_frequency: daily
# output_mode: averaged
# references: Based on the initial model of Liang et al., 19...
# source: RACM R1002RBRxaaa01a
# title: /workspace/jhamman/processed/R1002RBRxaaa01a/l...
We can also read data from previous snapshots by checking out prior versions:
store = repo.readable_session(snapshot_id='ME4VKFPA5QAY0B2YSG8G').store()
xr.open_zarr(store, consolidated=False)
# <xarray.Dataset> Size: 9MB
# Dimensions: (time: 18, y: 205, x: 275)
# Coordinates:
# xc (y, x) float64 451kB dask.array<chunksize=(103, 275), meta=np.ndarray>
# yc (y, x) float64 451kB dask.array<chunksize=(103, 275), meta=np.ndarray>
# * time (time) object 144B 1980-09-16 12:00:00 ... 1982-02-15 12:00:00
# Dimensions without coordinates: y, x
# Data variables:
# Tair (time, y, x) float64 8MB dask.array<chunksize=(5, 103, 138), meta=np.ndarray>
# Attributes:
# NCO: netCDF Operators version 4.7.9 (Homepage = htt...
# comment: Output from the Variable Infiltration Capacity...
# convention: CF-1.4
# history: Fri Aug 7 17:57:38 2020: ncatted -a bounds,,d...
# institution: U.W.
# nco_openmp_thread_number: 1
# output_frequency: daily
# output_mode: averaged
# references: Based on the initial model of Liang et al., 19...
# source: RACM R1002RBRxaaa01a
# title: /workspace/jhamman/processed/R1002RBRxaaa01a/l...
Notice that this second xarray.Dataset has a time dimension of length 18 whereas the first has a time dimension of length 36.
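Because the second commit only appended along time, the earlier snapshot should match the first 18 time steps of the latest one. A check along those lines might look like the sketch below. The helper is_time_prefix is hypothetical, and the two datasets are synthetic stand-ins for the snapshots opened above.

```python
import numpy as np
import xarray as xr


def is_time_prefix(older: xr.Dataset, newer: xr.Dataset, dim: str = "time") -> bool:
    """Return True if `older` equals the leading steps of `newer` along `dim`."""
    n = older.sizes[dim]
    if n > newer.sizes[dim]:
        return False
    return newer.isel({dim: slice(0, n)}).equals(older)


# Synthetic stand-ins for the two snapshots (values are made up).
newer = xr.Dataset(
    {"Tair": (("time", "y", "x"), np.random.default_rng(0).random((36, 4, 5)))},
    coords={"time": np.arange(36)},
)
older = newer.isel(time=slice(None, 18))

print(is_time_prefix(older, newer))
```

On real data, the two arguments would be the datasets returned by xr.open_zarr for the older snapshot and for the latest one.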
Next steps
For more details on how to use Xarray's Zarr integration, check out Xarray's documentation.