How To: Common Icechunk Operations#

This page gathers common Icechunk operations into one compact how-to guide. It is not intended as a deep explanation of how Icechunk works.

Creating and Opening Repos#

Creating or opening a repo requires a Storage object. See the Storage guide for all the details.

Create a New Repo#

storage = icechunk.s3_storage(bucket="my-bucket", prefix="my-prefix", from_env=True)
repo = icechunk.Repository.create(storage)

Open an Existing Repo#

repo = icechunk.Repository.open(storage)

Specify Custom Config when Opening a Repo#

There are many configuration options available to control the behavior of the repository and the storage backend. See Configuration for all the details.

config = icechunk.RepositoryConfig.default()
config.caching = icechunk.CachingConfig(num_bytes_chunks=100_000_000)
repo = icechunk.Repository.open(storage, config=config)

Set Repo Metadata#

If you manage a number of Icechunk repositories, it could be useful to classify them using metadata. Icechunk allows you to set and retrieve arbitrary JSON-like metadata at the repository level.

repo = icechunk.Repository.open(...)
repo.set_metadata(dict(test=True, team="science"))
repo.update_metadata(dict(number_of_bugs=42))
print(repo.get_metadata())

Deleting a Repo#

Icechunk doesn't provide a way to delete a repo once it has been created. If you need to delete a repo, go to the underlying storage and remove the directory (or object-store prefix) where you created the repo.
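For a repo stored on a local filesystem, the removal can be sketched as below. This is a minimal example, not an Icechunk API: `delete_local_repo` is a hypothetical helper, and it assumes the repo lives in a plain local directory. For object stores such as S3, the prefix would instead be deleted with the provider's own tools.

```python
import shutil
from pathlib import Path

def delete_local_repo(repo_path):
    """Permanently remove the directory backing a local-filesystem repo.

    Hypothetical helper, not part of Icechunk. The deletion cannot be undone.
    """
    path = Path(repo_path)
    if not path.is_dir():
        raise FileNotFoundError(f"no repo directory at {path}")
    # Removes the directory and everything stored under it
    shutil.rmtree(path)
```

The explicit directory check is there so that a mistyped path fails loudly instead of silently doing nothing.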

Reading, Writing, and Modifying Data with Zarr#

For a full walkthrough, see the Quickstart.

Read and write operations occur within the context of a transaction. The general pattern is:

session = repo.writable_session(branch="main")
# interact with the repo via session.store
# ...
session.commit(message="wrote some data")

Info

The examples below show only the interaction with the store object. Keep in mind that every writable session must be concluded with a .commit() for its changes to be saved.

Alternatively, you can also use the .transaction function as a context manager, which automatically commits when the context exits.

with repo.transaction(branch="main", message="wrote some data") as store:
    # interact with the repo via store
    ...

Create a Group#

group = zarr.create_group(session.store, path="my-group", zarr_format=3)

Create an Array#

array = group.create("my_array", shape=(10, 20), dtype='int32')

Write Data to an Array#

array[2:5, :10] = 1

Read Data from an Array#

data = array[:5, :10]

Resize an Array#

array.resize((20, 30))

Add or Modify Array / Group Attributes#

array.attrs["standard_name"] = "time"

View Array / Group Attributes#

dict(array.attrs)

Delete a Group#

del group["subgroup"]

Delete an Array#

del group["array"]

Reading and Writing Data with Xarray#

For more depth, see Xarray, Parallel writes, and Dask.

Write an in-memory Xarray Dataset#

ds.to_zarr(session.store, group="my-group", zarr_format=3, consolidated=False)

Append to an Existing Dataset#

ds.to_zarr(session.store, group="my-group", append_dim='time', consolidated=False)

Write an Xarray dataset with Dask#

Writing with Dask or any other parallel execution framework requires special care. See Parallel writes and Xarray for more detail.

from icechunk.xarray import to_icechunk
to_icechunk(ds, session)

Read a dataset with Xarray#

Reading can be done with a read-only session.

session = repo.readonly_session("main")
ds = xr.open_zarr(session.store, group="my-group", zarr_format=3, consolidated=False)

Transactions and Version Control#

For more depth, see Transactions and Version Control.

Create a Snapshot via a Transaction#

snapshot_id = session.commit("commit message")

Resolve a Commit Conflict#

If the other commits do not actually conflict with your changes, rebase and commit again:

try:
    session.commit("commit message")
except icechunk.ConflictError:
    session.rebase(icechunk.ConflictDetector())
    session.commit("committed after rebasing")

If the commits do conflict and you want your changes to overwrite the others on conflicting chunks:

try:
    session.commit("commit message")
except icechunk.ConflictError:
    session.rebase(icechunk.BasicConflictSolver(on_chunk_conflict=icechunk.VersionSelection.UseOurs))
    session.commit("committed after rebasing")
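The try/except pattern above handles a single conflict. Where several writers compete, a bounded retry loop may be useful. The following is a minimal sketch: `commit_with_rebase` is a hypothetical helper, not part of Icechunk, and it takes the conflict exception class as a parameter (`icechunk.ConflictError` in the examples above). It relies only on the `commit` and `rebase` methods already shown.

```python
def commit_with_rebase(session, message, solver, conflict_error, max_attempts=3):
    """Commit, rebasing with `solver` after each conflict, up to `max_attempts` tries.

    Hypothetical helper, not part of Icechunk. `conflict_error` is the exception
    class that signals a commit conflict.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return session.commit(message)
        except conflict_error:
            if attempt == max_attempts:
                raise  # give up and surface the last conflict
            # Bring the session up to date before the next attempt
            session.rebase(solver)

# Assumed usage, mirroring the examples above:
# commit_with_rebase(session, "commit message",
#                    icechunk.ConflictDetector(), icechunk.ConflictError)
```

Capping the number of attempts keeps a persistently contended branch from retrying forever.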

Commit with Automatic Rebasing#

This automatically retries the commit, rebasing each time, until it succeeds:

session.commit("commit message", rebase_with=icechunk.ConflictDetector())

List Snapshots#

for snapshot in repo.ancestry(branch="main"):
    print(snapshot)

Check out a Snapshot#

session = repo.readonly_session(snapshot_id=snapshot_id)

Amend a Snapshot#

For more, see amending.

session = repo.writable_session("branch_name")
# make changes
session.amend(message="...")

Create an empty snapshot#

session = repo.writable_session("branch_name")
# no changes
session.commit(message="...", metadata={"foo": "bar"}, allow_empty=True)

Create a Branch#

repo.create_branch("dev", snapshot_id=snapshot_id)

List all Branches#

branches = repo.list_branches()

Check out a Branch#

session = repo.writable_session("dev")

Reset a Branch to a Different Snapshot#

repo.reset_branch("dev", snapshot_id=snapshot_id)

Create a Tag#

repo.create_tag("v1.0.0", snapshot_id=snapshot_id)

List all Tags#

tags = repo.list_tags()

Check out a Tag#

session = repo.readonly_session(tag="v1.0.0")

Delete a Tag#

repo.delete_tag("v1.0.0")

Diff Two Versions#

diff = repo.diff(from_tag="v1.0.0", to_branch="main")

Moving Chunks and Nodes#

For more depth, see Moving Chunks and Moving and Renaming Nodes.

Shift All Chunks by a Fixed Offset#

Offsets are in chunks, not array elements. Chunks shifted out of bounds are discarded; vacated positions are reset to the fill value.

session.shift_array("/my_array", offset=(-2, 0))
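Because the offset counts chunks, a shift expressed in array elements has to be converted first. A minimal sketch of that arithmetic follows; `element_shift_to_chunk_offset` is a hypothetical helper (not part of Icechunk), and the chunk shape of 10 x 5 is an assumed example value.

```python
def element_shift_to_chunk_offset(element_shift, chunk_shape):
    """Convert a per-axis shift in array elements to a shift in whole chunks.

    Hypothetical helper, not part of Icechunk. Raises ValueError when a shift
    is not a whole number of chunks, since a chunk offset cannot express that.
    """
    if len(element_shift) != len(chunk_shape):
        raise ValueError("element_shift and chunk_shape must have the same length")
    offset = []
    for axis, (shift, chunk) in enumerate(zip(element_shift, chunk_shape)):
        if shift % chunk != 0:
            raise ValueError(
                f"axis {axis}: shift of {shift} elements is not a multiple "
                f"of chunk size {chunk}"
            )
        offset.append(shift // chunk)
    return tuple(offset)

# Shifting by -20 elements along axis 0, with chunks of 10 x 5, is a shift of -2 chunks.
chunk_offset = element_shift_to_chunk_offset((-20, 0), chunk_shape=(10, 5))
# chunk_offset could then be passed as the offset argument of shift_array.
```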

Reindex Chunks with a Custom Function#

Provide a forward function that maps each old chunk index to its new index. Return None to drop a chunk. Add a backward function (the inverse of forward) so that stale positions are cleared correctly when empty chunks exist.

def fwd(idx):
    new = idx[0] - 2
    return [new] if new >= 0 else None

def bwd(idx):
    # n_chunks is the number of chunks along this axis (defined elsewhere)
    new = idx[0] + 2
    return [new] if new < n_chunks else None

session.reindex_array("/my_array", forward=fwd, backward=bwd)
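Before handing the two functions to reindex_array, it can help to check in plain Python that backward really undoes forward. The sketch below repeats the functions above for a 1-D array and assumes `n_chunks = 6`; `check_inverse` is a hypothetical helper, and passing indices as lists is an assumption made for illustration.

```python
n_chunks = 6  # assumed number of chunks along the single axis

def fwd(idx):
    new = idx[0] - 2
    return [new] if new >= 0 else None

def bwd(idx):
    new = idx[0] + 2
    return [new] if new < n_chunks else None

def check_inverse(forward, backward, n):
    """Return True if backward undoes forward for every chunk that survives."""
    for i in range(n):
        moved = forward([i])
        if moved is None:
            continue  # chunk is dropped, so there is nothing to invert
        if backward(moved) != [i]:
            return False
    return True

# Old index -> new index for the chunks that are kept
kept = {i: fwd([i])[0] for i in range(n_chunks) if fwd([i]) is not None}
is_consistent = check_inverse(fwd, bwd, n_chunks)
```

With these definitions, chunks 0 and 1 are dropped and the rest move two positions toward the origin.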

Move or Rename an Array or Group#

Moving and renaming requires a rearrange session.

session = repo.rearrange_session("main")
session.move("/old/path", "/new/path")
session.commit("Renamed old to new")

Repo Maintenance#

For more depth, see Data Expiration.

Run Snapshot Expiration#

from datetime import datetime, timedelta, timezone
# Timezone-aware UTC cutoff: snapshots older than 10 days are expired
expiry_time = datetime.now(timezone.utc) - timedelta(days=10)
expired = repo.expire_snapshots(older_than=expiry_time)

Run Garbage Collection#

results = repo.garbage_collect(expiry_time)

Usage in async contexts#

Most methods in Icechunk have an async counterpart, named with an _async suffix. For more info, see Async Usage.

results = await repo.garbage_collect_async(expiry_time)
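A bare `await` like the one above only works where an event loop is already running, such as inside a coroutine or a notebook. From an ordinary script, the call can be wrapped in a coroutine and driven with `asyncio.run`. This is a minimal sketch; `collect_garbage` is a hypothetical wrapper name, and it relies only on the `garbage_collect_async` method shown above.

```python
import asyncio

async def collect_garbage(repo, expiry_time):
    """Await the async counterpart of repo.garbage_collect from a coroutine."""
    return await repo.garbage_collect_async(expiry_time)

# From synchronous code (a script, not a notebook with a running event loop):
# results = asyncio.run(collect_garbage(repo, expiry_time))
```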