Version Control
Icechunk carries over concepts from other version control software (e.g. Git) to multidimensional arrays. Doing so helps ease the burden of managing multiple versions of your data, and helps you be precise about which version of your dataset is being used for downstream purposes.
Core concepts of Icechunk's version control system are:
- A snapshot bundles together related data and metadata changes in a single "transaction".
- A branch points to the latest snapshot in a series of snapshots. Multiple branches can co-exist at a given time, and multiple users can add snapshots to a single branch. One common pattern is to use dev, stage, and prod branches to separate versions of a dataset.
- A tag is an immutable reference to a snapshot, usually used to represent an "important" version of the dataset such as a release.
Snapshots, branches, and tags all refer to specific versions of your dataset. You can time-travel to any version of your data by passing a snapshot ID, a branch name, or a tag name when creating a new Session.
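To make the relationship between these three concepts concrete, here is a toy model in plain Python. It is only an illustration and not Icechunk's data model: the `ToyRepo` and `Snapshot` classes and the short IDs `s0`, `s1`, `s2` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """One committed version; parent_id links it to the previous version."""
    id: str
    parent_id: Optional[str]
    message: str


@dataclass
class ToyRepo:
    """Hypothetical stand-in for a repository, not Icechunk's real data model."""
    snapshots: dict = field(default_factory=dict)  # snapshot id -> Snapshot
    branches: dict = field(default_factory=dict)   # branch name -> tip id (moves on commit)
    tags: dict = field(default_factory=dict)       # tag name -> snapshot id (never moves)

    def commit(self, branch: str, snapshot_id: str, message: str) -> str:
        parent = self.branches.get(branch)
        self.snapshots[snapshot_id] = Snapshot(snapshot_id, parent, message)
        self.branches[branch] = snapshot_id  # the branch now points at the new tip
        return snapshot_id

    def create_tag(self, name: str, snapshot_id: str) -> None:
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists and cannot be moved")
        self.tags[name] = snapshot_id

    def ancestry(self, snapshot_id: str) -> list:
        """Walk parent pointers from a snapshot back to the root, newest first."""
        chain = []
        current = snapshot_id
        while current is not None:
            chain.append(current)
            current = self.snapshots[current].parent_id
        return chain


toy = ToyRepo()
toy.commit("main", "s0", "Repository initialized")
toy.commit("main", "s1", "Add foo attribute")
toy.create_tag("v1.0.0", "s1")
toy.commit("main", "s2", "Update foo attribute")

print(toy.ancestry(toy.branches["main"]))  # ['s2', 's1', 's0']
print(toy.ancestry(toy.tags["v1.0.0"]))    # ['s1', 's0']
```

The branch keeps moving as snapshots are added, while the tag stays fixed on the snapshot it was created for.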
Setup
To get started, we can create a new Repository.
Note
This example uses an in-memory storage backend, but you can also use any other storage backend instead.
On creation, a new Repository automatically gets a main branch with an initial snapshot. We can take a look at the ancestry of the main branch to confirm this.
repo.ancestry(branch="main")
# [SnapshotInfo(id="A840RMN5CF807CM66RY0", parent_id=None, written_at=datetime.datetime(2025,1,30,19,52,41,592998, tzinfo=datetime.timezone.utc), message="Repository...")]
Note
The ancestry method can be used to inspect the ancestry of any branch, snapshot, or tag.
We get back a list of SnapshotInfo objects, which contain information about the snapshot, including its ID, the ID of its parent snapshot, and the time it was written.
Creating a snapshot
Now that we have a Repository with a main branch, we can modify the data in the repository and create a new snapshot. First we need to create a writable Session from the main branch.
Note
Writable Session objects are required to create new snapshots, and can only be created from the tip of a branch. Sessions checked out from tags or other snapshots are read-only.
We can now access the zarr.Store from the Session and create a new root group. Then we can modify the attributes of the root group and create a new snapshot.
import zarr
root = zarr.group(session.store)
root.attrs["foo"] = "bar"
session.commit(message="Add foo attribute to root group")
# 'J1ZJHS4EEQW3ATKMV9TG'
Success! We've created a new snapshot with a new attribute on the root group.
Once we've committed the snapshot, the Session becomes read-only, and we can no longer modify the data using our existing Session. If we want to modify the data again, we need to create a new writable Session from the branch. Notice that we don't have to refresh the Repository to get the updates from the main branch. Instead, the Repository automatically fetches the latest snapshot from the branch when we create a new writable Session from it.
session = repo.writable_session(branch="main")
root = zarr.group(session.store)
root.attrs["foo"] = "baz"
session.commit(message="Update foo attribute on root group")
# 'BZ9YP38SWPW2E784VAB0'
With a few snapshots committed, we can take a look at the ancestry of the main branch:
for snapshot in repo.ancestry(branch="main"):
    print(snapshot)
# SnapshotInfo(id="BZ9YP38SWPW2E784VAB0", parent_id="J1ZJHS4EEQW3ATKMV9TG", written_at=datetime.datetime(2025,1,30,20,26,51,115330, tzinfo=datetime.timezone.utc), message="Update foo...")
# SnapshotInfo(id="J1ZJHS4EEQW3ATKMV9TG", parent_id="A840RMN5CF807CM66RY0", written_at=datetime.datetime(2025,1,30,20,26,50,9616, tzinfo=datetime.timezone.utc), message="Add foo at...")
# SnapshotInfo(id="A840RMN5CF807CM66RY0", parent_id=None, written_at=datetime.datetime(2025,1,30,20,26,47,66157, tzinfo=datetime.timezone.utc), message="Repository...")
The diagram below shows this history, where the arrows represent the parent-child relationship between snapshots.
gitGraph
commit id: "A840RMN5" type: NORMAL
commit id: "J1ZJHS4" type: NORMAL
commit id: "BZ9YP38" type: NORMAL
Time Travel
Now that we've created a new snapshot, we can time-travel back to the previous snapshot using the snapshot ID.
Note
Because the zarr Store is read-only, we need to pass mode="r" to the zarr.open_group function.
session = repo.readonly_session(snapshot_id="J1ZJHS4EEQW3ATKMV9TG")
root = zarr.open_group(session.store, mode="r")
root.attrs["foo"]
# 'bar'
Branches
If we want to modify the data from a previous snapshot, we can create a new branch from that snapshot with create_branch.
We can now create a new writable Session from the dev branch and modify the data.
session = repo.writable_session(branch="dev")
root = zarr.group(session.store)
root.attrs["foo"] = "balogna"
session.commit(message="Update foo attribute on root group")
# 'H1M3R93ZW19MYKCYASH0'
We can also create a new branch from the tip of the main branch if we want to modify our current working branch without modifying the main branch.
main_branch_snapshot_id = repo.lookup_branch("main")
repo.create_branch("feature", snapshot_id=main_branch_snapshot_id)
session = repo.writable_session(branch="feature")
root = zarr.group(session.store)
root.attrs["foo"] = "cherry"
session.commit(message="Update foo attribute on root group")
# 'S3QY2RDQQTRYFGJDTB6G'
With these branches created, the history of the repository now looks like this:
gitGraph
commit id: "A840RMN5" type: NORMAL
commit id: "J1ZJHS4" type: NORMAL
branch dev
checkout dev
commit id: "H1M3R93" type: NORMAL
checkout main
commit id: "BZ9YP38" type: NORMAL
checkout main
branch feature
commit id: "S3QY2RD" type: NORMAL
We can also list all branches in the repository.
If we need to find the snapshot that a branch is based on, we can use the lookup_branch method.
We can also delete a branch with delete_branch.
Finally, we can reset a branch to a previous snapshot with reset_branch. This immediately modifies the branch tip to the specified snapshot, changing the history of the branch.
Tags
Tags are immutable references to a snapshot. They are created with create_tag.
Because tags are immutable, we need to use a readonly Session to access the data referenced by a tag.
session = repo.readonly_session(tag="v1.0.0")
root = zarr.open_group(session.store, mode="r")
root.attrs["foo"]
# 'bar'
gitGraph
commit id: "A840RMN5" type: NORMAL
commit id: "J1ZJHS4" type: NORMAL
commit tag: "v1.0.0"
commit id: "BZ9YP38" type: NORMAL
We can also list all tags in the repository.
We can look up the snapshot that a tag points to with lookup_tag.
Finally, we can delete a tag with delete_tag.
Note
Tags are immutable: once a tag is deleted, a tag with the same name can never be created again.
Conflict Resolution
Icechunk is a serverless distributed system, and as such, it is possible to have multiple users or processes modifying the same data at the same time. Icechunk relies on the consistency guarantees of the underlying storage backends to ensure that the data is always consistent. In situations where two users or processes attempt to modify the same data at the same time, Icechunk will detect the conflict and raise an exception at commit time. This can be illustrated with the following example.
Let's create a fresh repository, add some attributes to the root group and create an array named data.
import icechunk
import numpy as np
import zarr
repo = icechunk.Repository.create(icechunk.in_memory_storage())
session = repo.writable_session(branch="main")
root = zarr.group(session.store)
root.attrs["foo"] = "bar"
root.create_dataset("data", shape=(10, 10), chunks=(1, 1), dtype=np.int32)
session.commit(message="Add foo attribute and data array")
# 'BG0W943WSNFMMVD1FXJ0'
Let's try to modify the data array in two different sessions, created from the main branch.
session1 = repo.writable_session(branch="main")
session2 = repo.writable_session(branch="main")
root1 = zarr.group(session1.store)
root2 = zarr.group(session2.store)
First, we'll modify the attributes of the root group from both sessions, and then try to commit the changes.
session1.commit(message="Update foo attribute on root group")
session2.commit(message="Update foo attribute on root group")
# AE9XS2ZWXT861KD2JGHG
# ---------------------------------------------------------------------------
# ConflictError Traceback (most recent call last)
# Cell In[7], line 11
# 8 root2.attrs["foo"] = "baz"
# 10 print(session1.commit(message="Update foo attribute on root group"))
# ---> 11 print(session2.commit(message="Update foo attribute on root group"))
# File ~/Developer/icechunk/icechunk-python/python/icechunk/session.py:224, in Session.commit(self, message, metadata)
# 222 return self._session.commit(message, metadata)
# 223 except PyConflictError as e:
# --> 224 raise ConflictError(e) from None
# ConflictError: Failed to commit, expected parent: Some("BG0W943WSNFMMVD1FXJ0"), actual parent: Some("AE9XS2ZWXT861KD2JGHG")
The first session was able to commit successfully, but the second session failed with a ConflictError. The second session's changes were made relative to the tip of the main branch at the time the session was created, but by commit time that tip had been moved by the first session.
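The check behind this error can be pictured as a compare-and-swap on the branch tip. The toy model below is plain Python and only illustrates the idea. It is not Icechunk's implementation, and `ToyBranch`, `ToyConflictError`, and the shortened IDs are hypothetical.

```python
class ToyConflictError(Exception):
    """Hypothetical stand-in for the ConflictError shown above."""


class ToyBranch:
    """A branch names a tip; each commit must state the tip it was based on."""

    def __init__(self, tip: str) -> None:
        self.tip = tip

    def commit(self, expected_parent: str, new_snapshot: str) -> str:
        # Only advance the tip if nobody has moved it since the session started.
        if self.tip != expected_parent:
            raise ToyConflictError(
                f"expected parent: {expected_parent}, actual parent: {self.tip}"
            )
        self.tip = new_snapshot
        return new_snapshot


main = ToyBranch(tip="BG0W")

# Both sessions start from the same tip.
base_seen_by_session1 = main.tip
base_seen_by_session2 = main.tip

# The first commit succeeds and moves the tip.
main.commit(base_seen_by_session1, "AE9X")

# The second commit still expects the old tip, so it is rejected.
try:
    main.commit(base_seen_by_session2, "XXXX")
    outcome = "committed"
except ToyConflictError as err:
    outcome = f"conflict: {err}"

print(outcome)  # conflict: expected parent: BG0W, actual parent: AE9X
```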
To resolve this conflict, we can use the rebase functionality.
Rebasing
To update the second session so it is based off the tip of the main branch, we can use the rebase method.
First, we can try to rebase, without merging any conflicting changes:
session2.rebase(icechunk.ConflictDetector())
# ---------------------------------------------------------------------------
# RebaseFailedError Traceback (most recent call last)
# Cell In[8], line 1
# ----> 1 session2.rebase(icechunk.ConflictDetector())
# File ~/Developer/icechunk/icechunk-python/python/icechunk/session.py:247, in Session.rebase(self, solver)
# 245 self._session.rebase(solver)
# 246 except PyRebaseFailedError as e:
# --> 247 raise RebaseFailedError(e) from None
# RebaseFailedError: Rebase failed on snapshot AE9XS2ZWXT861KD2JGHG: 1 conflicts found
This fails, however, because both sessions modified the foo attribute on the root group. We can use the RebaseFailedError to get more information about the conflict.
try:
    session2.rebase(icechunk.ConflictDetector())
except icechunk.RebaseFailedError as e:
    print(e.conflicts)
# [Conflict(UserAttributesDoubleUpdate, path=/)]
This tells us that the conflict is caused by the two sessions modifying the user attributes of the root group (/). In this case we have decided that the second session set the foo attribute to the correct value, so we can now rebase by instructing the rebase method to use the second session's changes with the BasicConflictSolver.
session2.rebase(icechunk.BasicConflictSolver(on_user_attributes_conflict=icechunk.VersionSelection.UseOurs))
Success! We can now try and commit the changes again.
This same process can be used to resolve conflicts with arrays. Let's try to modify the data array from both sessions.
session1 = repo.writable_session(branch="main")
session2 = repo.writable_session(branch="main")
root1 = zarr.group(session1.store)
root2 = zarr.group(session2.store)
root1["data"][0,0] = 1
root2["data"][0,:] = 2
We have now created a conflict, because the first session modified the first element of the data array, and the second session modified the first row of the data array. Let's commit the changes from the second session first, then see what conflicts are reported when we try to commit the changes from the first session.
print(session2.commit(message="Update first row of data array"))
print(session1.commit(message="Update first element of data array"))
# ---------------------------------------------------------------------------
# ConflictError Traceback (most recent call last)
# Cell In[15], line 2
# 1 print(session2.commit(message="Update first row of data array"))
# ----> 2 print(session1.commit(message="Update first element of data array"))
# File ~/Developer/icechunk/icechunk-python/python/icechunk/session.py:224, in Session.commit(self, message, metadata)
# 222 return self._session.commit(message, metadata)
# 223 except PyConflictError as e:
# --> 224 raise ConflictError(e) from None
# ConflictError: Failed to commit, expected parent: Some("SY4WRE8A9TVYMTJPEAHG"), actual parent: Some("5XRDGZPSG747AMMRTWT0")
Okay! We have a conflict. Let's see what conflicts are reported.
try:
    session1.rebase(icechunk.ConflictDetector())
except icechunk.RebaseFailedError as e:
    for conflict in e.conflicts:
        print(f"Conflict at {conflict.path}: {conflict.conflicted_chunks}")
# Conflict at /data: [[0, 0]]
We get a clear indication of the conflict, and the chunks that are conflicting. In this case we have decided that the first session's changes are correct, so we can again use the BasicConflictSolver to resolve the conflict.
session1.rebase(icechunk.BasicConflictSolver(on_chunk_conflict=icechunk.VersionSelection.UseOurs))
session1.commit(message="Update first element of data array")
# 'R4WXW2CYNAZTQ3HXTNK0'
Success! We have now resolved the conflict and committed the changes.
Let's look at the value of the data array to confirm that the conflict was resolved correctly.
session = repo.readonly_session(branch="main")
root = zarr.open_group(session.store, mode="r")
root["data"][0,:]
# array([1, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
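The result follows from resolving conflicts chunk by chunk: with chunks of shape (1, 1), only chunk (0, 0) was written by both sessions, so only that chunk takes the first session's value. The toy model below reproduces this in plain Python. It is a sketch of the idea and not Icechunk's implementation, and the helper `touched_chunks` is hypothetical.

```python
from itertools import product

CHUNK_SHAPE = (1, 1)  # matches chunks=(1, 1) in the example array


def touched_chunks(rows: range, cols: range) -> set:
    """Chunk indices covered by a write to the given row and column ranges."""
    chunk_rows = {r // CHUNK_SHAPE[0] for r in rows}
    chunk_cols = {c // CHUNK_SHAPE[1] for c in cols}
    return set(product(chunk_rows, chunk_cols))


# session1 ("ours") wrote data[0, 0] = 1.
ours = {chunk: 1 for chunk in touched_chunks(range(0, 1), range(0, 1))}
# session2 ("theirs") wrote data[0, :] = 2 and committed first.
theirs = {chunk: 2 for chunk in touched_chunks(range(0, 1), range(0, 10))}

# A chunk conflicts when both sessions wrote it.
conflicts = sorted(ours.keys() & theirs.keys())
print(conflicts)  # [(0, 0)]

# "Use ours" on conflicts: keep their committed chunks, then overlay every chunk we wrote.
merged = {**theirs, **ours}
first_row = [merged.get((0, col), 0) for col in range(10)]
print(first_row)  # [1, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```

With larger chunks the same two writes would overlap on more data, because a conflict is reported per chunk rather than per element.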
Lastly, if you make changes to non-conflicting chunks or attributes, you can rebase without having to resolve any conflicts.
session1 = repo.writable_session(branch="main")
session2 = repo.writable_session(branch="main")
root1 = zarr.group(session1.store)
root2 = zarr.group(session2.store)
root1["data"][3,:] = 3
root2["data"][4,:] = 4
session1.commit(message="Update fourth row of data array")
try:
    session2.rebase(icechunk.ConflictDetector())
    print("Rebase succeeded")
except icechunk.RebaseFailedError as e:
    print(e.conflicts)
session2.commit(message="Update fifth row of data array")
# Rebase succeeded
We can now read the data array to confirm that the changes were committed correctly.
session = repo.readonly_session(branch="main")
root = zarr.open_group(session.store, mode="r")
root["data"][:,:]
# array([[1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
# [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
Limitations
At the moment, the rebase functionality is limited to resolving conflicts with attributes on arrays and groups, and conflicts with chunks in arrays. Other types of conflicts cannot yet be resolved by Icechunk and must be resolved manually.