Expiring Data
Over time, an Icechunk Repository will accumulate many snapshots, not all of which need to be kept around.
"Expiration" allows you to mark snapshots as expired, and "garbage collection" deletes all data (manifests, chunks, snapshots, etc.) associated with expired snapshots.
First create a Repository, configured so that there are no "inline" chunks. This will help illustrate that data is actually deleted.
import icechunk

repo = icechunk.Repository.create(
    icechunk.in_memory_storage(),
    config=icechunk.RepositoryConfig(inline_chunk_threshold_bytes=0),
)
Generate a few snapshots
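Let us generate a sequence of snapshots:
import zarr
import time

for i in range(10):
    # Each iteration overwrites the array and commits, creating one snapshot.
    session = repo.writable_session("main")
    array = zarr.create_array(
        session.store, name="array", shape=(10,), fill_value=-1, dtype=int, overwrite=True
    )
    array[:] = i
    session.commit(f"snap {i}")
    # Pause so that consecutive snapshots get clearly distinct written_at times.
    time.sleep(0.1)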
There are now 11 snapshots: the 10 commits made above plus the initial snapshot created along with the repository.
ancestry = list(repo.ancestry(branch="main"))
print("\n\n".join([str((a.id, a.written_at)) for a in ancestry]))
('KH8QXTV32Z78HQSE1V70', datetime.datetime(2025, 4, 28, 2, 14, 3, 485117, tzinfo=datetime.timezone.utc))
('BTP7SX6Y67TCEZKQR2N0', datetime.datetime(2025, 4, 28, 2, 14, 3, 379788, tzinfo=datetime.timezone.utc))
('TK9FB94FZ4E01R7BPRYG', datetime.datetime(2025, 4, 28, 2, 14, 3, 274442, tzinfo=datetime.timezone.utc))
('QJ55STRHH0ZS79DW6C9G', datetime.datetime(2025, 4, 28, 2, 14, 3, 169301, tzinfo=datetime.timezone.utc))
('PHEVNETB7BW3BWMS9ZW0', datetime.datetime(2025, 4, 28, 2, 14, 3, 64182, tzinfo=datetime.timezone.utc))
('2HADTNH0FTATDQ0Q0WN0', datetime.datetime(2025, 4, 28, 2, 14, 2, 958957, tzinfo=datetime.timezone.utc))
('926GA6V46JQMZHSWATB0', datetime.datetime(2025, 4, 28, 2, 14, 2, 853904, tzinfo=datetime.timezone.utc))
('A5S2DM6GBQ9N232AKCWG', datetime.datetime(2025, 4, 28, 2, 14, 2, 748661, tzinfo=datetime.timezone.utc))
('FKNQ6TR1FX7C12V2AY1G', datetime.datetime(2025, 4, 28, 2, 14, 2, 643432, tzinfo=datetime.timezone.utc))
('AWKDY4WPG2P9E36JXR10', datetime.datetime(2025, 4, 28, 2, 14, 2, 538001, tzinfo=datetime.timezone.utc))
('KP4446Y9N64A324MVMS0', datetime.datetime(2025, 4, 28, 2, 14, 2, 530819, tzinfo=datetime.timezone.utc))
Expire snapshots
Danger
Expiring snapshots is an irreversible operation. Use it with care.
First we must expire snapshots. Here we expire every snapshot older than the fifth commit, which is ancestry[5] because the ancestry is listed newest first.
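A minimal sketch of this step, assuming `Repository.expire_snapshots` accepts an `older_than` datetime and returns the set of expired snapshot IDs, which is consistent with the output shown below. The helper name `expire_older_than_nth` is hypothetical and exists only for illustration; it is not part of Icechunk.

```python
def expire_older_than_nth(repo, ancestry, n=5):
    """Expire every snapshot strictly older than ancestry[n].

    Hypothetical helper, not part of Icechunk.
    `ancestry` is expected newest-first, e.g. list(repo.ancestry(branch="main")).
    Returns the cutoff time and the set of expired snapshot IDs.
    """
    # The cutoff is the commit time of the n-th entry counted from the newest.
    expiry_time = ancestry[n].written_at
    # Assumption: expire_snapshots(older_than=...) returns the expired IDs.
    expired = repo.expire_snapshots(older_than=expiry_time)
    return expiry_time, expired
```

Called as `expiry_time, expired = expire_older_than_nth(repo, ancestry)` with the `repo` and `ancestry` from above, printing the two results gives output of this form: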
2025-04-28 02:14:02.958957+00:00
{'A5S2DM6GBQ9N232AKCWG', 'FKNQ6TR1FX7C12V2AY1G', '926GA6V46JQMZHSWATB0', 'AWKDY4WPG2P9E36JXR10'}
This prints out the set of snapshots that were expired.
Note
The first snapshot is never expired!
Confirm that these are the right snapshots (remember that ancestry lists commits in decreasing order of written_at time):
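One way to run this check is to filter the original ancestry by the cutoff time. The sketch below assumes each ancestry entry exposes `id` and `written_at` (as used earlier on this page); the function name `ids_older_than` is hypothetical. The last entry is skipped because it is the initial snapshot, which is never expired.

```python
def ids_older_than(ancestry, cutoff):
    """Return IDs of snapshots written strictly before `cutoff`.

    Hypothetical helper, not part of Icechunk.
    `ancestry` is newest-first; its last entry is the repository's
    initial snapshot, which is never expired, so it is excluded.
    """
    return [a.id for a in ancestry[:-1] if a.written_at < cutoff]
```

With the ancestry captured before expiration and the cutoff time used for expiration (`ancestry[5].written_at` in this walkthrough), `print(ids_older_than(ancestry, cutoff))` should list the same IDs as the expired set: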
['926GA6V46JQMZHSWATB0', 'A5S2DM6GBQ9N232AKCWG', 'FKNQ6TR1FX7C12V2AY1G', 'AWKDY4WPG2P9E36JXR10']
Note that ancestry is now shorter:
new_ancestry = list(repo.ancestry(branch="main"))
print("\n\n".join([str((a.id, a.written_at)) for a in new_ancestry]))
('KH8QXTV32Z78HQSE1V70', datetime.datetime(2025, 4, 28, 2, 14, 3, 485117, tzinfo=datetime.timezone.utc))
('BTP7SX6Y67TCEZKQR2N0', datetime.datetime(2025, 4, 28, 2, 14, 3, 379788, tzinfo=datetime.timezone.utc))
('TK9FB94FZ4E01R7BPRYG', datetime.datetime(2025, 4, 28, 2, 14, 3, 274442, tzinfo=datetime.timezone.utc))
('QJ55STRHH0ZS79DW6C9G', datetime.datetime(2025, 4, 28, 2, 14, 3, 169301, tzinfo=datetime.timezone.utc))
('PHEVNETB7BW3BWMS9ZW0', datetime.datetime(2025, 4, 28, 2, 14, 3, 64182, tzinfo=datetime.timezone.utc))
('2HADTNH0FTATDQ0Q0WN0', datetime.datetime(2025, 4, 28, 2, 14, 2, 958957, tzinfo=datetime.timezone.utc))
('KP4446Y9N64A324MVMS0', datetime.datetime(2025, 4, 28, 2, 14, 2, 530819, tzinfo=datetime.timezone.utc))
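As an extra sanity check, the two ancestry listings can be compared directly. This is a sketch under the assumption that ancestry entries expose an `id` attribute, as in the listings above; `dropped_from_ancestry` is a hypothetical name, not an Icechunk function.

```python
def dropped_from_ancestry(old_ancestry, new_ancestry):
    """Return IDs present in the old ancestry but missing from the new one.

    Hypothetical helper, not part of Icechunk. Order follows old_ancestry.
    """
    remaining = {a.id for a in new_ancestry}
    return [a.id for a in old_ancestry if a.id not in remaining]
```

For the listings on this page, `dropped_from_ancestry(ancestry, new_ancestry)` should contain exactly the four IDs reported as expired.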
Delete expired data
Danger
Garbage collection is an irreversible operation that deletes data. Use it with care.
Use Repository.garbage_collect to delete data associated with expired snapshots:
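A minimal sketch of the call, assuming `garbage_collect` takes the cutoff datetime as its first argument and returns a summary object whose attributes match the field names in the output below. The wrapper `delete_expired_data` and its report string are hypothetical additions for illustration.

```python
def delete_expired_data(repo, expiry_time):
    """Run garbage collection and return the summary plus a short report.

    Hypothetical helper, not part of Icechunk.
    """
    # Assumption: the cutoff datetime is the first positional argument.
    summary = repo.garbage_collect(expiry_time)
    # Assumption: the summary exposes these counters as attributes.
    report = (
        f"deleted {summary.snapshots_deleted} snapshots, "
        f"{summary.manifests_deleted} manifests, "
        f"{summary.chunks_deleted} chunks "
        f"({summary.bytes_deleted} bytes)"
    )
    return summary, report
```

Called with the same cutoff time that was used for expiration, printing the returned `summary` for the repository in this walkthrough gives: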
GCSummary(bytes_deleted=3838, chunks_deleted=4, manifests_deleted=4, snapshots_deleted=4, attributes_deleted=0, transaction_logs_deleted=4)