Expiring Data

Over time, an Icechunk Repository will accumulate many snapshots, not all of which need to be kept around.

"Expiration" allows you to mark snapshots as expired, and "garbage collection" deletes all data (manifests, chunks, snapshots, etc.) associated with expired snapshots.

First create a Repository, configured so that there are no "inline" chunks. This will help illustrate that data is actually deleted.

import icechunk

repo = icechunk.Repository.create(
    icechunk.in_memory_storage(),
    config=icechunk.RepositoryConfig(inline_chunk_threshold_bytes=0),
)

Generate a few snapshots

Let us generate a sequence of snapshots:

import zarr
import time

for i in range(10):
    session = repo.writable_session("main")
    array = zarr.create_array(
        session.store, name="array", shape=(10,), fill_value=-1, dtype=int, overwrite=True
    )
    array[:] = i
    session.commit(f"snap {i}")
    time.sleep(0.1)

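There are now 11 snapshots: the 10 commits made above plus the initial snapshot that was created along with the repository.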

ancestry = list(repo.ancestry(branch="main"))
print("\n\n".join([str((a.id, a.written_at)) for a in ancestry]))

('KH8QXTV32Z78HQSE1V70', datetime.datetime(2025, 4, 28, 2, 14, 3, 485117, tzinfo=datetime.timezone.utc))

('BTP7SX6Y67TCEZKQR2N0', datetime.datetime(2025, 4, 28, 2, 14, 3, 379788, tzinfo=datetime.timezone.utc))

('TK9FB94FZ4E01R7BPRYG', datetime.datetime(2025, 4, 28, 2, 14, 3, 274442, tzinfo=datetime.timezone.utc))

('QJ55STRHH0ZS79DW6C9G', datetime.datetime(2025, 4, 28, 2, 14, 3, 169301, tzinfo=datetime.timezone.utc))

('PHEVNETB7BW3BWMS9ZW0', datetime.datetime(2025, 4, 28, 2, 14, 3, 64182, tzinfo=datetime.timezone.utc))

('2HADTNH0FTATDQ0Q0WN0', datetime.datetime(2025, 4, 28, 2, 14, 2, 958957, tzinfo=datetime.timezone.utc))

('926GA6V46JQMZHSWATB0', datetime.datetime(2025, 4, 28, 2, 14, 2, 853904, tzinfo=datetime.timezone.utc))

('A5S2DM6GBQ9N232AKCWG', datetime.datetime(2025, 4, 28, 2, 14, 2, 748661, tzinfo=datetime.timezone.utc))

('FKNQ6TR1FX7C12V2AY1G', datetime.datetime(2025, 4, 28, 2, 14, 2, 643432, tzinfo=datetime.timezone.utc))

('AWKDY4WPG2P9E36JXR10', datetime.datetime(2025, 4, 28, 2, 14, 2, 538001, tzinfo=datetime.timezone.utc))

('KP4446Y9N64A324MVMS0', datetime.datetime(2025, 4, 28, 2, 14, 2, 530819, tzinfo=datetime.timezone.utc))

Expire snapshots

Danger

Expiring snapshots is an irreversible operation. Use it with care.

First we must expire snapshots. Here we expire every snapshot written before ancestry[5], which is the fifth commit made above ("snap 4").

expiry_time = ancestry[5].written_at
print(expiry_time)

2025-04-28 02:14:02.958957+00:00
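In practice the cutoff is more likely to come from a retention window than from a specific snapshot. The following is a minimal sketch of that idea using only the standard library. The helper name retention_cutoff and the 30-day window are hypothetical choices for illustration, and the sketch assumes a timezone-aware UTC datetime is wanted, to match the written_at values shown above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window; pick whatever fits your workflow.
RETENTION = timedelta(days=30)


def retention_cutoff(now: datetime, retention: timedelta = RETENTION) -> datetime:
    """Return the UTC timestamp that lies `retention` before `now`."""
    if now.tzinfo is None:
        # Naive datetimes are ambiguous, so refuse them rather than guess a zone.
        raise ValueError("expected a timezone-aware datetime")
    return now.astimezone(timezone.utc) - retention


cutoff = retention_cutoff(datetime.now(timezone.utc))
print(cutoff)
```

The resulting value could then be passed as older_than to repo.expire_snapshots in place of ancestry[5].written_at.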

expired = repo.expire_snapshots(older_than=expiry_time)
print(expired)

{'A5S2DM6GBQ9N232AKCWG', 'FKNQ6TR1FX7C12V2AY1G', '926GA6V46JQMZHSWATB0', 'AWKDY4WPG2P9E36JXR10'}

This prints out the set of snapshots that were expired.

Note

The first snapshot is never expired!

Confirm that these are the right snapshots (remember that ancestry lists commits in decreasing order of written_at time, so the slice below leaves out the initial snapshot at the end of the list):

print([a.id for a in ancestry[-5:-1]])

['926GA6V46JQMZHSWATB0', 'A5S2DM6GBQ9N232AKCWG', 'FKNQ6TR1FX7C12V2AY1G', 'AWKDY4WPG2P9E36JXR10']

Note that ancestry is now shorter:

new_ancestry = list(repo.ancestry(branch="main"))
print("\n\n".join([str((a.id, a.written_at)) for a in new_ancestry]))

('KH8QXTV32Z78HQSE1V70', datetime.datetime(2025, 4, 28, 2, 14, 3, 485117, tzinfo=datetime.timezone.utc))

('BTP7SX6Y67TCEZKQR2N0', datetime.datetime(2025, 4, 28, 2, 14, 3, 379788, tzinfo=datetime.timezone.utc))

('TK9FB94FZ4E01R7BPRYG', datetime.datetime(2025, 4, 28, 2, 14, 3, 274442, tzinfo=datetime.timezone.utc))

('QJ55STRHH0ZS79DW6C9G', datetime.datetime(2025, 4, 28, 2, 14, 3, 169301, tzinfo=datetime.timezone.utc))

('PHEVNETB7BW3BWMS9ZW0', datetime.datetime(2025, 4, 28, 2, 14, 3, 64182, tzinfo=datetime.timezone.utc))

('2HADTNH0FTATDQ0Q0WN0', datetime.datetime(2025, 4, 28, 2, 14, 2, 958957, tzinfo=datetime.timezone.utc))

('KP4446Y9N64A324MVMS0', datetime.datetime(2025, 4, 28, 2, 14, 2, 530819, tzinfo=datetime.timezone.utc))
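The behaviour observed above can be reproduced with a small toy model in plain Python. This is not Icechunk's implementation; it is only a sketch that assumes the rule "expire every snapshot strictly older than the cutoff, except the initial one", which is consistent with the output printed in this example. The function model_expiry and the helper ts are hypothetical names, and the IDs and timestamps are copied from the ancestry listing above.

```python
from datetime import datetime, timezone


def ts(second: int, micro: int) -> datetime:
    """Build one of the 2025-04-28 02:14:SS.ffffff UTC timestamps printed above."""
    return datetime(2025, 4, 28, 2, 14, second, micro, tzinfo=timezone.utc)


# (snapshot id, written_at) pairs, newest first, copied from the ancestry output.
ancestry = [
    ("KH8QXTV32Z78HQSE1V70", ts(3, 485117)),
    ("BTP7SX6Y67TCEZKQR2N0", ts(3, 379788)),
    ("TK9FB94FZ4E01R7BPRYG", ts(3, 274442)),
    ("QJ55STRHH0ZS79DW6C9G", ts(3, 169301)),
    ("PHEVNETB7BW3BWMS9ZW0", ts(3, 64182)),
    ("2HADTNH0FTATDQ0Q0WN0", ts(2, 958957)),
    ("926GA6V46JQMZHSWATB0", ts(2, 853904)),
    ("A5S2DM6GBQ9N232AKCWG", ts(2, 748661)),
    ("FKNQ6TR1FX7C12V2AY1G", ts(2, 643432)),
    ("AWKDY4WPG2P9E36JXR10", ts(2, 538001)),
    ("KP4446Y9N64A324MVMS0", ts(2, 530819)),
]


def model_expiry(ancestry, older_than):
    """Toy model: expire strictly older snapshots, always keeping the initial one."""
    *rest, _initial = ancestry  # the last entry is the initial snapshot
    expired = {sid for sid, written_at in rest if written_at < older_than}
    surviving = [sid for sid, _ in ancestry if sid not in expired]
    return expired, surviving


expiry_time = ancestry[5][1]
expired, surviving = model_expiry(ancestry, expiry_time)
print(sorted(expired))
print(surviving)
```

Under this model the cutoff snapshot itself survives because the comparison is strict, and the initial snapshot survives because it is excluded from the candidates.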

Delete expired data

Danger

Garbage collection is an irreversible operation that deletes data. Use it with care.

Use Repository.garbage_collect to delete the data associated with expired snapshots:

results = repo.garbage_collect(expiry_time)
print(results)

GCSummary(bytes_deleted=3838, chunks_deleted=4, manifests_deleted=4, snapshots_deleted=4, attributes_deleted=0, transaction_logs_deleted=4)