
Storage#

Icechunk can be configured to work with both object storage and filesystem backends. The storage configuration defines the location of an Icechunk store, along with any options or information needed to access data from a given storage type.
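
Whichever backend you choose, the resulting storage object is passed to a repository. As a minimal sketch (the local path below is just a placeholder; any of the helpers described on this page can be substituted):

import icechunk

# Sketch: create (or open) a repository from a storage configuration.
# local_filesystem_storage is used here only as the simplest placeholder.
storage = icechunk.local_filesystem_storage("/tmp/icechunk-demo")
repo = icechunk.Repository.open_or_create(storage)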

S3 Storage#

When using Icechunk with S3-compatible storage systems, credentials must be provided to allow access to the data on the given endpoint. Icechunk supports creating the S3 storage config in four ways:

With this option, the credentials for connecting to S3 are detected automatically from your environment. This is usually the best choice if you are connecting from within an AWS environment (e.g. from EC2). See the API

icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    from_env=True
)

With this option, you provide your credentials and other details explicitly. See the API

icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    region='us-east-1',
    access_key_id='my-access-key',
    secret_access_key='my-secret-key',
    # session token is optional
    session_token='my-token',
    endpoint_url=None,  # if using a custom endpoint
    allow_http=False,  # allow http connections (default is False)
)

With this option, you connect to S3 anonymously (without credentials). This is suitable for public data. See the API

icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    region='us-east-1',
    anonymous=True,
)

With this option, you provide a callback function that will be called to obtain S3 credentials when needed. This is useful for workloads that depend on retrieving short-lived credentials from AWS or similar authority, allowing for credentials to be refreshed as needed without interrupting any workflows. See the API

from datetime import datetime, timedelta, UTC
import icechunk

def get_credentials() -> icechunk.S3StaticCredentials:
    # In practice, you would fetch real credentials and return them along with
    # an optional expiration time, which triggers this callback to run again
    return icechunk.S3StaticCredentials(
        access_key_id="xyz",
        secret_access_key="abc",
        expires_after=datetime.now(UTC) + timedelta(days=1),
    )

icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    region='us-east-1',
    get_credentials=get_credentials,
)

Tigris#

Tigris is available as a storage backend for Icechunk. Icechunk provides a helper function specifically for creating Tigris storage configurations.

icechunk.tigris_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    access_key_id='my-access-key',
    secret_access_key='my-secret-key',
)

Even though Tigris is API-compatible with S3, this function is needed because Tigris implements a different form of consistency. If you instead use s3_storage with the Tigris endpoint, Icechunk won't be able to achieve all of its consistency guarantees.

Cloudflare R2#

Icechunk can use Cloudflare R2's S3-compatible API. You will need to:

  1. provide either the account ID or set the endpoint URL specific to your bucket: https://<ACCOUNT_ID>.r2.cloudflarestorage.com; and
  2. create an API token to generate a secret access key and access key ID.

icechunk.r2_storage(
    bucket="bucket-name",
    prefix="icechunk-test/quickstart-demo-1",
    access_key_id='my-access-key',
    secret_access_key='my-secret-key',
    account_id='my-account-id',
)
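
Alternatively, as noted above, you can set the bucket-specific endpoint URL instead of the account ID (a sketch; the endpoint below is the placeholder form from the description):

icechunk.r2_storage(
    bucket="bucket-name",
    prefix="icechunk-test/quickstart-demo-1",
    access_key_id='my-access-key',
    secret_access_key='my-secret-key',
    endpoint_url='https://<ACCOUNT_ID>.r2.cloudflarestorage.com',
)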

For buckets with public access,

icechunk.r2_storage(
    prefix="icechunk-test/quickstart-demo-1",
    endpoint_url="https://public-url",
)

Minio#

Minio is available as a storage backend for Icechunk. Functionally this storage backend is the same as S3 storage, but with a different endpoint.

For example, if we have a Minio server running at http://localhost:9000 with access key minio and secret key minio123 we can create a storage configuration as follows:

icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    region='us-east-1',
    access_key_id='minio',
    secret_access_key='minio123',
    endpoint_url='http://localhost:9000',
    allow_http=True,
    force_path_style=True,
)

A few things to note:

  1. The endpoint_url parameter is set to the URL of the Minio server.
  2. If the Minio server is running over HTTP and not HTTPS, the allow_http parameter must be set to True.
  3. Even though this is running on a local server, the region parameter must still be set to a valid region; us-east-1 is a safe default.

Google Cloud Storage#

Icechunk can be used with Google Cloud Storage.

With this option, the credentials for connecting to GCS are detected automatically from your environment. See the API

icechunk.gcs_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    from_env=True
)

With this option, you provide the path to a service account file. See the API

icechunk.gcs_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    service_account_file="/path/to/service-account.json"
)

With this option, you provide the service account key as a string. See the API

icechunk.gcs_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    service_account_key={
        "type": "service_account",
        "project_id": "my-project",
        "private_key_id": "my-private-key-id",
        "private_key": "-----BEGIN PRIVATE KEY-----\nmy-private-key\n-----END PRIVATE KEY-----\n",
        "client_email": "
        },
)

With this option, you use application default credentials (ADC) to authenticate with GCS. Provide the path to the credentials. See the API

icechunk.gcs_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    application_credentials="/path/to/application-credentials.json"
)

With this option, you provide a bearer token to use for the object store. This is useful for short-lived workflows where expiration is not relevant or when the bearer token will not expire. See the API

icechunk.gcs_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    bearer_token="my-bearer-token"
)

With this option, you provide a callback function that will be called to obtain GCS credentials when needed. This is useful for workloads that depend on retrieving short-lived credentials from GCS or similar authority, allowing for credentials to be refreshed as needed without interrupting any workflows. This works at a lower level than the other methods, and accepts a bearer token and expiration time. These are the same credentials that are created for you when specifying the service account file, key, or ADC. See the API

from datetime import datetime, timedelta, UTC
import icechunk

def get_credentials() -> icechunk.GcsBearerCredential:
    # In practice, you would fetch real credentials and return them along with
    # an optional expiration time, which triggers this callback to run again
    return icechunk.GcsBearerCredential(bearer="my-bearer-token", expires_after=datetime.now(UTC) + timedelta(days=1))

icechunk.gcs_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    get_credentials=get_credentials,
)

Limitations#

  • The consistency guarantees for GCS work differently than those for S3. Specifically, GCS uses the object generation instead of the etag for if-match put requests. Icechunk has not wired this through yet, so updating the configuration is potentially unsafe. This is not a problem for most use cases, which do not update the configuration frequently.
  • GCS does not yet support bearer tokens and auth refreshing. This means currently auth is limited to service account files.
  • The GCS storage config does not yet support anonymous access.

Azure Blob Storage#

Icechunk can be used with Azure Blob Storage.

With this option, the credentials for connecting to Azure Blob Storage are detected automatically from your environment. See the API

icechunk.azure_storage(
    account="my-account-name",
    container="icechunk-test",
    prefix="quickstart-demo-1",
    from_env=True
)

With this option, you provide your credentials and other details explicitly. See the API

icechunk.azure_storage(
    account_name='my-account-name',
    container="icechunk-test",
    prefix="quickstart-demo-1",
    account_key='my-account-key',
    access_token=None,  # optional
    sas_token=None,  # optional
    bearer_token=None,  # optional
)

Filesystem Storage#

Icechunk can also be used on a local filesystem by providing a path to the location of the store.

icechunk.local_filesystem_storage("/path/to/my/dataset")
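
For a throwaway local store, e.g. in a quick experiment, you can combine this with a temporary directory (a sketch; the directory choice is arbitrary):

import tempfile
import icechunk

# Sketch: a scratch Icechunk store in a temporary directory
storage = icechunk.local_filesystem_storage(tempfile.mkdtemp())
repo = icechunk.Repository.create(storage)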

Limitations#

  • Icechunk currently does not work with a local filesystem storage backend on Windows. See this issue for more discussion. To work around this, try using WSL or a cloud storage backend.

In Memory Storage#

While it should never be used for production data, Icechunk can also be used with an in-memory storage backend. This is useful for testing and development. The storage is volatile: when the Python process ends, all data is lost.

icechunk.in_memory_storage()
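
For example, a test can create a repository against in-memory storage and work with it entirely in process (a sketch; the writable_session call assumes the default main branch):

import icechunk

# Sketch: an ephemeral repository for a unit test; nothing is persisted
storage = icechunk.in_memory_storage()
repo = icechunk.Repository.create(storage)
session = repo.writable_session("main")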