# Storage
Icechunk can be configured to work with both object storage and filesystem backends. The storage configuration defines the location of an Icechunk store, along with any options or information needed to access data from a given storage type.
## S3 Storage
When using Icechunk with S3-compatible storage systems, credentials must be provided to allow access to the data on the given endpoint. Icechunk allows for creating the storage config for S3 in four ways:
With this option, the credentials for connecting to S3 are detected automatically from your environment. This is usually the best choice if you are connecting from within an AWS environment (e.g. from EC2). [See the API](./reference.md#icechunk.s3_storage)
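A minimal sketch of this mode, assuming the `from_env` flag on `icechunk.s3_storage` selects environment-based credential detection (check the API reference for the exact signature):

```python
import icechunk

# Credentials are resolved from the environment (env vars, profiles, IMDS, etc.)
storage = icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    from_env=True,
)
```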
With this option, you provide your credentials and other details explicitly. [See the API](./reference.md#icechunk.s3_storage)
```python
icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    region='us-east-1',
    access_key_id='my-access-key',
    secret_access_key='my-secret-key',
    # session token is optional
    session_token='my-token',
    endpoint_url=None,  # if using a custom endpoint
    allow_http=False,  # allow http connections (default is False)
)
```
With this option, you connect to S3 anonymously (without credentials). This is suitable for public data. [See the API](./reference.md#icechunk.s3_storage)
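A minimal sketch, assuming the `anonymous` flag on `icechunk.s3_storage` enables credential-free access:

```python
import icechunk

# No credentials are sent; this works only for publicly readable buckets
storage = icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    region='us-east-1',
    anonymous=True,
)
```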
With this option, you provide a callback function that will be called to obtain S3 credentials when needed. This is useful for workloads that depend on retrieving short-lived credentials from AWS or a similar authority, allowing for credentials to be refreshed as needed without interrupting any workflows. [See the API](./reference.md#icechunk.s3_storage)
```python
from datetime import UTC, datetime, timedelta

import icechunk


def get_credentials() -> icechunk.S3StaticCredentials:
    # In practice, you would use a function that actually fetches the credentials and returns them
    # along with an optional expiration time, which will trigger this callback to run again
    return icechunk.S3StaticCredentials(
        access_key_id="xyz",
        secret_access_key="abc",
        expires_after=datetime.now(UTC) + timedelta(days=1),
    )


icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    region='us-east-1',
    get_credentials=get_credentials,
)
```
### AWS Limitations
- Icechunk is currently incompatible with S3 Express One Zone. See this issue for more discussion.
## Tigris
Tigris is available as a storage backend for Icechunk. Functionally this storage backend is the same as S3 storage, but with a different endpoint. Icechunk provides a helper function specifically for creating Tigris storage configurations.
```python
icechunk.tigris_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    access_key_id='my-access-key',
    secret_access_key='my-secret-key',
)
```
There are a few things to be aware of when using Tigris:

- Tigris is a globally distributed object store by default. The caveat is that Tigris does not currently support the full consistency guarantees when the store is distributed across multiple regions. For now, to get all the consistency guarantees Icechunk offers, you will need to set up your Tigris bucket as restricted to a single region. This can be done by setting the region in the Tigris bucket settings.
## Minio
Minio is available as a storage backend for Icechunk. Functionally this storage backend is the same as S3 storage, but with a different endpoint.
For example, if we have a Minio server running at `http://localhost:9000` with access key `minio` and secret key `minio123`, we can create a storage configuration as follows:
```python
icechunk.s3_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    region='us-east-1',
    access_key_id='minio',
    secret_access_key='minio123',
    endpoint_url='http://localhost:9000',
    allow_http=True,
)
```
A few things to note:

- The `endpoint_url` parameter is set to the URL of the Minio server.
- If the Minio server is running over HTTP and not HTTPS, the `allow_http` parameter must be set to `True`.
- Even though this is running on a local server, the `region` parameter must still be set to a valid region. By default use `us-east-1`.
## Google Cloud Storage
Icechunk can be used with Google Cloud Storage.
With this option, the credentials for connecting to GCS are detected automatically from your environment. [See the API](./reference.md#icechunk.gcs_storage)
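A minimal sketch, assuming the `from_env` flag on `icechunk.gcs_storage` selects environment-based credential detection:

```python
import icechunk

# Credentials are picked up from the environment
# (e.g. the GOOGLE_APPLICATION_CREDENTIALS env var)
storage = icechunk.gcs_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    from_env=True,
)
```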
With this option, you provide the path to a service account file. [See the API](./reference.md#icechunk.gcs_storage)
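A sketch of this mode, assuming the helper takes a `service_account_file` path (the path itself is a placeholder):

```python
import icechunk

# Point at a service account JSON key file on disk
storage = icechunk.gcs_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    service_account_file="/path/to/service-account.json",
)
```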
With this option, you provide the service account key as a string. [See the API](./reference.md#icechunk.gcs_storage)
```python
icechunk.gcs_storage(
    bucket="icechunk-test",
    prefix="quickstart-demo-1",
    service_account_key={
        "type": "service_account",
        "project_id": "my-project",
        "private_key_id": "my-private-key-id",
        "private_key": "-----BEGIN PRIVATE KEY-----\nmy-private-key\n-----END PRIVATE KEY-----\n",
        "client_email": "my-service-account@my-project.iam.gserviceaccount.com",  # placeholder
    },
)
```
With this option, you use the [application default credentials (ADC)](https://cloud.google.com/docs/authentication/provide-credentials-adc) to authenticate with GCS. Provide the path to the credentials. [See the API](./reference.md#icechunk.gcs_storage)
```python
icechunk.gcs_storage(
bucket="icechunk-test",
prefix="quickstart-demo-1",
application_credentials="/path/to/application-credentials.json"
)
```
### Limitations
- The consistency guarantees for GCS function differently than S3. Specifically, GCS uses the generation instead of the etag for `if-match` `put` requests. Icechunk has not wired this through yet, and thus configuration updating is potentially unsafe. This is not a problem for most use cases that are not frequently updating the configuration.
- GCS does not yet support `bearer` tokens and auth refreshing. This means currently auth is limited to service account files.
- The GCS storage config does not yet support anonymous access.
## Azure Blob Storage
Icechunk can be used with Azure Blob Storage.
With this option, the credentials for connecting to Azure Blob Storage are detected automatically from your environment. [See the API](./reference.md#icechunk.azure_storage)
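A minimal sketch, assuming `icechunk.azure_storage` takes an `account` name, `container`, and `prefix`, with a `from_env` flag for environment-based credentials (all names here are placeholders):

```python
import icechunk

# Credentials are resolved from the environment
# (e.g. the AZURE_STORAGE_ACCOUNT_KEY env var)
storage = icechunk.azure_storage(
    account="my-account-name",
    container="icechunk-test",
    prefix="quickstart-demo-1",
    from_env=True,
)
```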
With this option, you provide your credentials and other details explicitly. [See the API](./reference.md#icechunk.azure_storage)
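A sketch of explicit credentials, assuming the helper accepts an `account_key` for shared-key auth (the parameter name and values are assumptions; consult the API reference for the full set of supported credential options):

```python
import icechunk

storage = icechunk.azure_storage(
    account="my-account-name",
    container="icechunk-test",
    prefix="quickstart-demo-1",
    account_key="my-account-key",  # assumed parameter for shared-key auth
)
```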
## Filesystem Storage
Icechunk can also be used on a local filesystem by providing a path to the location of the store.
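For example, a minimal sketch using the `local_filesystem_storage` helper (the path is a placeholder):

```python
import icechunk

# The store is created at (or opened from) this directory
storage = icechunk.local_filesystem_storage("/path/to/my/dataset")
```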
### Limitations
- Icechunk currently does not work with a local filesystem storage backend on Windows. See this issue for more discussion. As a workaround, try using WSL or a cloud storage backend.
## In Memory Storage
While it should never be used for production data, Icechunk can also be used with an in-memory storage backend. This is useful for testing and development purposes. The store is volatile: when the Python process ends, all data is lost.
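A minimal sketch, assuming `in_memory_storage` is the helper for this backend:

```python
import icechunk

# Everything lives in process memory and disappears when the process exits
storage = icechunk.in_memory_storage()
```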