Virtual Datasets#

While Icechunk works wonderfully with native chunks managed by Zarr, there is a lot of archival data out there in other formats already. To interoperate with such data, Icechunk supports "virtual" chunks, where any number of chunks in a given dataset may reference external data in existing archival formats, such as netCDF, HDF, GRIB, or TIFF. Virtual chunks are loaded directly from the original source without copying or modifying the original archival data files. This enables Icechunk to manage large datasets built from existing data without needing that data to be in Zarr format already.

Note

The concept of a "virtual Zarr dataset" originates from the Kerchunk project, which preceded and inspired VirtualiZarr. Like VirtualiZarr, the Kerchunk package provides functionality to scan the metadata of existing data files and combine these references into larger virtual datasets, but unlike VirtualiZarr, the Kerchunk package currently has no facility for writing to Icechunk stores. If you were previously interested in "Kerchunking" your data, you can now achieve a similar result by using VirtualiZarr to create virtual datasets and write them to Icechunk.

VirtualiZarr lets users ingest existing data files into virtual datasets using various different tools under the hood, including kerchunk, xarray, zarr, and now icechunk. It does so by creating virtual references to existing data that can be combined and manipulated to create larger virtual datasets using xarray. These datasets can then be exported to kerchunk reference format or to an Icechunk repository, without ever copying or moving the existing data files.
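Conceptually, each virtual reference records where one chunk's bytes live: a file location, a byte offset, and a length (the same three values taken by set_virtual_ref later in this guide). The sketch below is a plain-Python illustration of how per-file references could be combined along a new axis. It is not VirtualiZarr's implementation; `ChunkRef`, `concat_manifests`, the file names, offsets, and lengths are all made-up placeholders.

```python
from typing import NamedTuple


class ChunkRef(NamedTuple):
    """Hypothetical record describing where one chunk's bytes live."""
    path: str
    offset: int
    length: int


def concat_manifests(manifests):
    """Combine manifests along axis 0, assuming each holds a single step on that axis."""
    combined = {}
    for step, manifest in enumerate(manifests):
        for index, ref in manifest.items():
            # Renumber the first chunk index; the references themselves are unchanged.
            combined[(step, *index[1:])] = ref
    return combined


# One chunk per file, keyed by chunk index (placeholder values).
day1 = {(0, 0): ChunkRef("s3://example-bucket/day1.nc", offset=1000, length=200)}
day2 = {(0, 0): ChunkRef("s3://example-bucket/day2.nc", offset=1000, length=200)}

combined = concat_manifests([day1, day2])
for index, ref in sorted(combined.items()):
    print(index, ref.path)
```

Only the small reference records are rearranged here; the bytes they point to are never read or copied, which is the property the paragraph above describes.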

Note

Currently, Icechunk supports virtual references to data stored in S3-compatible, GCS, Azure, HTTP/HTTPS, and local storage backends.

Security considerations with virtual chunks

Virtual chunks let Icechunk point to external locations (s3://, http://, file://, etc.), which means a malicious repo could try to trick your code into reading sensitive data from your machine or other sources.

To protect you, Icechunk is safe by default: it won't read from these locations unless you explicitly allow it. This requires (1) defining trusted virtual chunk containers when writing data, and (2) passing authorize_virtual_chunk_access when opening a repo, so you stay in control of what external paths get accessed.
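The opt-in model can be pictured as a prefix allow-list. The sketch below is not Icechunk's implementation; `is_authorized` and the example URLs are hypothetical and only illustrate why a chunk location outside every authorized prefix would be refused.

```python
def is_authorized(chunk_url: str, authorized_prefixes: set) -> bool:
    """Return True if chunk_url falls under one of the explicitly authorized prefixes."""
    return any(chunk_url.startswith(prefix) for prefix in authorized_prefixes)


# Prefixes the reader has chosen to trust (hypothetical example).
authorized = {"s3://mybucket/my/data/"}

print(is_authorized("s3://mybucket/my/data/file.nc", authorized))  # True
print(is_authorized("file:///etc/passwd", authorized))             # False
```

With an empty set of authorized prefixes, every location is rejected, which mirrors the safe-by-default behavior described above.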

Creating a virtual dataset with VirtualiZarr#

We are going to create a virtual dataset pointing to all of the OISST data for August 2024. This data is distributed publicly as netCDF files on AWS S3, with one netCDF file containing the Sea Surface Temperature (SST) data for each day of the month. We are going to use VirtualiZarr to combine all of these files into a single virtual dataset spanning the entire month, then write that dataset to Icechunk for use in analysis.

Before we get started, we need to install virtualizarr (version 2.4.0 or later) and icechunk. We also need to install fsspec and s3fs for discovering files on s3, and h5py for working with netCDF files.

pip install "virtualizarr>=2.4.0" icechunk fsspec s3fs h5py

First, we need to find all of the files we are interested in. We can do this with fsspec using a glob expression to find every netcdf file in the August 2024 folder in the bucket:

import fsspec

fs = fsspec.filesystem('s3', anon=True)

oisst_files = fs.glob('s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.*.nc')

oisst_files = sorted(['s3://'+f for f in oisst_files])
#['s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240801.nc',
# 's3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240802.nc',
# 's3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240803.nc',
# 's3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr/202408/oisst-avhrr-v02r01.20240804.nc',
#...
#]
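As a sanity check on the glob result, it can help to build the list of URLs you expect and compare. The sketch below assumes one file per day named with a plain YYYYMMDD date, following the glob pattern above; `expected_oisst_urls` is a hypothetical helper, and files with other suffixes would not be produced by it.

```python
from datetime import date, timedelta

BUCKET_PREFIX = "s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/data/v2.1/avhrr"


def expected_oisst_urls(year: int, month: int) -> list:
    """Build one URL per day of the given month (assumed naming convention)."""
    day = date(year, month, 1)
    urls = []
    while day.month == month:
        urls.append(f"{BUCKET_PREFIX}/{day:%Y%m}/oisst-avhrr-v02r01.{day:%Y%m%d}.nc")
        day += timedelta(days=1)
    return urls


expected = expected_oisst_urls(2024, 8)
print(len(expected))  # 31
print(expected[0])
```

Comparing `set(expected)` with `set(oisst_files)` would then reveal any missing or unexpectedly named files before the slower metadata scan begins.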

VirtualiZarr uses obstore to access remote files, and we need to create an ObjectStoreRegistry containing an obstore store for this bucket.

from obstore.store import from_url
from obspec_utils.registry import ObjectStoreRegistry

bucket = "s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/"
store = from_url(bucket, region="us-east-1", skip_signature=True)
registry = ObjectStoreRegistry({bucket: store})

These are netCDF4 files, which are really HDF5 files, so we need to use VirtualiZarr's HDFParser.

from virtualizarr.parsers import HDFParser

Now that we have the filenames of the data we need, a way to access them, and a way to parse their contents, we can create virtual datasets with VirtualiZarr. This may take a minute, as it needs to fetch all the metadata from all the files.

from virtualizarr import open_virtual_dataset

virtual_datasets = [
    open_virtual_dataset(url, registry=registry, parser=HDFParser())
    for url in oisst_files
]

We can now use xarray to combine these virtual datasets into one large virtual dataset (for more details on this operation, see VirtualiZarr's documentation). We know that each of our files shares the same structure but covers a different date, so we are going to concatenate these datasets along the time dimension.

import xarray as xr

virtual_ds = xr.concat(
    virtual_datasets,
    dim='time',
    coords='minimal',
    compat='override',
    combine_attrs='override'
)

#<xarray.Dataset> Size: 257MB
#Dimensions:  (time: 31, zlev: 1, lat: 720, lon: 1440)
#Coordinates:
#    time     (time) float32 124B ManifestArray<shape=(31,), dtype=float32, ch...
#    lat      (lat) float32 3kB ManifestArray<shape=(720,), dtype=float32, chu...
#    zlev     (zlev) float32 4B ManifestArray<shape=(1,), dtype=float32, chunk...
#    lon      (lon) float32 6kB ManifestArray<shape=(1440,), dtype=float32, ch...
#Data variables:
#    sst      (time, zlev, lat, lon) int16 64MB ManifestArray<shape=(31, 1, 72...
#    anom     (time, zlev, lat, lon) int16 64MB ManifestArray<shape=(31, 1, 72...
#    ice      (time, zlev, lat, lon) int16 64MB ManifestArray<shape=(31, 1, 72...
#    err      (time, zlev, lat, lon) int16 64MB ManifestArray<shape=(31, 1, 72...

We have a virtual dataset with 31 timestamps! One hint that this worked correctly is that the readout shows the variables and coordinates as ManifestArray instances, the representation that VirtualiZarr uses for virtual arrays. Let's create an Icechunk repo to write this dataset to in the oisst directory on our local filesystem.

Note

Take note of the VirtualChunkContainer passed into the RepositoryConfig when creating the repository. We specify the storage configuration necessary to access the anonymous S3 bucket that holds the OISST netCDF files, along with credentials that match. This maps the container's URL prefix to the credentials used to access it. For more configuration options, see the configuration page.

import icechunk as ic

storage = ic.local_filesystem_storage(
    path='oisst',
)

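# URL prefix of the bucket that holds the OISST netCDF files (must end with "/")
oisst_prefix = "s3://noaa-cdr-sea-surface-temp-optimum-interpolation-pds/"

config = ic.config.RepositoryConfig.default()
config.set_virtual_chunk_container(
    ic.virtual.VirtualChunkContainer(oisst_prefix, ic.storage.s3_store(region="us-east-1"))
)
# Anonymous credentials, keyed by the same URL prefix as the container
credentials = ic.credentials.containers_credentials(
    {oisst_prefix: ic.credentials.s3_credentials(anonymous=True)}
)
repo = ic.Repository.create(storage, config, credentials)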

Note

The updated configuration will only be persisted across sessions if you call repo.save_config. This is therefore recommended, so that users reading the virtual chunks in later sessions do not also have to set the virtual containers.

With the repo created and the virtual chunk container added, let's write our virtual dataset to Icechunk with VirtualiZarr!

session = repo.writable_session("main")
virtual_ds.vz.to_icechunk(session.store)

The refs are written, so let's save our progress by committing to the store.

Note

Your commit hash will be different! For more on the version control features of Icechunk, see the version control page.

session.commit("My first virtual store!")

# 'THAJHTYQABGD2B10D5C0'

Now we can read the dataset from the store using xarray to confirm everything went as expected. xarray reads directly from the Icechunk store because it is a fully compliant zarr Store instance.

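ds = xr.open_zarr(
    session.store,
    zarr_format=3,       # newer xarray releases use zarr_format in place of zarr_version
    consolidated=False,  # do not look for consolidated metadata
    chunks={},
)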

#<xarray.Dataset> Size: 1GB
#Dimensions:  (lon: 1440, time: 31, zlev: 1, lat: 720)
#Coordinates:
#  * lon      (lon) float32 6kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
#  * zlev     (zlev) float32 4B 0.0
#  * time     (time) datetime64[ns] 248B 2024-08-01T12:00:00 ... 2024-08-31T12...
#  * lat      (lat) float32 3kB -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
#Data variables:
#    sst      (time, zlev, lat, lon) float64 257MB dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
#    ice      (time, zlev, lat, lon) float64 257MB dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
#    anom     (time, zlev, lat, lon) float64 257MB dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
#    err      (time, zlev, lat, lon) float64 257MB dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>

Success! We have created our full dataset with 31 timesteps spanning the month of August, all with virtual references to pre-existing data files in object storage. This means we can now version control our dataset, allowing us to update it and roll it back to a previous version without copying or moving any data from the original files.
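The two readouts report different sizes (257MB for the virtual dataset, 1GB once opened from the store). The arithmetic below shows where those numbers come from, using only the shape and dtypes printed in the readouts. The explanation that xarray decodes the packed int16 values to float64 on read is an assumption based on the dtype change shown there.

```python
import math

shape = (31, 1, 720, 1440)  # (time, zlev, lat, lon), as printed in the readouts
n_elements = math.prod(shape)

int16_bytes = n_elements * 2    # per variable, as stored in the netCDF files
float64_bytes = n_elements * 8  # per variable, as reported after opening

print(n_elements)         # 32140800
print(int16_bytes)        # 64281600, about 64MB
print(float64_bytes)      # 257126400, about 257MB
print(4 * int16_bytes)    # 257126400, about 257MB for the four virtual variables
print(4 * float64_bytes)  # 1028505600, about 1GB for the four opened variables
```

These are nominal in-memory sizes. With `chunks={}` the variables are lazy dask arrays, so nothing of this size is loaded until values are actually computed.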

Finally, let's make a plot of the sea surface temperature!

ds.sst.isel(time=26, zlev=0).plot(x='lon', y='lat', vmin=0)

[Figure: map of OISST sea surface temperature for the selected day]

Appending to an existing store#

You can append new virtual data to an existing Icechunk store using to_icechunk's append_dim argument. See the VirtualiZarr docs on appending for details.

You can also write to a specific region of an existing store using to_icechunk's region argument.

Note

Users of the repo will need to enable the virtual chunk container by passing the authorize_virtual_chunk_access argument to Repository.open. This way, the repo user flags the container as authorized. The argument must be a dict using url prefixes as keys and explicit credentials as values. If the container requires no credentials, use the icechunk.credentials.LocalFileSystemAccess (file://) or icechunk.credentials.HttpAccess (http(s)://) sentinel as the value. Failing to authorize a container will generate an error when a chunk is fetched from it.

Passing None as the value is deprecated and will be unsupported in a future release; use an explicit credential or no-auth sentinel instead.

Relative Virtual Chunk Containers#

By default, every virtual chunk stores a full absolute URL like s3://bucket/prefix/path/to/chunk.nc. However, this locks in the location of the chunk: if you move your data to a different bucket or cloud provider, you would need to rewrite all the chunk reference URLs.

To address this, you can give a VirtualChunkContainer a name. Chunks can then use vcc:// relative URLs instead of full absolute paths:

config = ic.config.RepositoryConfig.default()
config.set_virtual_chunk_container(
    ic.virtual.VirtualChunkContainer(
        "s3://mybucket/my/data/",
        ic.storage.s3_store(region="us-east-1"),
        name="my-data",  # give the container a name
    )
)

With a named container, chunk locations can be written as vcc://my-data/file.nc instead of s3://mybucket/my/data/file.nc. The vcc:// scheme tells Icechunk to look up the container by name and prepend its url_prefix to resolve the full path at read time.

Now, if you move the underlying data, you only need to update the VCC's url_prefix in the config; all the relative refs will still resolve correctly without rewriting manifests.
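The lookup described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Icechunk's resolver; `resolve_location`, the `containers` mapping, and the second bucket name are hypothetical.

```python
def resolve_location(location: str, containers: dict) -> str:
    """Resolve vcc://<name>/<relative path> against named URL prefixes.

    Absolute locations (anything not starting with vcc://) are returned unchanged.
    """
    scheme = "vcc://"
    if not location.startswith(scheme):
        return location
    name, _, relative_path = location[len(scheme):].partition("/")
    return containers[name] + relative_path


# Container name -> url_prefix (prefixes end with a trailing slash).
containers = {"my-data": "s3://mybucket/my/data/"}
print(resolve_location("vcc://my-data/file.nc", containers))
# s3://mybucket/my/data/file.nc

# After moving the data, only the prefix changes; the stored reference does not.
containers["my-data"] = "gcs://newbucket/archive/"
print(resolve_location("vcc://my-data/file.nc", containers))
# gcs://newbucket/archive/file.nc
```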

Note

Container names must be unique within a repository. Authorization still uses URL prefixes (not names), so named containers do not change the security model.

Virtual Reference API#

While VirtualiZarr is the easiest way to create virtual datasets with Icechunk, the Store API that it uses to create the datasets in Icechunk is public. IcechunkStore contains a set_virtual_ref method that specifies a virtual ref for a specified chunk. It also has a set_virtual_refs method for setting many virtual chunk references at once.
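A virtual ref is specified by a location plus an offset and a length, as the examples in the following sections show. The sketch below illustrates what that byte range means using a temporary local file; `read_virtual_chunk` is a hypothetical helper written for this illustration, not part of Icechunk.

```python
import os
import tempfile


def read_virtual_chunk(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes starting at byte `offset` of the file at `path`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.bin")
    # A stand-in for an archival file: header, one chunk of data, trailer.
    with open(path, "wb") as f:
        f.write(b"HEADER" + b"chunk-bytes" + b"TRAILER")
    chunk = read_virtual_chunk(path, offset=6, length=11)

print(chunk)  # b'chunk-bytes'
```

The surrounding header and trailer are never touched, which is why a virtual chunk can live inside a larger netCDF or HDF5 file without that file being modified.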

URL prefix must end with a trailing slash

When creating a VirtualChunkContainer, the URL prefix must end with a trailing slash (/). For example, use "s3://mybucket/my/data/" not "s3://mybucket/my/data".
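One way to see why the trailing slash is useful, assuming locations are compared against the prefix as plain strings (an assumption about the mechanism, not something stated in this guide): without the slash, a prefix also matches sibling paths that merely start with the same characters. The `my/database` path below is a hypothetical example.

```python
data_url = "s3://mybucket/my/data/file.nc"
other_url = "s3://mybucket/my/database/file.nc"  # hypothetical sibling path

without_slash = "s3://mybucket/my/data"
with_slash = "s3://mybucket/my/data/"

print(other_url.startswith(without_slash))  # True: unintended match
print(other_url.startswith(with_slash))     # False
print(data_url.startswith(with_slash))      # True
```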

Virtual Reference Storage Support#

Icechunk supports several types of storage for virtual references. The four below are covered with examples:

S3 Compatible#

References to files accessible via S3 compatible storage.

Example#

Here is how we can set the chunk at key c/0 to point to a file in an S3 bucket, mybucket, at the path my/data/file.nc:

config = ic.config.RepositoryConfig.default()
config.set_virtual_chunk_container(ic.virtual.VirtualChunkContainer("s3://mybucket/my/data/", ic.storage.s3_store(region="us-east-1")))
repo = ic.Repository.create(storage, config)
session = repo.writable_session("main")
session.store.set_virtual_ref('c/0', 's3://mybucket/my/data/file.nc', offset=1000, length=200)

Configuration#

S3 virtual references require configuring credentials so that the store can access the specified S3 bucket. See the configuration docs for instructions.

GCS#

References to files accessible on Google Cloud Storage.

Example#

Here is how we can set the chunk at key c/0 to point to a file in a GCS bucket, mybucket, at the path my/data/file.nc:

config = ic.config.RepositoryConfig.default()
config.set_virtual_chunk_container(ic.virtual.VirtualChunkContainer("gcs://mybucket/my/data/", ic.storage.gcs_store(opts={})))
repo = ic.Repository.create(storage, config)
session = repo.writable_session("main")
session.store.set_virtual_ref('c/0', 'gcs://mybucket/my/data/file.nc', offset=1000, length=200)

HTTP#

References to files accessible via the HTTP(S) protocol.

Example#

Here is how we can set the chunk at key c/0 to point to a file on myserver, at the path my/data/file.nc:

config = ic.config.RepositoryConfig.default()
config.set_virtual_chunk_container(ic.virtual.VirtualChunkContainer("https://myserver/my/data/", ic.storage.http_store(opts={})))
repo = ic.Repository.create(storage, config)
session = repo.writable_session("main")
session.store.set_virtual_ref('c/0', 'https://myserver/my/data/file.nc', offset=1000, length=200)

To read these references back, a repo user authorizes the container with the icechunk.credentials.HttpAccess sentinel (HTTP containers need no credentials):

repo = ic.Repository.open(
    storage,
    authorize_virtual_chunk_access={"https://myserver/my/data/": ic.credentials.HttpAccess},
)

Local Filesystem#

References to files accessible via the local filesystem. At this time, all file paths must be absolute.

Example#

Here is how we can set the chunk at key c/0 to point to a file on my local filesystem located at /path/to/my/file.nc:

config = ic.config.RepositoryConfig.default()
config.set_virtual_chunk_container(ic.virtual.VirtualChunkContainer("file:///path/to/my/", ic.storage.local_filesystem_store("/path/to/my")))
repo = ic.Repository.create(storage, config)
session = repo.writable_session("main")
session.store.set_virtual_ref('c/0', 'file:///path/to/my/file.nc', offset=20, length=100)

No extra configuration is necessary for local filesystem references. To read these references back, a repo user authorizes the container with the icechunk.credentials.LocalFileSystemAccess sentinel (local-filesystem containers need no credentials):

repo = ic.Repository.open(
    storage,
    authorize_virtual_chunk_access={"file:///path/to/my/": ic.credentials.LocalFileSystemAccess},
)

Virtual Reference File Format Support#

Icechunk supports storing virtual references to any format that VirtualiZarr can parse. VirtualiZarr ships with parsers for a range of formats, including HDF5, netCDF4, netCDF3, TIFF/GeoTIFF, and Zarr (v2 and v3). You can also write your own custom parser for virtualizing other file formats.

Support for other common filetypes is under development within the VirtualiZarr project. Below are some relevant issues: