Performance

Info

This is advanced material, and you will need it only if you have arrays with more than a million chunks. Icechunk aims to provide an excellent experience out of the box.

Scalability

Icechunk is designed to be cloud native, so it can take advantage of the horizontal scaling that cloud providers offer. To learn more, check out this blog post, which explores how well Icechunk performs when paired with AWS S3.

Preloading manifests

Coming Soon.

Splitting manifests

Icechunk stores chunk references in chunk manifest files under manifests/. By default, all chunk references for an array go into a single manifest file, so for very large arrays (millions of chunks) that file can get quite large. Requesting even a single chunk requires downloading the entire manifest, which in some cases results in a slow time-to-first-byte or large memory usage.

Note

The chunk sizes in the following examples are tiny, for demonstration purposes only.

To avoid downloading one large manifest for every request, Icechunk lets you split the manifest files by specifying a ManifestSplittingConfig.

import icechunk as ic
from icechunk import ManifestSplitCondition, ManifestSplittingConfig, ManifestSplitDimCondition

split_config = ManifestSplittingConfig.from_dict(
    {
        ManifestSplitCondition.AnyArray(): {
            ManifestSplitDimCondition.DimensionName("time"): 365 * 24
        }
    }
)
repo_config = ic.RepositoryConfig(manifest=ic.ManifestConfig(splitting=split_config))

Then pass the config to Repository.open or Repository.create:

repo = ic.Repository.open(..., config=repo_config)

This particular example splits manifests so that each manifest file holds the references for 365 * 24 chunks along the time dimension and for every chunk along all other dimensions.
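To get a feel for what a split size means in practice, the arithmetic can be sketched in plain Python. The helper `count_manifests` and the chunk-grid shape below are hypothetical, chosen only for illustration. The sketch assumes that an unsplit dimension keeps all of its chunks in the same manifest and that the last manifest along a split dimension may be only partially filled.

```python
import math


def count_manifests(chunk_grid_shape, split_sizes):
    """Estimate how many manifest files a split produces (illustrative only).

    chunk_grid_shape: number of chunks along each dimension.
    split_sizes: chunks per manifest along each dimension, or None to keep
        every chunk along that dimension in the same manifest.
    """
    total = 1
    for n_chunks, size in zip(chunk_grid_shape, split_sizes):
        if size is not None:
            # A trailing, partially filled manifest still counts as one file.
            total *= math.ceil(n_chunks / size)
    return total


# Hypothetical array: 10 years of hourly time chunks on a 4 x 8 spatial chunk grid.
grid = (10 * 365 * 24, 4, 8)

# Split only along time, 365 * 24 chunks per manifest, as in the config above.
n_manifests = count_manifests(grid, (365 * 24, None, None))

# Each manifest then references one year of time chunks times the full spatial grid.
refs_per_manifest = 365 * 24 * 4 * 8

print(n_manifests, refs_per_manifest)
```

Under these assumptions, reading a single chunk would touch one manifest covering a year of data rather than one covering the whole array.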

Besides ManifestSplitCondition.AnyArray(), which the example above uses to match every array, the options for specifying the arrays whose manifests you want to split are:

  1. ManifestSplitCondition.name_matches takes a regular expression used to match an array's name;
  2. ManifestSplitCondition.path_matches takes a regular expression used to match an array's path;
  3. ManifestSplitCondition.and_conditions to combine (1), (2), and (4) together; and
  4. ManifestSplitCondition.or_conditions to combine (1), (2), and (3) together.

and_conditions and or_conditions may be used to combine multiple path and/or name matches. For example,

array_condition = ManifestSplitCondition.or_conditions(
    [
        ManifestSplitCondition.name_matches("temperature"),
        ManifestSplitCondition.name_matches("salinity"),
    ]
)
sconfig = ManifestSplittingConfig.from_dict(
    {array_condition: {ManifestSplitDimCondition.DimensionName("longitude"): 3}}
)

Options for specifying how to split along a specific axis or dimension are:

  1. ManifestSplitDimCondition.Axis takes an integer axis;
  2. ManifestSplitDimCondition.DimensionName takes a regular expression used to match the dimension names of the array;
  3. ManifestSplitDimCondition.Any matches any remaining dimension name or axis.

For example, for an array with dimensions time, latitude, longitude, the following config

from icechunk import ManifestSplitDimCondition

{
    ManifestSplitDimCondition.DimensionName("longitude"): 3,
    ManifestSplitDimCondition.Axis(1): 2,
    ManifestSplitDimCondition.Any(): 1,
}

will split manifests so that each manifest file contains 3 longitude chunks x 2 latitude chunks x 1 time chunk = 6 chunks.

Note

Python dictionaries preserve insertion order, so the first condition encountered takes priority.
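The priority rule can be illustrated with a small plain-Python model. This is not Icechunk's implementation: the tuple encoding of conditions and the `resolve_split_sizes` helper are hypothetical, and the sketch assumes that a dimension-name pattern must match the whole name and that a dimension matched by no condition is left unsplit.

```python
import re


def resolve_split_sizes(dim_names, conditions):
    """Return the split size chosen for each axis (None if nothing matches).

    conditions is an ordered list of (kind, value, size) tuples, where kind is
    "axis", "name", or "any". For each axis, the first matching entry wins.
    """
    sizes = []
    for axis, name in enumerate(dim_names):
        chosen = None
        for kind, value, size in conditions:
            if (
                kind == "any"
                or (kind == "axis" and value == axis)
                or (kind == "name" and re.fullmatch(value, name))
            ):
                chosen = size
                break  # earlier entries shadow later ones
        sizes.append(chosen)
    return sizes


dims = ["time", "latitude", "longitude"]

# Same ordering as the config above: longitude by name, axis 1, then a catch-all.
conditions = [("name", "longitude", 3), ("axis", 1, 2), ("any", None, 1)]
ordered = resolve_split_sizes(dims, conditions)
print(ordered)  # [1, 2, 3]

# Putting the catch-all first makes it shadow the more specific entries.
shadowed = resolve_split_sizes(dims, [("any", None, 1)] + conditions[:2])
print(shadowed)  # [1, 1, 1]
```

In this model, placing a catch-all condition last lets the more specific conditions apply first, which mirrors the ordering used in the example config.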