Performance
Info
This is advanced material, and you will need it only if you have arrays with more than a million chunks. Icechunk aims to provide an excellent experience out of the box.
Scalability
Icechunk is designed to be cloud native, making it able to take advantage of the horizontal scaling of cloud providers. To learn more, check out this blog post which explores just how well Icechunk can perform when matched with AWS S3.
Preloading manifests
Coming Soon.
Splitting manifests
Icechunk stores chunk references in chunk manifest files under manifests/. For very large arrays (millions of chunks), these files can get quite large. By default, Icechunk stores all chunk references in a single manifest file per array, so requesting even a single chunk requires downloading the entire manifest. In some cases, this can result in a slow time-to-first-byte or large memory usage.
Note
Note that the chunk sizes in the following examples are tiny for demonstration purposes.
To avoid that, Icechunk lets you split the manifest files by specifying a ManifestSplittingConfig.
import icechunk as ic
from icechunk import ManifestSplitCondition, ManifestSplittingConfig, ManifestSplitDimCondition

split_config = ManifestSplittingConfig.from_dict(
    {
        ManifestSplitCondition.AnyArray(): {
            ManifestSplitDimCondition.DimensionName("time"): 365 * 24
        }
    }
)
repo_config = ic.RepositoryConfig(manifest=ic.ManifestConfig(splitting=split_config))
Then pass the config to Repository.open or Repository.create.
This particular example splits manifests so that each manifest file contains 365 * 24 chunks along the time dimension and every chunk along every other dimension.
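To get a feel for what a split size means in practice, the following back-of-the-envelope sketch counts the manifest files a given split would produce. It is plain Python and does not call Icechunk: the helper name `count_manifests` and the example chunk-grid shape are made up for illustration, and it assumes each manifest covers a rectangular block of the chunk grid.

```python
import math


def count_manifests(chunk_grid, split_sizes):
    """Count manifest files for a chunk grid split into blocks.

    chunk_grid: number of chunks along each dimension, e.g. {"time": 87600}.
    split_sizes: maximum chunks per manifest along a dimension; dimensions
        that are not listed stay whole (one block along that dimension).
    """
    total = 1
    for dim, n_chunks in chunk_grid.items():
        per_manifest = split_sizes.get(dim, n_chunks)
        total *= math.ceil(n_chunks / per_manifest)
    return total


# Hypothetical array: 10 years of hourly time chunks on a 4 x 8 spatial chunk grid.
grid = {"time": 10 * 365 * 24, "latitude": 4, "longitude": 8}

n_manifests = count_manifests(grid, {"time": 365 * 24})
refs_per_manifest = 365 * 24 * 4 * 8

print(n_manifests)        # 10
print(refs_per_manifest)  # 280320
```

Under these assumptions, a reader asking for one chunk would fetch a manifest holding one tenth of the array's chunk references instead of all of them.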
Options for specifying the arrays whose manifest you want to split are:
1. ManifestSplitCondition.name_matches takes a regular expression used to match an array's name;
2. ManifestSplitCondition.path_matches takes a regular expression used to match an array's path;
3. ManifestSplitCondition.and_conditions to combine (1), (2), and (4) together; and
4. ManifestSplitCondition.or_conditions to combine (1), (2), and (3) together.
And and Or may be used to combine multiple path and/or name matches. For example,
array_condition = ManifestSplitCondition.or_conditions(
    [
        ManifestSplitCondition.name_matches("temperature"),
        ManifestSplitCondition.name_matches("salinity"),
    ]
)
sconfig = ManifestSplittingConfig.from_dict(
    {array_condition: {ManifestSplitDimCondition.DimensionName("longitude"): 3}}
)
Options for specifying how to split along a specific axis or dimension are:
1. ManifestSplitDimCondition.Axis takes an integer axis;
2. ManifestSplitDimCondition.DimensionName takes a regular expression used to match the dimension names of the array;
3. ManifestSplitDimCondition.Any matches any remaining dimension name or axis.
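For example, for an array with dimensions time, latitude, longitude, the following config places 3 longitude chunks, 2 latitude chunks (axis 1), and 1 time chunk in each manifest, for 3 x 2 x 1 = 6 chunk references per manifest file: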
from icechunk import ManifestSplitDimCondition

{
    ManifestSplitDimCondition.DimensionName("longitude"): 3,
    ManifestSplitDimCondition.Axis(1): 2,
    ManifestSplitDimCondition.Any(): 1,
}
Note
Python dictionaries preserve insertion order, so the first condition encountered takes priority.
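The first-match rule in the note can be modelled in a few lines of plain Python. This is only a sketch of the priority logic, not Icechunk's implementation: the tuple encoding of conditions and the `resolve_split_sizes` helper are hypothetical, and matching dimension names with `re.fullmatch` is an assumption.

```python
import re


def resolve_split_sizes(dim_names, conditions):
    """Pick a split size per dimension; the first matching condition wins.

    conditions: dict mapping ("name", regex), ("axis", index), or ("any",)
        to a split size, checked in insertion order.
    """
    sizes = {}
    for axis, dim in enumerate(dim_names):
        for cond, size in conditions.items():
            kind = cond[0]
            if (
                (kind == "name" and re.fullmatch(cond[1], dim))
                or (kind == "axis" and cond[1] == axis)
                or kind == "any"
            ):
                sizes[dim] = size
                break  # later conditions are ignored for this dimension
    return sizes


# Same ordering as the config above, encoded as plain tuples.
conditions = {
    ("name", "longitude"): 3,
    ("axis", 1): 2,
    ("any",): 1,
}

sizes = resolve_split_sizes(["time", "latitude", "longitude"], conditions)
print(sizes)  # {'time': 1, 'latitude': 2, 'longitude': 3}
```

In this model, moving the catch-all entry to the front of the dictionary would make it win for every dimension, so the more specific conditions should come first.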