GeoTIFF backend internals: contract inventory
Phase 1 deliverable for the GeoTIFF refactor epic (issue #2211). This page captures the current shape of the GeoTIFF read and write backends and shows which contract steps already route through shared helpers and which are still implemented inline per backend. It is the reference future patches use when deciding whether a fix belongs in the shared helper or in a single backend.
This is developer-facing internal documentation. Nothing here is part of the
public API. Files referenced live under xrspatial/geotiff/.
Backend entry points
Read

| Entry point | File | Returns |
|---|---|---|
| | | dispatcher (NumPy / CuPy / Dask / Dask+CuPy / VRT) |
| | | Dask-NumPy DataArray |
| | | CuPy or Dask-CuPy DataArray |
| | | NumPy / CuPy / Dask DataArray (mosaic) |
Write

| Entry point | File | Input |
|---|---|---|
| | | NumPy / Dask DataArray (auto-dispatches to GPU when input is CuPy-backed) |
| | | CuPy DataArray |
| | | list of GeoTIFF paths (XML emitter) |
Contract steps
Every read backend executes the same eight logical steps between the source kwarg and the returned DataArray. Every writer backend executes the inverse of steps 1, 3, 6, and 7; steps 4 and 5 (decode, orientation) have no write analogue. The eight read steps:
1. Source / kwarg validation. Reject conflicting flags, bad chunks, unsupported overview_level types, unsafe URLs, file-like buffers paired with GPU/dask, etc.
2. Metadata parse. Read the IFD chain, geokeys, geotags, and any .tif.ovr sidecar. Resolve the requested overview level into a single IFD.
3. Transform / georef classification. Decide whether the file is fully georeferenced, transform-only, CRS-only, no-georef, or rotated. Derive the georef_status value.
4. Pixel decode. Strip or tile decompress, applying the codec, predictor, sample format, and planar config.
5. Orientation / photometric handling. Apply the TIFF orientation tag and the MinIsWhite photometric inversion.
6. Nodata mask + dtype cast. Replace the sentinel with NaN (promoting integer dtypes to float64 when needed), honour mask_nodata=False, apply any caller dtype= cast.
7. Attrs finalization. Stamp the canonical transform, crs, crs_wkt, nodata, masked_nodata, nodata_pixels_present, nodata_dtype_cast, georef_status, rotated_affine, vrt_holes, and rich-tag attrs.
8. DataArray construction. Build coords (y, x, optional band) and wrap the buffer in an xr.DataArray.
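As an illustration of step 3, the five georeferencing cases can be told apart from just the affine transform and the CRS. The sketch below is hypothetical: the function name classify_georef and the status labels are assumptions made for this example, since this page does not list the actual georef_status values or the shared helper's signature.

```python
from typing import Optional, Tuple

# Affine coefficients in (a, b, c, d, e, f) order, where
#   x = a * col + b * row + c
#   y = d * col + e * row + f
Affine6 = Tuple[float, float, float, float, float, float]


def classify_georef(transform: Optional[Affine6], crs: Optional[str]) -> str:
    """Return an illustrative status label (labels are assumptions)."""
    if transform is not None:
        _, b, _, d, _, _ = transform
        # Non-zero shear terms mean the grid is rotated or skewed.
        if b != 0.0 or d != 0.0:
            return "rotated"
        return "full" if crs is not None else "transform_only"
    return "crs_only" if crs is not None else "none"
```

A north-up UTM grid with a CRS would classify as "full" here, while the same transform without a CRS would be "transform_only".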
The write contract collapses to: validate kwargs, validate DataArray dims / shape / attrs, resolve transform, resolve CRS, resolve nodata, plan output layout (tile / strip / COG / BigTIFF), encode, and write IFD bytes.
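The layout-planning stage of the write contract can be sketched as a pure function of the array shape. This is a minimal sketch, not the writer's actual logic: plan_tiled_layout and LayoutPlan are hypothetical names, and using the uncompressed size to choose BigTIFF is a conservative assumption (classic TIFF stores offsets as unsigned 32-bit integers, so files past 4 GiB need BigTIFF). The multiple-of-16 check follows the TIFF 6.0 rule for tile dimensions.

```python
from dataclasses import dataclass

# Largest byte offset a classic (non-Big) TIFF can address.
CLASSIC_TIFF_LIMIT = 2**32 - 1


@dataclass(frozen=True)
class LayoutPlan:
    tiles_across: int
    tiles_down: int
    bigtiff: bool


def plan_tiled_layout(height, width, bands, itemsize, tile_size=256):
    """Plan a square-tiled layout for a (bands, height, width) raster."""
    if min(height, width, bands, itemsize, tile_size) <= 0:
        raise ValueError("all sizes must be positive")
    if tile_size % 16 != 0:
        raise ValueError("TIFF tile dimensions must be multiples of 16")
    # Ceiling division: edge tiles are padded up to the full tile size.
    tiles_across = -(-width // tile_size)
    tiles_down = -(-height // tile_size)
    # Assumption: estimate from uncompressed bytes, ignoring compression.
    raw_bytes = height * width * bands * itemsize
    return LayoutPlan(tiles_across, tiles_down, raw_bytes > CLASSIC_TIFF_LIMIT)
```

Under these assumptions a 1000 x 1000 single-band float32 raster needs a 4 x 4 tile grid and fits in classic TIFF, while a 40000 x 40000 one crosses the 4 GiB threshold.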
Per-backend coverage
Cells mark each contract step against each backend as shared (uses the helper named in parentheses), duplicated (helper exists but the backend still inlines the logic, or runs an extra inline check on top of the helper), or N/A (the step does not apply to that backend). Documented divergences are explicit deviations the refactor is keeping for now and that the call-site comments justify.
Read backends

| Step | | | | | | |
|---|---|---|---|---|---|---|
| 1. source / kwarg validation | shared ( | shared ( | shared ( | shared ( | shared ( | shared ( |
| 2. metadata parse | shared ( | shared ( | shared ( | shared ( | duplicated ( | duplicated ( |
| 3. transform / georef classification | shared ( | shared ( | shared ( | shared ( | shared ( | shared ( |
| 4. pixel decode | shared ( | shared (per-chunk | duplicated (inline GDS / KvikIO / nvCOMP path with CPU fallback via | duplicated (inline GDS + per-chunk delayed; HTTP / fsspec / stripped layouts fall back to | duplicated ( | duplicated (per-chunk |
| 5. orientation / photometric | shared ( | shared (per chunk via | shared on CPU-fallback ( | shared on CPU-fallback; duplicated on disk-to-GPU per-chunk path ( | duplicated (inline NaN masking in | duplicated (per chunk same as eager VRT) |
| 6. nodata mask + dtype cast | shared ( | duplicated (per-chunk mask inline in | shared ( | duplicated (per-chunk mask inline in | duplicated ( | duplicated (per-chunk integer sentinel mask via |
| 7. attrs finalization | shared ( | shared ( | shared ( | shared ( | shared ( | shared ( |
| 8. DataArray construction | shared ( | duplicated (assembles dask graph + builds | shared ( | duplicated (assembles cupy dask graph + builds | duplicated (builds | duplicated (per-chunk delayed graph + inline |
Write backends
The TIFF write contract is the inverse of the read contract: validate the
DataArray, resolve transform / CRS / nodata from the attrs, lay out the
output, encode, and emit bytes. Steps 4 and 5 (decode, orientation) have no
write analogue; to_geotiff and write_geotiff_gpu always emit
Orientation = 1 and rely on the writer assembler (_writer.write) for
photometric handling.
| Step | | | |
|---|---|---|---|
| 1. source / kwarg validation | shared ( | shared ( | shared ( |
| 2. metadata parse | N/A (no source to parse; reads attrs off the DataArray) | N/A | duplicated (reads geokeys from the first source file to inherit CRS / nodata; lives in |
| 3. transform / georef classification | shared ( | shared (same helpers as CPU eager) | duplicated (VRT XML emitter reads transform from the first source; CRS resolved via |
| 4. pixel decode | N/A | N/A | N/A |
| 5. orientation / photometric | duplicated (writer assembler | duplicated (same; emits Orientation=1) | N/A |
| 6. nodata mask + dtype cast | shared ( | shared ( | duplicated (VRT carries the per-band |
| 7. attrs finalization | duplicated (writer pulls | duplicated (same inline pull as CPU writer) | duplicated (VRT XML emitter writes attrs the source files carry) |
| 8. DataArray construction | N/A (write path); the eager writer optionally returns the path string | N/A | N/A (returns the VRT path) |
Intended flow
The Phase 2 – Phase 6 work named in #2211 moves the duplicated cells above onto the helpers below. Each future PR should land in one row of one table. New backend code added between now and then should call these helpers directly rather than re-inlining the logic.
Read flow
source kwarg
-> _validate_dispatch_kwargs (step 1; bundles _validate_overview_level_arg)
-> _validate_chunks_arg as needed
-> _read_geo_info / _parse_cog_http_meta (step 2)
-> extract_geo_info_with_overview_inheritance
-> validate_read_metadata (step 3 -- rotated / unparseable / mixed-band)
-> read_to_array (steps 4 + 5; per-chunk for dask)
-> _finalize_eager_read (steps 6 + 7 + 8 for eager backends)
OR
_finalize_lazy_read_attrs (step 7 only; caller assembles graph for 4 + 6 + 8)
Canonical helper per step:
| Step | Canonical helper |
|---|---|
| 1. kwarg validation | |
| 2. metadata parse | |
| 3. transform classification | |
| 4. pixel decode | |
| 5. orientation / photometric | |
| 6. nodata mask + dtype cast | |
| 7. attrs finalization | |
| 8. DataArray construction | |
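To make step 6 concrete, here is a minimal pure-Python sketch of sentinel masking on a flat list of pixel values. It is not the shared helper: mask_nodata_sketch is a hypothetical name, the meanings given to the masked_nodata and nodata_pixels_present attrs are assumptions inferred from their names, and the sketch always converts to float when masking is on, which simplifies the "promote when needed" rule described earlier.

```python
import math


def mask_nodata_sketch(values, nodata, mask_nodata=True):
    """Replace the nodata sentinel with NaN and report what was done.

    Returns (values, attrs). attrs mirrors a few of the step 7 attr
    names; their exact semantics here are assumptions.
    """
    present = nodata is not None and any(v == nodata for v in values)
    attrs = {
        "nodata": nodata,
        "nodata_pixels_present": present,
        "masked_nodata": False,
    }
    # Honour mask_nodata=False: hand the values back untouched.
    if nodata is None or not mask_nodata:
        return list(values), attrs
    # Integers cannot hold NaN, so every value is promoted to float.
    masked = [math.nan if v == nodata else float(v) for v in values]
    attrs["masked_nodata"] = True
    return masked, attrs
```

Keeping the mask decision and the attrs it produces in one function is the property the refactor is after: a backend that masks inline has to remember to stamp the matching attrs itself.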
Write flow
DataArray
-> _validate_3d_writer_dims, _validate_writer_spatial_shape (step 1)
-> _validate_no_rotated_affine, _validate_nodata_arg, _validate_tile_size_arg
-> _resolve_spatial_coords, _has_no_georef_marker
-> _transform_from_attr (preferred) or _coords_to_transform (fallback) (step 3)
-> _require_transform_for_georeferenced
-> _validate_crs_arg / _wkt_to_epsg / _resolve_crs_to_wkt
-> validate_write_metadata (step 3 -- conflicting CRS / nodata / non-uniform coords)
-> _resolve_nodata_attr, _should_restore_nan_sentinel (step 6)
-> _extract_rich_tags (step 7 -- attrs surface)
-> _writer.write / GPU encoder
Canonical helper per step:
| Step | Canonical helper |
|---|---|
| 1. kwarg / dim validation | |
| 3. transform / CRS / nodata | |
| 6. nodata cast / restore | |
| 7. attrs surface | |
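The fallback in the write flow, deriving a transform from coordinates when no transform attr is present, can be sketched as follows. This is an illustration under stated assumptions, not the body of _coords_to_transform: the function name is hypothetical, coordinates are assumed to mark pixel centres, and the uniform-spacing check stands in for the non-uniform coords rejection mentioned in the flow.

```python
import math


def coords_to_transform_sketch(xs, ys, rel_tol=1e-9):
    """Derive (a, b, c, d, e, f) from pixel-centre coordinate lists."""
    if len(xs) < 2 or len(ys) < 2:
        raise ValueError("need at least two coordinates per axis")

    def uniform_step(coords):
        step = coords[1] - coords[0]
        if step == 0:
            raise ValueError("zero coordinate spacing")
        # Every consecutive gap must match the first one.
        for prev, cur in zip(coords, coords[1:]):
            if not math.isclose(cur - prev, step, rel_tol=rel_tol):
                raise ValueError("non-uniform coordinate spacing")
        return step

    xres = uniform_step(xs)
    yres = uniform_step(ys)  # negative for north-up rasters
    # Shift half a pixel from the first centre to the outer corner.
    return (xres, 0.0, xs[0] - xres / 2.0, 0.0, yres, ys[0] - yres / 2.0)
```

For centres xs = [0.5, 1.5, 2.5] and ys = [9.5, 8.5, 7.5] the sketch yields a unit-resolution north-up transform with its origin at (0.0, 10.0).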
Phase plan reference

| Phase | Target rows from the per-backend tables |
|---|---|
| 1 (this doc) | none – inventory only |
| 2 | step 3 (transform / georef) across every read and write backend |
| 3 | step 6 (nodata mask + dtype cast) across every read backend |
| 4 | step 7 (attrs finalization) across every read backend |
| 5 | step 2 (metadata parse) extraction into |
| 6 | cross-backend parity tests covering every step on every backend |
See #2211 for the full epic plan. Subsequent PRs in this series each target one row of one table and land behind cross-backend parity tests.
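A cross-backend parity check can be as small as running every backend on one source and comparing each result against the first. The sketch below shows one possible shape for such a test; results_match and check_parity are hypothetical names, and each backend is modelled as a callable returning a (values, attrs) pair instead of a real DataArray.

```python
import math


def results_match(a, b):
    """Compare two (values, attrs) results, treating NaN as equal to NaN."""
    vals_a, attrs_a = a
    vals_b, attrs_b = b
    if len(vals_a) != len(vals_b) or attrs_a != attrs_b:
        return False
    return all(
        (math.isnan(x) and math.isnan(y)) or x == y
        for x, y in zip(vals_a, vals_b)
    )


def check_parity(backends, source):
    """Return the names of backends whose output differs from the first."""
    names = list(backends)
    reference = backends[names[0]](source)
    return [
        name
        for name in names[1:]
        if not results_match(reference, backends[name](source))
    ]
```

The NaN-aware comparison matters because masked nodata pixels would otherwise make two identical results compare unequal.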