GeoTIFF backend internals: contract inventory

Phase 1 deliverable for the GeoTIFF refactor epic (issue #2211). This page captures the current shape of the GeoTIFF read and write backends and shows which contract steps already route through shared helpers and which are still implemented inline per backend. It is the reference future patches use when deciding whether a fix belongs in the shared helper or in a single backend.

This is developer-facing internal documentation. Nothing here is part of the public API. Files referenced live under xrspatial/geotiff/.

Backend entry points

Read

| Entry point | File | Returns |
| --- | --- | --- |
| open_geotiff | xrspatial/geotiff/__init__.py | dispatcher (NumPy / CuPy / Dask / Dask+CuPy / VRT) |
| read_geotiff_dask | xrspatial/geotiff/_backends/dask.py | Dask-NumPy DataArray |
| read_geotiff_gpu | xrspatial/geotiff/_backends/gpu.py | CuPy or Dask-CuPy DataArray |
| read_vrt | xrspatial/geotiff/_backends/vrt.py | NumPy / CuPy / Dask DataArray (mosaic) |

Write

| Entry point | File | Input |
| --- | --- | --- |
| to_geotiff | xrspatial/geotiff/_writers/eager.py | NumPy / Dask DataArray (auto-dispatches to GPU when input is CuPy-backed) |
| write_geotiff_gpu | xrspatial/geotiff/_writers/gpu.py | CuPy DataArray |
| write_vrt | xrspatial/geotiff/_writers/vrt.py | list of GeoTIFF paths (XML emitter) |

Contract steps

Every read backend executes the same eight logical steps between the source kwarg and the returned DataArray. Every writer backend executes the inverse of steps 1, 3, 6, and 7 (steps 4 and 5, decode and orientation, have no write analogue). The eight read steps:

  1. Source / kwarg validation. Reject conflicting flags, bad chunks, unsupported overview_level types, unsafe URLs, file-like buffers paired with GPU/dask, etc.

  2. Metadata parse. Read the IFD chain, geokeys, geotags, and any .tif.ovr sidecar. Resolve the requested overview level into a single IFD.

  3. Transform / georef classification. Decide whether the file is fully georeferenced, transform-only, CRS-only, no-georef, or rotated. Derive the georef_status value.

  4. Pixel decode. Strip or tile decompress, applying the codec, predictor, sample format, and planar config.

  5. Orientation / photometric handling. Apply the TIFF orientation tag and the MinIsWhite photometric inversion.

  6. Nodata mask + dtype cast. Replace the sentinel with NaN (promoting integer dtypes to float64 when needed), honour mask_nodata=False, apply any caller dtype= cast.

  7. Attrs finalization. Stamp the canonical transform, crs, crs_wkt, nodata, masked_nodata, nodata_pixels_present, nodata_dtype_cast, georef_status, rotated_affine, vrt_holes, and rich-tag attrs.

  8. DataArray construction. Build coords (y, x, optional band) and wrap the buffer in an xr.DataArray.
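
Step 6 is among the steps most often re-implemented per backend, so a minimal NumPy sketch of its semantics may help when comparing implementations. The name `mask_nodata_sketch` is hypothetical and is not the signature of `_apply_eager_nodata_mask`; the sketch only illustrates sentinel-to-NaN replacement, integer promotion to float64, and the `mask_nodata=False` opt-out. It assumes integer input is always promoted, whereas the real helper may promote only when needed.

```python
import numpy as np


def mask_nodata_sketch(data, nodata, mask_nodata=True):
    """Illustrative stand-in for step 6; not the real helper."""
    # Opt-out, or no sentinel declared: hand the buffer back untouched.
    if not mask_nodata or nodata is None:
        return data
    hits = data == nodata
    if np.issubdtype(data.dtype, np.integer):
        # Integers cannot hold NaN, so promote before masking (assumption:
        # this sketch promotes unconditionally).
        out = data.astype(np.float64)
    else:
        out = data.copy()
    out[hits] = np.nan
    return out


raw = np.array([[1, -9999], [3, 4]], dtype=np.int16)
masked = mask_nodata_sketch(raw, nodata=-9999)
```

Here `masked` is a float64 array with NaN where the sentinel was, while calling the function with `mask_nodata=False` returns the original int16 buffer unchanged.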

The write contract collapses to: validate kwargs, validate DataArray dims / shape / attrs, resolve transform, resolve CRS, resolve nodata, plan output layout (tile / strip / COG / BigTIFF), encode, and write IFD bytes.
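
The "resolve nodata" part of the write contract includes turning NaN pixels back into the stored sentinel. A minimal sketch of that inverse of read step 6 follows, assuming a hypothetical `restore_sentinel_sketch` and a caller-supplied target dtype. The real decision lives in `_should_restore_nan_sentinel`, whose inputs are not shown on this page.

```python
import numpy as np


def restore_sentinel_sketch(data, nodata, target_dtype):
    """Hypothetical inverse of the read-side mask: NaN -> sentinel."""
    target_dtype = np.dtype(target_dtype)
    if nodata is None or not np.issubdtype(data.dtype, np.floating):
        # Nothing to restore: no sentinel, or the buffer cannot hold NaN.
        return data.astype(target_dtype, copy=False)
    # Fill NaN with the sentinel first, then cast to the on-disk dtype.
    filled = np.where(np.isnan(data), nodata, data)
    return filled.astype(target_dtype)


masked = np.array([[1.0, np.nan], [3.0, 4.0]])
restored = restore_sentinel_sketch(masked, nodata=-9999, target_dtype=np.int16)
```

Filling before casting matters: casting NaN directly to an integer dtype yields an undefined value, so the sentinel has to be written while the buffer is still floating point.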

Shared helpers

Helpers below are the canonical implementations the refactor is moving every backend onto. Cells in the per-backend tables below cite these names.

Read helpers

| Helper | Location | Owns |
| --- | --- | --- |
| _validate_dispatch_kwargs | _validation.py | step 1 |
| _validate_chunks_arg | _validation.py | step 1 (chunks-specific) |
| _validate_overview_level_arg | _validation.py | step 1 (overview-specific) |
| validate_read_metadata | _validation.py | step 1 (rotated / unparseable-CRS / mixed-band) |
| _read_geo_info | __init__.py | step 2 (metadata-only mmap parse) |
| _parse_cog_http_meta | _reader.py | step 2 (HTTP / fsspec range parse) |
| extract_geo_info_with_overview_inheritance | _geotags.py | step 2 + step 3 (overview-aware georef) |
| select_overview_ifd | _header.py | step 2 (overview-level IFD selection) |
| discover_remote_sidecar | _sidecar.py | step 2 (shared .tif.ovr discovery + IFD merge for HTTP / fsspec / local; eager + chunked) |
| read_to_array | _reader.py | steps 4 + 5 (CPU decode + orientation + MinIsWhite) |
| _apply_orientation_gpu / _apply_orientation_geo_info | _backends/_gpu_helpers.py | step 5 (GPU side) |
| _apply_eager_nodata_mask | _attrs.py | step 6 (single-sentinel mask) |
| _validate_dtype_cast | _validation.py | step 6 (cast guard) |
| _set_nodata_attrs | _attrs.py | step 7 (nodata lifecycle attrs) |
| _populate_attrs_from_geo_info | _attrs.py | step 7 (transform / crs / georef_status) |
| _validate_read_geo_info | _attrs.py | step 7 (pre-attrs validation) |
| _finalize_eager_read | _attrs.py | wraps post-decode validation (_validate_read_geo_info) + steps 6 + 7 + 8 for eager backends |
| _finalize_lazy_read_attrs | _attrs.py | wraps post-decode validation (_validate_read_geo_info) + step 7 for lazy backends |
| geo_to_coords / coords_from_geo_info / coords_from_pixel_geometry | _coords.py | step 8 (coord build) |
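
Several of the helpers above feed the georef_status attr. The five-way classification described in contract step 3 can be sketched as follows. `GeoInfoSketch`, `classify_georef`, and the returned status strings are all hypothetical; the real GeoInfo fields and the actual georef_status values are not listed on this page.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GeoInfoSketch:
    """Hypothetical, simplified stand-in for the real GeoInfo object."""
    # Affine coefficients in (a, b, c, d, e, f) order (an assumption).
    transform: Optional[Tuple[float, float, float, float, float, float]] = None
    crs_wkt: Optional[str] = None


def classify_georef(info):
    """Return one of five illustrative status labels."""
    has_transform = info.transform is not None
    has_crs = info.crs_wkt is not None
    if has_transform:
        _, b, _, d, _, _ = info.transform
        # Non-zero off-diagonal terms mean rotation or shear.
        if b != 0.0 or d != 0.0:
            return "rotated"
    if has_transform and has_crs:
        return "full"
    if has_transform:
        return "transform_only"
    if has_crs:
        return "crs_only"
    return "none"


north_up = (1.0, 0.0, 0.0, 0.0, -1.0, 10.0)
status = classify_georef(GeoInfoSketch(transform=north_up, crs_wkt="WKT"))
```

Checking rotation first keeps the rotated case from being reported as fully georeferenced, which matters because rotated files are subject to a separate policy elsewhere in the contract.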

Write helpers

| Helper | Location | Owns |
| --- | --- | --- |
| _validate_nodata_arg | _validation.py | nodata kwarg guard |
| _validate_tile_size_arg | _validation.py | tile-size guard |
| _validate_3d_writer_dims | _validation.py | 3D dim ordering guard |
| _validate_writer_spatial_shape | _validation.py | spatial shape guard |
| _validate_no_rotated_affine | _validation.py | rotated-affine policy |
| validate_write_metadata | _validation.py | conflicting CRS / nodata / non-uniform coords |
| _resolve_spatial_coords | _runtime.py | y/x coord resolution |
| _has_no_georef_marker | _coords.py | no-georef marker detection |
| _coords_to_transform | _coords.py | derive transform from coords |
| _transform_from_attr | _coords.py | parse attrs['transform'] |
| _require_transform_for_georeferenced | _coords.py | guard for missing transform |
| _validate_crs_arg / _validate_crs_fallback / _resolve_crs_to_wkt / _wkt_to_epsg | _crs.py | CRS resolution |
| _resolve_nodata_attr | _attrs.py | resolve attrs['nodata'] value |
| _should_restore_nan_sentinel | _attrs.py | NaN-restore-on-write decision |
| _extract_rich_tags | _attrs.py | rich-tag pass-through |
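
As an illustration of what a kwarg guard in this table might check, the sketch below validates a tile size. It assumes a single square `tile_size` integer and relies on the TIFF 6.0 rule that TileWidth and TileLength must be multiples of 16. The function name is hypothetical, and the real `_validate_tile_size_arg` may accept other forms or apply different limits.

```python
def validate_tile_size_sketch(tile_size):
    """Hypothetical guard; the real _validate_tile_size_arg may differ."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(tile_size, bool) or not isinstance(tile_size, int):
        raise TypeError(
            f"tile_size must be an int, got {type(tile_size).__name__}"
        )
    # TIFF 6.0 requires tile dimensions to be multiples of 16.
    if tile_size <= 0 or tile_size % 16 != 0:
        raise ValueError(
            f"tile_size must be a positive multiple of 16, got {tile_size}"
        )
    return tile_size


checked = validate_tile_size_sketch(256)
```

Keeping guards like this in one module means every writer raises the same exception type and message for the same bad input.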

Per-backend coverage

Cells mark each contract step against each backend as shared (uses the helper named in parentheses), duplicated (helper exists but the backend still inlines the logic, or runs an extra inline check on top of the helper), or N/A (the step does not apply to that backend). Documented divergences are explicit deviations the refactor is keeping for now and that the call-site comments justify.
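
One way to work with the inventory programmatically is to hold each row as a list of statuses and query it. The sketch below transcribes two rows (steps 4 and 8) from the read-backend table that follows, reducing each cell to its leading status word. The names `BACKENDS`, `COVERAGE`, and `duplicated_backends` are illustrative and do not exist in the codebase.

```python
from collections import Counter

BACKENDS = [
    "open_geotiff (eager)",
    "read_geotiff_dask",
    "read_geotiff_gpu (eager)",
    "read_geotiff_gpu (chunked)",
    "read_vrt (eager)",
    "read_vrt (chunked)",
]

# Leading status word of each cell, in BACKENDS order.
COVERAGE = {
    "4. pixel decode": [
        "shared", "shared", "duplicated",
        "duplicated", "duplicated", "duplicated",
    ],
    "8. DataArray construction": [
        "shared", "duplicated", "shared",
        "duplicated", "duplicated", "duplicated",
    ],
}


def duplicated_backends(step):
    """List the backends that still inline the logic for a given step."""
    return [b for b, s in zip(BACKENDS, COVERAGE[step]) if s == "duplicated"]


tally = {step: Counter(cells) for step, cells in COVERAGE.items()}
```

A query such as `duplicated_backends("4. pixel decode")` then gives the candidate targets for a single-row refactor PR.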

Read backends

| Step | open_geotiff (eager) | read_geotiff_dask | read_geotiff_gpu (eager) | read_geotiff_gpu (chunked) | read_vrt (eager) | read_vrt (chunked) |
| --- | --- | --- | --- | --- | --- | --- |
| 1. source / kwarg validation | shared (_validate_dispatch_kwargs then dispatches) | shared (_validate_dispatch_kwargs, _validate_chunks_arg) | shared (_validate_dispatch_kwargs, _validate_chunks_arg) | shared (_validate_dispatch_kwargs, _validate_chunks_arg) | shared (_validate_dispatch_kwargs, _validate_chunks_arg); duplicated inline overview-level / missing_sources / band_nodata value rejections | shared (_validate_dispatch_kwargs); duplicated inline overview-level / missing_sources / band_nodata value rejections |
| 2. metadata parse | shared (read_to_array -> _parse_cog_http_meta for cloud (with .tif.ovr sidecar discovery via discover_remote_sidecar), parse_header + parse_all_ifds + sidecar otherwise) | shared (_read_geo_info for local, _parse_cog_http_meta for HTTP/fsspec, both with .tif.ovr sidecar discovery via discover_remote_sidecar – #2239) | shared (extract_geo_info_with_overview_inheritance, select_overview_ifd); duplicated inline IFD + sidecar load lifted from _read_geo_info | shared (extract_geo_info_with_overview_inheritance); duplicated inline IFD + sidecar handling | duplicated (_parse_vrt + _read_vrt_internal – VRT-specific, no shared metadata parser) | duplicated (_parse_vrt + per-chunk _vrt_chunk_read) |
| 3. transform / georef classification | shared (_populate_attrs_from_geo_info via _finalize_eager_read) | shared (_populate_attrs_from_geo_info via _finalize_lazy_read_attrs) | shared (_populate_attrs_from_geo_info via _finalize_eager_read) | shared (_populate_attrs_from_geo_info via _finalize_lazy_read_attrs) | shared (_vrt_to_synthetic_geo_info -> _finalize_lazy_read_attrs); documented divergence: per-band nodata sentinel selection runs before the helper, and vrt_holes is injected through attrs_in because GeoInfo has no slot for it | shared (_vrt_to_synthetic_geo_info -> _finalize_lazy_read_attrs); same documented divergence |
| 4. pixel decode | shared (read_to_array) | shared (per-chunk read_to_array / _fetch_decode_cog_http_tiles) | duplicated (inline GDS / KvikIO / nvCOMP path with CPU fallback via read_to_array) | duplicated (inline GDS + per-chunk delayed; HTTP / fsspec / stripped layouts fall back to read_geotiff_dask) | duplicated (_read_vrt_internal._read_data per source) | duplicated (per-chunk _vrt_chunk_read decodes only sources intersecting the window) |
| 5. orientation / photometric | shared (read_to_array applies both) | shared (per chunk via read_to_array); rejects non-default orientation on HTTP COG dask path | shared on CPU-fallback (read_to_array); duplicated on pure GPU path (_apply_orientation_gpu, _apply_orientation_geo_info, inline MinIsWhite inversion) | shared on CPU-fallback; duplicated on disk-to-GPU per-chunk path (_decode_window_gpu_direct); rejects orientation != 1 in _gds_chunk_path_available | duplicated (inline NaN masking in _vrt._read_data for float sources; VRT does not carry an orientation tag) | duplicated (per chunk same as eager VRT) |
| 6. nodata mask + dtype cast | shared (_apply_eager_nodata_mask + _validate_dtype_cast via _finalize_eager_read) | duplicated (per-chunk mask inline in _delayed_read_window); shared _validate_dtype_cast on graph dtype | shared (_apply_eager_nodata_mask via _finalize_eager_read) on both stripped and tiled paths | duplicated (per-chunk mask inline in _chunk_task); shared _validate_dtype_cast | duplicated (_apply_integer_sentinel_mask_with_presence for per-band integer sentinels, plus inline float-NaN proxy and pre-cast dtype tracking); shared _validate_dtype_cast | duplicated (per-chunk integer sentinel mask via _apply_integer_sentinel_mask_with_presence); shared _validate_dtype_cast |
| 7. attrs finalization | shared (_finalize_eager_read -> _validate_read_geo_info + _populate_attrs_from_geo_info + _set_nodata_attrs) | shared (_finalize_lazy_read_attrs); documented divergence: nodata_pixels_present stays unset on lazy outputs (issue #2135) | shared (_finalize_eager_read); GPU MinIsWhite picks mask_sentinel from three local stashes (_mw_mask_nodata, _cpu_fallback_geo._mask_nodata, or raw nodata) | shared (_finalize_lazy_read_attrs); same nodata_pixels_present divergence as the CPU dask path | shared (_finalize_lazy_read_attrs); documented divergences: vrt_holes injected via attrs_in seed; per-band nodata selection runs before the helper; nodata_pixels_present stamped post-helper from a VRT-aware scan (_vrt_mask_with_presence / _vrt_scan_for_sentinel) | shared (_finalize_lazy_read_attrs); same VRT divergences as the eager VRT path |
| 8. DataArray construction | shared (_finalize_eager_read builds xr.DataArray with coords_from_geo_info) | duplicated (assembles dask graph + builds xr.DataArray inline using geo_to_coords / coords_from_geo_info) | shared (_finalize_eager_read) | duplicated (assembles cupy dask graph + builds xr.DataArray inline using coords_from_geo_info) | duplicated (builds xr.DataArray inline using coords_from_pixel_geometry; documented divergence per the VRT branch in _backends/vrt.py) | duplicated (per-chunk delayed graph + inline xr.DataArray build) |

Write backends

The TIFF write contract is the inverse of the read contract: validate the DataArray, resolve transform / CRS / nodata from the attrs, lay out the output, encode, and emit bytes. Steps 4 and 5 (decode, orientation) have no write analogue; to_geotiff and write_geotiff_gpu always emit Orientation = 1 and rely on the writer assembler (_writer.write) for photometric handling.

| Step | to_geotiff (CPU eager / dask) | write_geotiff_gpu | write_vrt |
| --- | --- | --- | --- |
| 1. source / kwarg validation | shared (_validate_tile_size_arg, _validate_3d_writer_dims, _validate_writer_spatial_shape, _validate_nodata_arg, _validate_no_rotated_affine); duplicated inline compression / compression_level / cog / overview_levels / bigtiff / streaming_buffer_bytes / max_z_error / photometric / allow_internal_only_jpeg / allow_experimental_codecs value rejections | shared (_validate_tile_size_arg, _validate_3d_writer_dims, _validate_writer_spatial_shape, _validate_nodata_arg, _validate_no_rotated_affine); duplicated inline GPU-specific kwarg rejections (predictor, compression, cog, etc.) | shared (_validate_nodata_arg); duplicated inline path / vrt_path shim, crs / crs_wkt shim, source path validation |
| 2. metadata parse | N/A (no source to parse; reads attrs off the DataArray) | N/A | duplicated (reads geokeys from the first source file to inherit CRS / nodata; lives in _vrt.write_vrt) |
| 3. transform / georef classification | shared (_transform_from_attr then _coords_to_transform fallback; _require_transform_for_georeferenced; _has_no_georef_marker; _resolve_spatial_coords); shared validate_write_metadata (_check_write_conflicting_crs, _check_write_conflicting_nodata, _check_write_non_uniform_coords) | shared (same helpers as CPU eager) | duplicated (VRT XML emitter reads transform from the first source; CRS resolved via _resolve_crs_to_wkt) |
| 4. pixel decode | N/A | N/A | N/A |
| 5. orientation / photometric | duplicated (writer assembler _writer.write owns photometric tag emission; no shared resolver) | duplicated (same; emits Orientation=1) | N/A |
| 6. nodata mask + dtype cast | shared (_resolve_nodata_attr, _should_restore_nan_sentinel); duplicated inline NaN-to-sentinel restore step in _writer.write | shared (_resolve_nodata_attr, _should_restore_nan_sentinel); duplicated inline GPU NaN-to-sentinel restore | duplicated (VRT carries the per-band <NoDataValue> as a string; XML emitter writes it verbatim) |
| 7. attrs finalization | duplicated (writer pulls attrs['transform'], attrs['crs'], attrs['crs_wkt'], attrs['nodata'] inline; rich tags via _extract_rich_tags) | duplicated (same inline pull as CPU writer) | duplicated (VRT XML emitter writes attrs the source files carry) |
| 8. DataArray construction | N/A (write path); the eager writer optionally returns the path string | N/A | N/A (returns the VRT path) |

Intended flow

The Phase 2 – Phase 6 work named in #2211 moves the duplicated cells above onto the helpers below. Each future PR should land in one row of one table. New backend code added between now and then should call these helpers directly rather than re-inlining the logic.

Read flow

source kwarg
  -> _validate_dispatch_kwargs   (step 1; bundles _validate_overview_level_arg)
  -> _validate_chunks_arg as needed
  -> _read_geo_info / _parse_cog_http_meta  (step 2)
  -> extract_geo_info_with_overview_inheritance
  -> validate_read_metadata     (step 3 -- rotated / unparseable / mixed-band)
  -> read_to_array              (steps 4 + 5; per-chunk for dask)
  -> _finalize_eager_read       (steps 6 + 7 + 8 for eager backends)
     OR
     _finalize_lazy_read_attrs  (step 7 only; caller assembles graph for 4 + 6 + 8)
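
The ordering in the flow above, including the eager / lazy split at the end, can be sketched with placeholder callables. Everything in this block is a stand-in: the stubs only record their names, and the real helpers take arguments that are not documented on this page. Using `chunks is not None` to mean "lazy" is also an assumption.

```python
calls = []


def _stub(name):
    """Return a placeholder step that records its name and passes state on."""
    def step(state):
        calls.append(name)
        return state
    return step


def read_sketch(source, chunks=None):
    """Run placeholder helpers in the order given by the read flow."""
    lazy = chunks is not None
    names = ["_validate_dispatch_kwargs"]
    if lazy:
        names.append("_validate_chunks_arg")
    names += [
        "_read_geo_info",
        "extract_geo_info_with_overview_inheritance",
        "validate_read_metadata",
    ]
    if lazy:
        # Decode (4), masking (6) and DataArray build (8) are left to the
        # caller's graph; only attrs are finalized here.
        names.append("_finalize_lazy_read_attrs")
    else:
        names += ["read_to_array", "_finalize_eager_read"]
    state = {"source": source, "chunks": chunks}
    for name in names:
        state = _stub(name)(state)
    return state


read_sketch("example.tif")
eager_order = list(calls)
```

The point of the split is that eager backends hand steps 6, 7, and 8 to one helper, while lazy backends finalize attrs up front and defer the pixel work to per-chunk tasks.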

Canonical helper per step:

| Step | Canonical helper |
| --- | --- |
| 1. kwarg validation | _validate_dispatch_kwargs (which bundles _validate_overview_level_arg); _validate_chunks_arg per-backend |
| 2. metadata parse | _read_geo_info (local mmap) / _parse_cog_http_meta (cloud) |
| 3. transform classification | _populate_attrs_from_geo_info (driven by geo_info.has_georef / rotated_affine / CRS-only fields) |
| 4. pixel decode | read_to_array (CPU); GPU decoders remain backend-specific but must converge on a single decode_window entry point in Phase 5 |
| 5. orientation / photometric | read_to_array (CPU); _apply_orientation_gpu + _apply_orientation_geo_info (GPU) |
| 6. nodata mask + dtype cast | _apply_eager_nodata_mask + _validate_dtype_cast (eager); for lazy backends the per-chunk mask stays inline but must call the same sentinel-resolution helper |
| 7. attrs finalization | _finalize_eager_read (eager) / _finalize_lazy_read_attrs (lazy). Both wrap _validate_read_geo_info + _populate_attrs_from_geo_info + _set_nodata_attrs |
| 8. DataArray construction | _finalize_eager_read (eager). Lazy backends remain inline because the dask graph assembly varies per backend; coords always come from coords_from_geo_info / coords_from_pixel_geometry |

Write flow

DataArray
  -> _validate_3d_writer_dims, _validate_writer_spatial_shape  (step 1)
  -> _validate_no_rotated_affine, _validate_nodata_arg, _validate_tile_size_arg
  -> _resolve_spatial_coords, _has_no_georef_marker
  -> _transform_from_attr (preferred) or _coords_to_transform (fallback)  (step 3)
  -> _require_transform_for_georeferenced
  -> _validate_crs_arg / _wkt_to_epsg / _resolve_crs_to_wkt
  -> validate_write_metadata    (step 3 -- conflicting CRS / nodata / non-uniform coords)
  -> _resolve_nodata_attr, _should_restore_nan_sentinel  (step 6)
  -> _extract_rich_tags         (step 7 -- attrs surface)
  -> _writer.write / GPU encoder
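
The transform fallback in the flow above (use attrs['transform'] when present, otherwise derive it from coords) depends on the coords being uniformly spaced. A minimal sketch of that derivation follows. It assumes coords mark pixel centres and that the transform is a 6-tuple in (a, b, c, d, e, f) affine order; `coords_to_transform_sketch` is a hypothetical name and is not the signature of `_coords_to_transform`.

```python
import numpy as np


def coords_to_transform_sketch(x, y, rtol=1e-6):
    """Hypothetical stand-in: derive (a, b, c, d, e, f) from centre coords."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.size < 2 or y.size < 2:
        raise ValueError("need at least two coordinates per axis")
    dx, dy = np.diff(x), np.diff(y)
    # Non-uniform spacing cannot be expressed as a single affine transform.
    uniform = np.allclose(dx, dx[0], rtol=rtol) and np.allclose(dy, dy[0], rtol=rtol)
    if not uniform:
        raise ValueError("non-uniform coordinate spacing")
    res_x, res_y = float(dx[0]), float(dy[0])
    # Coords mark pixel centres; the transform origin is the outer corner,
    # half a pixel back along each axis.
    origin_x = float(x[0]) - res_x / 2
    origin_y = float(y[0]) - res_y / 2
    return (res_x, 0.0, origin_x, 0.0, res_y, origin_y)


transform = coords_to_transform_sketch([0.5, 1.5, 2.5], [9.5, 8.5, 7.5])
```

For a north-up raster the y spacing comes out negative, so the half-pixel shift moves the origin up to the top edge rather than down.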

Canonical helper per step:

| Step | Canonical helper |
| --- | --- |
| 1. kwarg / dim validation | _validate_3d_writer_dims, _validate_writer_spatial_shape, _validate_no_rotated_affine, _validate_nodata_arg, _validate_tile_size_arg |
| 3. transform / CRS / nodata | _transform_from_attr + _coords_to_transform fallback; _require_transform_for_georeferenced; _resolve_crs_to_wkt + _wkt_to_epsg; validate_write_metadata |
| 6. nodata cast / restore | _resolve_nodata_attr, _should_restore_nan_sentinel |
| 7. attrs surface | _extract_rich_tags (and _populate_attrs_from_geo_info on the read side defines the canonical attr names the writer should emit) |

Phase plan reference

| Phase | Target rows from the per-backend tables |
| --- | --- |
| 1 (this doc) | none – inventory only |
| 2 | step 3 (transform / georef) across every read and write backend |
| 3 | step 6 (nodata mask + dtype cast) across every read backend |
| 4 | step 7 (attrs finalization) across every read backend |
| 5 | step 2 (metadata parse) extraction into _sources.py / _decode.py / _layout.py modules |
| 6 | cross-backend parity tests covering every step on every backend |

See #2211 for the full epic plan. Subsequent PRs in this series each target one row of one table and land behind cross-backend parity tests.
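
A cross-backend parity check might compare pixel values and a fixed set of attrs between a reference backend and a candidate. The sketch below works on plain `(values, attrs)` pairs so that it stays self-contained. `assert_parity` and the sample data are hypothetical, and a real test would pull `.values` and `.attrs` from the DataArray each backend returns.

```python
import numpy as np

# Attr names taken from the step 7 list on this page.
PARITY_ATTRS = ("transform", "crs", "nodata", "georef_status")


def assert_parity(reference, candidate, attr_keys=PARITY_ATTRS):
    """Compare two (values, attrs) results from different backends."""
    ref_values, ref_attrs = reference
    cand_values, cand_attrs = candidate
    # assert_array_equal treats NaNs in matching positions as equal.
    np.testing.assert_array_equal(ref_values, cand_values)
    for key in attr_keys:
        assert ref_attrs.get(key) == cand_attrs.get(key), f"attr {key!r} differs"


attrs = {"transform": (1.0, 0.0, 0.0, 0.0, -1.0, 2.0), "crs": 4326,
         "nodata": -9999, "georef_status": "full"}
eager = (np.array([[1.0, np.nan]]), dict(attrs))
lazy = (np.array([[1.0, np.nan]]), dict(attrs))
assert_parity(eager, lazy)
```

Restricting the comparison to a named attr list lets a documented divergence, such as nodata_pixels_present on lazy outputs, be excluded deliberately rather than masked by accident.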