xrspatial.zonal.stats#

xrspatial.zonal.stats(zones, values: DataArray, zone_ids: List[int | float] | None = None, stats_funcs: Dict | List | None = None, nodata_values: int | float = None, return_type: str = 'pandas.DataFrame', column: str | None = None, rasterize_kw: dict | None = None) DataFrame | DataFrame | DataArray[source]#

Calculate summary statistics for each zone defined by a zones dataset, based on values aggregate.

A single output value is computed for every zone in the input zones dataset.

This function currently supports numpy backed, and dask with numpy backed xarray DataArrays.

Parameters:
  • zones (xr.DataArray, GeoDataFrame, or list of (geometry, value) pairs) –

    Zone definitions. Can be:

    • A 2D xarray DataArray of numeric zone IDs.

    • A geopandas.GeoDataFrame (requires column).

    • A list of (shapely geometry, zone_id) pairs.

    When vector input is provided, rasterize() is called internally using values as the template grid. Results depend on raster resolution.

  • values (xr.DataArray or xr.Dataset) – values is a 2D xarray DataArray of numeric values (integers or floats). The input values raster contains the input values used in calculating the output statistic for each zone. In dask case, the chunksizes of zones and values should be matching. If not, values will be rechunked to be the same as of zones. When a Dataset is passed, stats are computed for each variable and columns are prefixed with the variable name (e.g. elevation_mean). For 3D time-series DataArrays, convert to a Dataset first using .to_dataset(dim='time') and pass the resulting Dataset.

  • zone_ids (list of ints, or floats) – List of zones to be included in calculation. If no zone_ids provided, all zones will be used.

  • stats_funcs (dict, or list of strings, optional) – The statistics to calculate for each zone. If a list, possible choices are subsets of the default options. In the dictionary case, all of its values must be callable. Function takes only one argument that is the values raster. The key becomes the column name in the output DataFrame. Defaults: ['mean', 'max', 'min', 'sum', 'std', 'var', 'count', 'majority'] for numpy/cupy and ['mean', 'max', 'min', 'sum', 'std', 'var', 'count'] for dask-backed inputs. 'majority' cannot be computed block-by-block so requesting it on a dask input raises ValueError instead of being silently dropped. Note that if zones and values are dask-backed DataArrays, stats_funcs must be provided as a list (or left unset).

  • nodata_values (int, float, default=None) – Nodata value in values raster. Cells with nodata_values do not belong to any zone, and thus excluded from calculation.

  • return_type (str, default='pandas.DataFrame') –

    Format of returned data. Must be one of:

    • 'pandas.DataFrame': one row per zone, one column per statistic (plus a zone column). For dask-backed inputs the result is a dask.dataframe.DataFrame. This is the only value supported for cupy, dask, and Dataset inputs.

    • 'xarray.DataArray': a DataArray whose first dimension is 'stats' and whose remaining dims match values. Cells outside the requested zones are filled with NaN. Only supported for numpy-backed DataArray inputs.

    Any other value raises ValueError.

  • column (str, optional) – Column name in the GeoDataFrame that contains zone IDs. Required when zones is a GeoDataFrame; must not be set for list-of-pairs or DataArray input.

  • rasterize_kw (dict, optional) – Extra keyword arguments forwarded to rasterize() when zones is vector input (e.g. {'all_touched': True}).

Notes

Empty zones. A zone that exists in zones but has no valid values (every cell is NaN, or every cell equals nodata_values) still appears as a row in the output. For such a zone, count is 0 because the number of valid cells is zero. Every other statistic (mean, min, max, sum, std, var, majority, and any custom callable) is NaN, since those values are undefined over an empty set. This holds across the numpy, cupy, and dask backends.

Returns:

stats_df – A pandas DataFrame, or a dask DataFrame where each column is a statistic and each row is a zone with zone id. When values is a Dataset, the returned DataFrame has columns prefixed by the variable name (e.g. elevation_mean, elevation_max), and return_type must be 'pandas.DataFrame'.

Return type:

Union[pandas.DataFrame, dask.dataframe.DataFrame]

Examples

stats() works with NumPy backed DataArray

>>> import numpy as np
>>> import xarray as xr
>>> from xrspatial.zonal import stats
>>> height, width = 10, 10
>>> values_data = np.array([
    [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
    [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
    [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
    [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
    [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
    [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
    [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
    [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
    [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
    [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])
>>> values = xr.DataArray(values_data)
>>> zones_data = np.array([
    [ 0.,  0.,  0.,  0.,  0., 10., 10., 10., 10., 10.],
    [ 0.,  0.,  0.,  0.,  0., 10., 10., 10., 10., 10.],
    [ 0.,  0.,  0.,  0.,  0., 10., 10., 10., 10., 10.],
    [ 0.,  0.,  0.,  0.,  0., 10., 10., 10., 10., 10.],
    [ 0.,  0.,  0.,  0.,  0., 10., 10., 10., 10., 10.],
    [20., 20., 20., 20., 20., 30., 30., 30., 30., 30.],
    [20., 20., 20., 20., 20., 30., 30., 30., 30., 30.],
    [20., 20., 20., 20., 20., 30., 30., 30., 30., 30.],
    [20., 20., 20., 20., 20., 30., 30., 30., 30., 30.],
    [20., 20., 20., 20., 20., 30., 30., 30., 30., 30.]])
>>> zones = xr.DataArray(zones_data)

>>> # Calculate Stats
>>> stats_df = stats(zones=zones, values=values)
>>> print(stats_df)
    zone  mean  max  min   sum       std    var  count
0   0    22.0   44    0   550  14.21267  202.0     25
1  10    27.0   49    5   675  14.21267  202.0     25
2  20    72.0   94   50  1800  14.21267  202.0     25
3  30    77.0   99   55  1925  14.21267  202.0     25

>>> # Custom Stats
>>> custom_stats ={'double_sum': lambda val: val.sum()*2}
>>> custom_stats_df = stats(zones=zones,
                            values=values,
                            stats_funcs=custom_stats)
>>> print(custom_stats_df)
    zone  double_sum
0   0     1100
1  10     1350
2  20     3600
3  30     3850
stats() works with Dask with NumPy backed DataArray
>>> import dask.array as da
>>> import dask.array as da
>>> values_dask = xr.DataArray(da.from_array(values_data, chunks=(3, 3)))
>>> zones_dask = xr.DataArray(da.from_array(zones_data, chunks=(3, 3)))
>>> # Calculate Stats with dask backed xarray DataArrays
>>> dask_stats_df = stats(zones=zones_dask, values=values_dask)
>>> print(type(dask_stats_df))
<class 'dask.dataframe.core.DataFrame'>
>>> print(dask_stats_df.compute())
    zone  mean  max  min   sum       std    var  count
0     0  22.0   44    0   550  14.21267  202.0     25
1    10  27.0   49    5   675  14.21267  202.0     25
2    20  72.0   94   50  1800  14.21267  202.0     25
3    30  77.0   99   55  1925  14.21267  202.0     25

stats() works with 3D time-series DataArrays via Dataset conversion

>>> # Convert a 3D time-series DataArray to a Dataset,
>>> # then pass to stats() to get per-timestep statistics.
>>> values_3d = xr.DataArray(
...     np.random.rand(2, 10, 10),
...     dims=['time', 'dim_0', 'dim_1'],
...     coords={'time': [2020, 2021]})
>>> ds = values_3d.to_dataset(dim='time')
>>> stats_df = stats(zones=zones, values=ds)
>>> # Columns: zone, 2020_mean, 2020_max, ..., 2021_mean, 2021_max, ...