Geostatistical Learning

taking the geospatial setting into account in performing statistical learning

geostatistical (transfer) learning problem

Intro

  • classical learning theory cannot be applied straightforwardly to solve problems in geosciences
  • as the characteristics of these problems violate
    • fundamental assumptions to derive…
      • e.g. the assumptions needed to estimate the generalization (or prediction) error of learned models on unseen samples, which is crucial in practice

Leave-one-out (1974) also known as Cross-Validation

  • method for assessing and selecting learning models
    • was based on the idea that to estimate the prediction error
      • on an unseen sample one only needs to hide a seen sample from a dataset
        • and learn the model
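The leave-one-out idea above can be sketched in a few lines of Python (a minimal illustration; the 1-nearest-neighbor regressor is a hypothetical stand-in for any learned model):

```python
import numpy as np

def leave_one_out_error(X, y, fit_predict):
    """Average prediction error when each sample is hidden in turn."""
    errors = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i          # hide sample i
        y_hat = fit_predict(X[mask], y[mask], X[i])
        errors.append((y_hat - y[i]) ** 2)     # squared error on hidden sample
    return float(np.mean(errors))

def nn_fit_predict(X_train, y_train, x_new):
    """1-nearest-neighbor prediction (stand-in for any learned model)."""
    j = np.argmin(np.abs(X_train - x_new))
    return y_train[j]

X = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * X
err = leave_one_out_error(X, y, nn_fit_predict)
```

Each iteration hides exactly one sample, learns on the rest, and scores the prediction on the hidden sample.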

k-fold cross validation (1975)

  • a family of error estimation methods that split a dataset
    • into non-overlapping "folds" for model evaluation

generalization of leave-one-out

  • may introduce bias in the error estimates
    • if the number of samples in the folds used for learning
      • is much smaller than the original number of samples
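A minimal sketch of the k-fold splitting idea, assuming contiguous folds over sample indices (fold construction only, no model fitting):

```python
import numpy as np

def k_fold_splits(n, k):
    """Yield (train, test) index arrays for k non-overlapping folds."""
    folds = np.array_split(np.arange(n), k)
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test
```

With small k, each training fold holds only (k-1)/k of the samples, which is the source of the bias noted above.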

assumptions of methods above

  • samples come from independent and identically distributed (i.i.d.) random variables
    • spatial samples are not i.i.d. and spatial correlation needs to be modeled explicitly (with geostatistical theory)
  • sample mean of the empirical error used in the methods is an unbiased estimator
    • of the prediction error regardless of the i.i.d. assumption,
      • but the precision of the estimator can be degraded considerably with non-i.i.d. samples

h-block leave-one-out (1995)

  • developed for time-series data
  • it is based on the principle that stationary processes
    • have a correlation length `h`
      • beyond which samples are no longer correlated
  • the time series data is split
    • such that samples used for error evaluation
      • are at least `h` steps distant from the samples used to learn the model
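The h-block split can be sketched as follows (a minimal illustration, assuming the correlation length `h` is measured in index steps):

```python
import numpy as np

def h_block_indices(n, i, h):
    """Training indices for evaluation point i with guard distance h."""
    idx = np.arange(n)
    # keep only samples strictly farther than h steps from the test point
    return idx[np.abs(idx - i) > h]

# evaluating on sample 4 with h=2 excludes samples 2..6 from training
train = h_block_indices(10, i=4, h=2)
```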

Spatial leave-one-out (2014)

  • generalization of h-block leave-one-out
    • from time-series to spatial data
      • where blocks have multiple dimensions

Block cross-validation (2016)

  • similar to k-fold cross-validation
    • faster alternative to spatial leave-one-out
  • creates folds using blocks of size equal to spatial correlation length
    • and separates samples for error evaluation
      • from samples used to learn the model
  • introduces concept of `dead zones`
    • regions discarded to avoid over-optimistic error estimates
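A minimal 1D sketch of block cross-validation with dead zones, assuming blocks of size equal to the correlation length `r` and a dead zone of the same width around the test block (the actual method operates on multi-dimensional blocks):

```python
import numpy as np

def block_folds(coords, r):
    """Yield (train, test) index arrays per spatial block of size r."""
    block_id = np.floor(coords / r).astype(int)
    for b in np.unique(block_id):
        test = np.where(block_id == b)[0]
        # dead zone: drop training samples closer than r to any test sample
        dists = np.abs(coords[:, None] - coords[test][None, :]).min(axis=1)
        train = np.where((block_id != b) & (dists >= r))[0]
        yield train, test
```

The `dists >= r` filter is what implements the dead zone: nearby samples outside the test block are discarded rather than used for learning.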

2nd assumption for estimating generalization error in a classical learning theory

  • The distribution of unseen samples to which the model will be applied is equal to the distribution of samples over which the model was trained.
    • just not very realistic for geosciences
      • as these usually involve many variables with different variability

Transfer learning introduces methods more amenable for geosciences

  • e.g. the covariate shift problem
    • where the samples on which the model is applied
      • have a distribution of covariates that differs from
        • the distribution of covariates over which the model was trained

Importance-weighted cross-validation (2007)

  • under covariate shift, cross validation is not unbiased
  • importance weights can be considered for each sample
    • to recover the unbiasedness property of the method
  • the method is unbiased under covariate shift for supervised learning tasks
    • regression and classification
  • importance weights used are ratios between the test/target probability density
    • and the source/train probability density of covariates
  • Density ratios are useful in a broader set of applications
    • two-sample tests, outlier detection, and distribution comparison
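A minimal sketch of the importance-weighting idea, assuming (unrealistically) that the source and target covariate densities are known Gaussians; in practice the density ratio itself must be estimated:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density (stand-in for the true covariate densities)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def importance_weighted_error(losses, x_source, mu_s, mu_t, sigma=1.0):
    """Reweight per-sample losses by the target/source density ratio."""
    w = gaussian_pdf(x_source, mu_t, sigma) / gaussian_pdf(x_source, mu_s, sigma)
    return float(np.mean(w * losses))
```

When the two densities coincide, all weights equal 1 and the estimate reduces to the plain empirical mean of the losses.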

GEOSTATISTICAL LEARNING

definition

we define elements of statistical learning in a geospatial setting

  • Consider a sample space \(\Omega\)
  • a source spatial domain \(\mathcal{D}_{s} \subset \mathbb{R}^{d_{s}}\)
  • and a target spatial domain \(\mathcal{D}_{t}\subset \mathbb{R}^{d_{t}}\)
    • on which stochastic processes (spatial random variables) are defined

      \(Z_{s_{j}} : \mathcal{D}_{s}\times\Omega \rightarrow \mathbb{R}\) , \(j = 1,2,...,n_{s}\) on source domain

      \(Z_{t_{j}} : \mathcal{D}_{t}\times\Omega \rightarrow \mathbb{R}\) , \(j = 1,2,...,n_{t}\) on target domain

practice example

given \((Z_{s_j})_{j=1,2,...,n_s}\)

  • may represent a collection of processes
    • observed remotely from satellite on a 2D surface \(\mathcal{D}_{s} \subset \mathbb{R}^{2}\)
  • whereas \((Z_{t_j})_{j=1,2,...,n_t}\)
    • may represent a collection of processes
      • occurring within the 3D subsurface of the earth \(\mathcal{D}_{t} \subset \mathbb{R}^3\)

source and target domains

  • any process \(Z\) in these collections
    • can be viewed in two distinct ways

      • Geostatistical Theory
        • samples \(z(\cdot,\omega)\)
          • of the process \(Z(u,\omega)\)
            • are obtained by fixing \(\omega \in \Omega\)
        • samples are spatial maps
          • that assign a real number to
            • each location \(u\in\mathcal{D}\)
      • Learning Theory
        • scalar samples \(z(u, \cdot)\)
          • are obtained by fixing \(u \in \mathcal{D}\)
        • scalar samples are ordered into a feature vector \(x_{u}=(z_{1},z_2,...,z_n)\)
          • for a collection of processes \((Z_{j})_{j=1,2,...,n}\)
            • and for a specific location \(u \in \mathcal{D}\)
        • in this case
          • \(X_{u} : \Omega \rightarrow \mathbb{R}^{n}\) denotes the corresponding random vector of features
            • such that \(x_{u} \sim X_{u}\)
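The learning-theory view above can be illustrated with a short sketch, assuming two hypothetical processes observed on a 4x4 grid: fixing a location \(u\) collects one scalar per process into a feature vector \(x_u\):

```python
import numpy as np

# hypothetical realizations of two processes Z_1, Z_2 on a 4x4 grid
rng = np.random.default_rng(0)
z1 = rng.standard_normal((4, 4))
z2 = rng.standard_normal((4, 4))

# fixing a location u (one grid cell) yields one scalar per process;
# stacking them row-wise gives one feature vector x_u per location
features = np.stack([z1.ravel(), z2.ravel()], axis=1)  # shape (16, 2)
```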

joint probability distribution of features

\(Pr(\{X_{u}\}_{u\in\mathcal{D}})\)

  • feature vectors \(X_u\) and \(X_v\)
    • for two different locations \(u \neq v\) are not independent
      • the closer the locations \(u,v\in\mathcal{D}\) in the spatial domain
        • the more similar are their features \(x_{u},x_{v}\in\mathbb{R}^{n}\) in the feature space
  • given that only one realization \(z^{obs}=z(\cdot,\omega)\sim Z\)
    • of the process is available at any given time
      • one must introduce stationarity assumptions inside \(\mathcal{D}\)
        • to pool together different scalar samples \(z(u,\cdot)\)
          • from different locations \(u\in \mathcal{D}\) in the spatial domain
            • and be able to estimate the distribution
  • regardless of stationarity assumptions involved in modeling…
    • we can assume that inside \(\mathcal{D}\) the probability \(Pr_{\mathcal{D}} (X) = Pr(\{X_u\}_{u\in\mathcal{D}})\) is well defined.

practice example

  • assume pointwise probability of features \(Pr_u(X)=Pr(X_u)\)

    • is not a function of location
      • that is \(Pr_u(X)=Pr(X),\forall u \in \mathcal{D}\)
    • under this assumption
      • samples from everywhere in \(\mathcal{D}\)
        • are used to estimate \(Pr(X)=Pr(Z_1,Z_2,...,Z_n)\)
          • with additional assumption that the feature vectors \(X_u\) and \(X_v\) are independent
  • the joint distribution of features for all locations can be written as…

    \(Pr_{\mathcal{D}}(X)=\prod_{u\in\mathcal{D}}Pr_u(X)\)

the assumption of spatial independence is rarely defensible

  • pointwise stationarity often does not transfer from a source domain
    • where the model is learned to a target domain
      • where the model is applied
    • and consequently the joint distributions of features
      • differ: \(Pr_{\mathcal{D}_s}(X) \neq Pr_{\mathcal{D}_t}(X)\)

spatial learning tasks

  • similar to classical learning tasks
    • but can leverage properties of the underlying spatial domain
  • classically, a learning task describes an action
    • in terms of available features to produce new data
      • e.g. "predict feature \(Z_{j_0}\) from features \((Z_{j_1},Z_{j_2})\) "
      • "cluster the samples using features \((Z_{j_1},Z_{j_2},Z_{j_3})\) "
  • spatially, a learning task \(T\) involves the spatial domain \(\mathcal{D}\) besides the features
    • e.g. Agriculture: the task of identifying crops from satellite images
      • locations that have the same crop type appear together
        • despite presence of noise in image layers
    • e.g. Mining: the task of segmenting mineral deposit from drillhole samples
      • using a set of features
        • assuming the segmentation result to be a contiguous volume of rock
          • which is an additional constraint in terms of spatial coordinates

geostatistical learning definition

  • let \(\mathcal{D}_{s}\) be a source spatial domain
    • and \(\mathcal{D}_{t}\) be a target spatial domain
  • let \(Pr_{D_s}(X_s)\)
    • and \(Pr_{D_t}(X_t)\)
      • be the joint distributions of features for all locations in these domains
        • and let \(T_s\) and \(T_t\) be two spatial learning tasks
  • geostatistical learning consists of learning \(T_t\) over \(D_t\)
    • using knowledge acquired while learning \(T_s\) over \(D_s\)
      • and assuming that the observed spatial data in \(D_s\) and \(D_t\)
        • are both a single spatial sample of \(Pr_{D_s}(X_s)\) and \(Pr_{D_t}(X_t)\)

covariate shift

  • assume that two spatial domains are different \(\mathcal{D}_{s} \neq \mathcal{D}_{t}\)
    • they share a set of processes \((Z_1,Z_2,...,Z_n)\)
  • additionally assume that pointwise stationarity holds
  • let \(Z_0 = f(Z_1,Z_2,...,Z_n)\) be a new process
    • obtained as a function of the shared processes
    • and assume that it has only been observed in \(\mathcal{D}_s\)
      • via a measuring device and or manual labeling
      • that is \(z_0^{obs}(\cdot,\omega)\sim Z_0\)
        • is a spatial sample of the process \(Z_0\) over \(\mathcal{D}_s\)
  • under these assumptions… \(X_s = X_t = X = (Z_1,Z_2,...,Z_n,Z_0)\)
    • and the supervised learning task \(T_s = T_t = T\)
      • of predicting the process \(Z_0\)
        • regardless of location \(u \in \mathcal{D}_s \cup \mathcal{D}_t\)
  • let \(\mathcal{X} = X_{1:n}\) be the explanatory features
  • and \(\mathcal{Y}=X_{n+1}\) be the response feature
  • for any \(u\in \mathcal{D}_s\)
    • we can write \(Pr(\mathcal{X},\mathcal{Y}) = Pr_u(\mathcal{Y}|\mathcal{X})Pr_u(\mathcal{X})\)
  • likewise for any \(v \in \mathcal{D}_t\)
    • we can write \(Pr(\mathcal{X},\mathcal{Y}) = Pr_v(\mathcal{Y}|\mathcal{X})Pr_v(\mathcal{X})\)

covariate shift defined as follows

  • a geostatistical learning problem has the covariate shift property
    • when for any \(u\in\mathcal{D}_s\) and for any \(v \in \mathcal{D}_t\)
      • the distributions \(Pr_u(\mathcal{X},\mathcal{Y})\) and \(Pr_v(\mathcal{X},\mathcal{Y})\)
        • differ by \(Pr_u(\mathcal{X})\neq Pr_v(\mathcal{X})\)
          • while \(Pr_u(\mathcal{Y}|\mathcal{X})=Pr_v(\mathcal{Y}|\mathcal{X})\) for each and every location
  • this property is based on the idea that the underlying true function \(f\)
    • that created the process \(\mathcal{Y}=f(\mathcal{X})\)
      • is the same for all \(u\in \mathcal{D}_s\) and all \(v\in\mathcal{D}_t\)
    • in this case the function is approximated
      • by the conditional distribution \(Pr_u(\mathcal{Y}|\mathcal{X})=Pr_v(\mathcal{Y}|\mathcal{X})\)
        • for each and every location
  • due to the great variability in natural processes
    • whenever a model is…
      • learned using labels provided by experts on a source spatial domain
      • and validated with classical train-validation-test methodologies
    • it often performs poorly on a target spatial domain
      • where the labeling function is expected to be the same
        • because there will be shifts in the distribution of covariates

spatial correlation

spatial dependence is often ignored

  • the closer are two locations \(u,v \in \mathcal{D}\) in a spatial domain
    • the more similar are their features \(x_u, x_v \in \mathbb{R}^n\) in the feature space
  • a tool to quantify this spatial dependence in a collection of samples
    • is the variogram \(\gamma(h)\) which estimates
      • for each spatial lag \(h = ||u-v|| \in \mathbb{R}_0^+\)
        • a correlation \(\sigma^2 - \gamma(h)\) where
          • \(\sigma^2\) is the total sill in the samples
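A minimal sketch of empirical variogram estimation, assuming regularly spaced 1D samples (real implementations bin irregular lags and parallelize the pairwise computations):

```python
import numpy as np

def empirical_variogram(z, max_lag):
    """gamma(h) = 0.5 * mean((z[i+h] - z[i])^2) for h = 1..max_lag."""
    return np.array([0.5 * np.mean((z[h:] - z[:-h]) ** 2)
                     for h in range(1, max_lag + 1)])
```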

parallel algorithms for efficient variogram estimation

  • these can be useful tools for fast diagnosis of the spatial correlation property

    definition

    • a geostatistical learning problem has the spatial correlation property
      • when the variogram of any of the stochastic processes \((Z_{s_j})_{j=1,2,...,n_s}\) and \((Z_{t_j})_{j=1,2,...,n_t}\) defined over \(\mathcal{D}_s\) and \(\mathcal{D}_t\) has a non-negligible positive range (or correlation length)

variograms can be used to simulate spatial processes

  • with theoretical correlation structure

  • in the feature space of two independent spatial processes \(Z_1\) and \(Z_2\)

    • simulated with direct (a.k.a. LU) Gaussian simulation
  • as we increase the variogram range \(r\) in a spatial domain \(\mathcal{D}\) with 100x100 pixels

    • we observe that the distribution of features \(Pr(\mathcal{X})=Pr(Z_1,Z_2)\)
      • is gradually deformed away from a standard Gaussian as the range increases from \(r=0\) to \(r=80\)
  • we illustrate the impact of spatial correlation for an interprocess correlation

    • of \(\rho(Z_1,Z_2)= 0.9\)
  • spatial correlations may have different impact in source and target domains

    • and can certainly affect the generalization error of learning models
  • we assume the variogram ranges of source and target processes are equal

    • to facilitate the analysis of the results
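The direct (a.k.a. LU) Gaussian simulation mentioned above can be sketched via a Cholesky factorization, assuming a hypothetical exponential covariance with range `r` on a 1D grid:

```python
import numpy as np

def lu_gaussian_simulation(coords, r, rng):
    """One realization of a zero-mean Gaussian process with range r."""
    H = np.abs(coords[:, None] - coords[None, :])   # pairwise lags
    C = np.exp(-3.0 * H / r)                        # exponential covariance
    # small jitter on the diagonal keeps the factorization numerically stable
    L = np.linalg.cholesky(C + 1e-10 * np.eye(len(coords)))
    # multiplying the lower factor by white noise imposes covariance C
    return L @ rng.standard_normal(len(coords))

rng = np.random.default_rng(0)
z = lu_gaussian_simulation(np.linspace(0, 10, 50), r=5.0, rng=rng)
```

The method factorizes the full covariance matrix, so it is exact but only practical for modest grid sizes.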

in practice source and target processes may also have different spatial correlation

  • which is a type of a shift that is not considered in classical transfer learning problems

generalization error of learning models

  • importance-weighted approximation of a related generalization error
    • based on pointwise stationarity assumptions
      • and the use of an efficient importance-weighted cross-validation
        • method for error estimation
  • consider a geostatistical learning problem \(\mathcal{P}=\{(\mathcal{D}_s, Pr_{\mathcal{D}_s}, \mathcal{T}_s),(\mathcal{D}_t, Pr_{\mathcal{D}_t}, \mathcal{T}_t)\}\) with a single supervised spatial learning task \(\mathcal{T}_s=\mathcal{T}_t=\mathcal{T}\) (e.g. regression)
    • and assume that a set of response features \(\mathcal{Y}_u\)
      • are created by a function \(f\)
        • based on a set of explanatory features \(\mathcal{X}_u\)
          • for each and every location \(u \in \mathcal{D}_s \cup \mathcal{D}_t\)
  • our goal is to learn a model \(\{\mathcal{Y}_u\}_{u\in\mathcal{D}_t} \approx \hat{f}(\{\mathcal{X}_u\}_{u\in\mathcal{D}_t})\)
    • over the target domain \(\mathcal{D}_t\)
      • that approximates \(f\) in terms of expected risk
        • for some spatial supervised loss function \(\mathcal{L}\): \(\hat{f}=\arg\min_{g} \mathbb{E}_{Pr_{\mathcal{D}_t}}[\mathcal{L}(\{\mathcal{Y}_u\}_{u\in\mathcal{D}_t}, g(\{\mathcal{X}_u\}_{u\in\mathcal{D}_t}))]\)
  • spatial samples of the processes are drawn from the probability distribution of the target domain
    • and rearranged into feature vectors \(x_u\) for every location \(u\) in the target domain
  • the spatial loss function compares the spatial map of features from the sample
    • with the approximated map from a candidate model \(g\)
      • the model \(\hat{f}\) is the one that minimizes the expected loss (or risk)
        • over the target domain

unlike classical definition of generalization error

  • the definition above for geostatistical learning problems
    • relies on a spatial loss function
      • and on spatial samples
        • like those produced via geostatistical simulation
  • for truly spatial learning models \(\hat{f}\)
    • that use multiple locations in the spatial domain to make predictions
      • this generalization error is more appropriate
    • classical theory considers only pointwise learning and does not target spatial learning models

density ratio estimation

weighted cross-validation