taking the geospatial setting into account in performing statistical learning
geostatistical (transfer) learning problem
Intro
- classical learning theory cannot be applied straightforwardly to solve problems in geosciences, as the characteristics of these problems violate fundamental assumptions
- e.g. the assumptions used to derive estimates of the generalization (or prediction) error of learned models on unseen samples, which are crucial in practice
Leave-one-out (1974), also known as Cross-Validation
- method for assessing and selecting learning models
- based on the idea that to estimate the prediction error on an unseen sample, one only needs to hide a seen sample from the dataset and learn the model
k-fold cross validation (1975)
- a family of error estimation methods that split a dataset into non-overlapping "folds" for model evaluation
- a generalization of leave-one-out
- may introduce bias in the error estimates if the number of samples in the folds used for learning is much smaller than the original number of samples
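A minimal sketch of k-fold cross-validation in plain NumPy (the helper names `kfold_error`, `fit`, `predict` and the toy least-squares model are my own, not from the text):

```python
import numpy as np

def kfold_error(fit, predict, X, y, k=5, seed=0):
    """Estimate prediction error with k-fold cross-validation:
    split the dataset into k non-overlapping folds, hold out each
    fold once for evaluation, and learn the model on the rest."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        params = fit(X[train], y[train])
        errors.append(np.mean((predict(params, X[test]) - y[test]) ** 2))
    return float(np.mean(errors))

# toy model: ordinary least squares
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
```

Setting k equal to the number of samples recovers leave-one-out; as noted above, very small learning folds can bias the estimate.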
assumptions of methods above
- samples come from independent and identically distributed (i.i.d.) random variables
- spatial samples are not i.i.d., and spatial correlation needs to be modeled explicitly (with geostatistics theory)
- even if cross-validation still estimates the prediction error regardless of the i.i.d. assumption, the precision of the estimator can be degraded considerably with non-i.i.d. samples
h-block leave-one-out (1995)
- developed for time-series data
- based on the principle that stationary processes achieve a correlation length `h` after which the samples are no longer correlated
- the time-series data is split such that samples used for error evaluation are at least `h` steps distant from the samples used to learn the model
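The splitting rule can be sketched as follows (1D time indices; `h_block_splits` is an illustrative name, not from the text):

```python
import numpy as np

def h_block_splits(n, h):
    """h-block leave-one-out: for each held-out time index i, drop every
    sample closer than h steps to i from the training set, so training
    samples are at least h steps away from the evaluation sample."""
    for i in range(n):
        train = np.array([j for j in range(n) if abs(j - i) >= h])
        yield train, np.array([i])
```

Each of the n splits evaluates on a single sample while an `h`-wide block around it is excluded from learning.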
Spatial leave-one-out (2014)
- a generalization of h-block leave-one-out from time-series to spatial data, where blocks have multiple dimensions
Block cross-validation (2016)
- similar to k-fold cross-validation; a faster alternative to spatial leave-one-out
- creates folds using blocks of size equal to the spatial correlation length, and separates samples for error evaluation from samples used to learn the model
- introduces the concept of `dead zones`: regions discarded to avoid over-optimistic error estimates
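A 1D sketch of the idea, assuming scalar coordinates (the names, parameters, and the 1D simplification are mine; real implementations use multidimensional blocks):

```python
import numpy as np

def block_folds(coords, block_size, dead_zone):
    """Block cross-validation in 1D: folds are spatial blocks of width
    block_size (ideally the spatial correlation length); training samples
    closer than dead_zone to the test block are discarded ("dead zones")
    to avoid over-optimistic error estimates."""
    coords = np.asarray(coords, dtype=float)
    block_id = np.floor(coords / block_size).astype(int)
    for b in np.unique(block_id):
        test = np.flatnonzero(block_id == b)
        # distance of every sample to its nearest test sample
        dist = np.abs(np.subtract.outer(coords, coords[test])).min(axis=1)
        train = np.flatnonzero((block_id != b) & (dist > dead_zone))
        yield train, test
```

Unlike plain k-fold, nearby samples never end up on both sides of the split, which is what removes the optimism in the error estimate.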
2nd assumption for estimating generalization error in classical learning theory
- the distribution of unseen samples to which the model will be applied is equal to the distribution of samples over which the model was trained
- not very realistic for geosciences, which usually involve many variables with different variability processes
Transfer learning introduces methods more amenable to geosciences
- e.g. the covariate shift problem, where the samples on which the model is applied have a distribution of covariates that differs from the distribution of covariates over which the model was trained
Importance-weighted cross-validation (2007)
- under covariate shift, cross-validation is not unbiased
- importance weights can be considered for each sample to recover the unbiasedness property of the method
- the method is unbiased under covariate shift for supervised learning tasks (regression and classification)
- the importance weights used are ratios between the test/target probability density and the source/train probability density of the covariates
- density ratios are useful in a broader set of applications: two-sample tests, outlier detection, and distribution comparison
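A sketch under the strong assumption that both covariate densities are known Gaussians (in practice the density ratio must itself be estimated with dedicated density-ratio estimators); all helper names are illustrative:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def iwcv_error(X, y, w, fit, predict, k=5, seed=0):
    """Importance-weighted k-fold CV: each held-out squared error is
    multiplied by w(x) = p_target(x) / p_source(x), recovering an
    unbiased estimate of the target-domain error under covariate shift."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    weighted = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        params = fit(X[train], y[train])
        err = (predict(params, X[test]) - y[test]) ** 2
        weighted.extend(w[test] * err)
    return float(np.mean(weighted))

# source covariates ~ N(0,1), target covariates ~ N(1,1); the densities
# are known here only for illustration
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)
w = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 1.0)
y = 2.0 * x + 0.1 * rng.normal(size=200)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
err = iwcv_error(x[:, None], y, w, fit, predict, k=5)
```

Samples that are likely under the target density but rare under the source density receive large weights, so the estimate reflects target-domain performance.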
GEOSTATISTICAL LEARNING
definition
we define the elements of statistical learning in the geospatial setting
- consider a sample space \(\Omega\), a source spatial domain \(\mathcal{D}_{s} \subset \mathbb{R}^{d_{s}}\), and a target spatial domain \(\mathcal{D}_{t}\subset \mathbb{R}^{d_{t}}\), on which stochastic processes (spatial random variables) are defined:
  - \(Z_{s_{j}} : \mathcal{D}_{s}\times\Omega \rightarrow \mathbb{R}\), \(j = 1,2,\dots,n_{s}\) on the source domain
  - \(Z_{t_{j}} : \mathcal{D}_{t}\times\Omega \rightarrow \mathbb{R}\), \(j = 1,2,\dots,n_{t}\) on the target domain
practice example
- \((Z_{s_{j}})_{j=1,2,\dots,n_{s}}\) may represent a collection of processes observed remotely from a satellite on a 2D surface \(\mathcal{D}_{s} \subset \mathbb{R}^{2}\)
- whereas \((Z_{t_{j}})_{j=1,2,\dots,n_{t}}\) may represent a collection of processes occurring within the 3D subsurface of the earth \(\mathcal{D}_{t} \subset \mathbb{R}^3\)
source and target domains
- any process \(Z\) in these collections can be viewed in two distinct ways
- Geostatistical Theory: samples \(z(\cdot,\omega)\) of the process \(Z(u,\omega)\) are obtained by fixing \(\omega \in \Omega\); these samples are spatial maps that assign a real number to each location \(u\in\mathcal{D}\)
- Learning Theory: scalar samples \(z(u, \cdot)\) are obtained by fixing \(u \in \mathcal{D}\); for a collection of processes \((Z_{j})_{j=1,2,\dots,n}\) and a specific location \(u \in \mathcal{D}\), the scalar samples are ordered into a feature vector \(x_{u}=(z_{1},z_2,\dots,z_n)\); in this case \(X_{u} : \Omega \rightarrow \mathbb{R}^{n}\) denotes the corresponding random vector of features, such that \(x_{u} \sim X_{u}\)
Geostatistical Theory
- the joint probability distribution of features is \(Pr(\{X_{u}\}_{u\in\mathcal{D}})\)
- the feature vectors \(X_u\) and \(X_v\) for two different locations \(u \neq v\) are not independent: the closer the locations \(u,v\in\mathcal{D}\) in the spatial domain, the more similar are their features \(x_{u},x_{v}\in\mathbb{R}^{n}\) in the feature space
- given that only one realization \(z^{obs}=z(\cdot,\omega)\sim Z\) of the process is available at any given time, one must introduce stationarity assumptions inside \(\mathcal{D}\) to pool together scalar samples \(z(u,\cdot)\) from different locations \(u\in \mathcal{D}\) and be able to estimate the distribution
- regardless of the stationarity assumptions involved in the modeling, we can assume that inside \(\mathcal{D}\) the probability \(Pr_{\mathcal{D}}(X) = Pr(\{X_u\}_{u\in\mathcal{D}})\) is well defined
practice example
- assume the pointwise probability of features \(Pr_u(X)=Pr(X_u)\) is not a function of location, that is \(Pr_u(X)=Pr(X),\forall u \in \mathcal{D}\)
- under this assumption, samples from everywhere in \(\mathcal{D}\) are used to estimate \(Pr(X)=Pr(Z_1,Z_2,\dots,Z_n)\)
- with the additional assumption that the feature vectors \(X_u\) and \(X_v\) are independent, the joint distribution of features for all locations can be written as \(Pr_{\mathcal{D}}(X)=\prod_{u\in\mathcal{D}}Pr_u(X)\)
the assumption of spatial independence is rarely defensible
- pointwise stationarity often does not transfer from a source domain, where the model is learned, to a target domain, where the model is applied
- consequently, the joint distributions of features differ: \(Pr_{\mathcal{D}_s} \neq Pr_{\mathcal{D}_t}\)
spatial learning tasks
- similar to classical learning tasks, but can leverage properties of the underlying spatial domain
- classically, a learning task describes an action in terms of available features to produce new data
  - e.g. "predict feature \(Z_{j_0}\) from features \((Z_{j_1},Z_{j_2})\)"
  - e.g. "cluster the samples using features \((Z_{j_1},Z_{j_2},Z_{j_3})\)"
- spatially, a learning task \(T\) involves the spatial domain \(\mathcal{D}\) besides the features
  - e.g. Agriculture: the task of identifying crops from satellite images; locations that have the same crop type appear together, despite the presence of noise in the image layers
  - e.g. Mining: the task of segmenting a mineral deposit from drillhole samples using a set of features, assuming the segmentation result is a contiguous volume of rock, which is an additional constraint in terms of spatial coordinates
geostatistical learning definition
- let \(\mathcal{D}_{s}\) be a source spatial domain and \(\mathcal{D}_{t}\) be a target spatial domain
- let \(Pr_{\mathcal{D}_s}(X_s)\) and \(Pr_{\mathcal{D}_t}(X_t)\) be the joint distributions of features for all locations in these domains, and let \(T_s\) and \(T_t\) be two spatial learning tasks
- geostatistical learning consists of learning \(T_t\) over \(\mathcal{D}_t\) using the knowledge acquired while learning \(T_s\) over \(\mathcal{D}_s\), assuming that the observed spatial data in \(\mathcal{D}_s\) and \(\mathcal{D}_t\) are each a single spatial sample of \(Pr_{\mathcal{D}_s}(X_s)\) and \(Pr_{\mathcal{D}_t}(X_t)\), respectively
covariate shift
- assume that the two spatial domains are different, \(\mathcal{D}_{s} \neq \mathcal{D}_{t}\), but that they share a set of processes \((Z_1,Z_2,\dots,Z_n)\); additionally assume that pointwise stationarity holds
- let \(Z_0 = f(Z_1,Z_2,\dots,Z_n)\) be a new process obtained as a function of the shared processes, and assume that it has only been observed in \(\mathcal{D}_s\), via a measuring device and/or manual labeling
- that is, \(z_0^{obs}(\cdot,\omega)\sim Z_0\) is a spatial sample of the process \(Z_0\) over \(\mathcal{D}_s\)
- under these assumptions, \(X_s = X_t = X = (Z_1,Z_2,\dots,Z_n,Z_0)\), and the supervised learning task \(T_s = T_t = T\) of predicting the process \(Z_0\) is the same regardless of the location \(u \in \mathcal{D}_s \cup \mathcal{D}_t\)
- let \(\mathcal{X} = X_{1:n}\) be the explanatory features and \(\mathcal{Y}=X_{n+1}\) be the response feature
- for any \(u\in \mathcal{D}_s\) we can write \(Pr_u(\mathcal{X},\mathcal{Y}) = Pr_u(\mathcal{Y}|\mathcal{X})\,Pr_u(\mathcal{X})\), and likewise for any \(v \in \mathcal{D}_t\), \(Pr_v(\mathcal{X},\mathcal{Y}) = Pr_v(\mathcal{Y}|\mathcal{X})\,Pr_v(\mathcal{X})\)
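A toy construction (mine, not from the text) that exhibits exactly this property: the labeling function is shared, only \(Pr(\mathcal{X})\) changes between domains, and a model learned on the source degrades on the target:

```python
import numpy as np

rng = np.random.default_rng(42)
f = np.sin                             # shared labeling function: Y = f(X)

x_src = rng.normal(0.0, 1.0, 1000)     # source covariates, Pr_u(X)
x_tgt = rng.normal(2.5, 1.0, 1000)     # target covariates, Pr_v(X) != Pr_u(X)
y_src, y_tgt = f(x_src), f(x_tgt)      # Pr(Y|X) identical by construction

# learn a cubic polynomial on source samples only
coef = np.polyfit(x_src, y_src, deg=3)
mse_src = np.mean((np.polyval(coef, x_src) - y_src) ** 2)
mse_tgt = np.mean((np.polyval(coef, x_tgt) - y_tgt) ** 2)
```

The cubic fits well where source covariates are dense, but extrapolates poorly into the region the target covariates occupy, so the target error is much larger than the source error.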
covariate shift defined as follows
- a geostatistical learning problem has the covariate shift property when, for any \(u\in\mathcal{D}_s\) and any \(v \in \mathcal{D}_t\), the distributions \(Pr_u(\mathcal{X},\mathcal{Y})\) and \(Pr_v(\mathcal{X},\mathcal{Y})\) differ by \(Pr_u(\mathcal{X})\neq Pr_v(\mathcal{X})\), while \(Pr_u(\mathcal{Y}|\mathcal{X})=Pr_v(\mathcal{Y}|\mathcal{X})\) for each and every location
- this property is based on the idea that the underlying true function \(f\) that created the process \(\mathcal{Y}=f(\mathcal{X})\) is the same for all \(u\in \mathcal{D}_s\) and all \(v\in\mathcal{D}_t\)
- in this case the function is approximated by the conditional distribution \(Pr_u(\mathcal{Y}|\mathcal{X})=Pr_v(\mathcal{Y}|\mathcal{X})\) for each and every location
- due to the great variability in natural processes, there will be shifts in the distribution: a model that is learned using labels provided by experts on a source spatial domain, and validated with classical train-validation-test methodologies, often performs poorly on a target spatial domain where the labeling function is expected to be the same
spatial correlation
spatial dependence is often ignored
- the closer two locations \(u,v \in \mathcal{D}\) are in a spatial domain, the more similar are their features \(x_u, x_v \in \mathbb{R}^n\) in the feature space
- a tool to quantify this spatial dependence in a collection of samples is the variogram \(\gamma(h)\), which estimates, for each spatial lag \(h = \lVert u-v \rVert \in \mathbb{R}_0^+\), a correlation \(\sigma^2 - \gamma(h)\), where \(\sigma^2\) is the total sill of the samples
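A minimal empirical variogram estimator for scalar coordinates, following the classical averaged-squared-differences (Matheron) form; the bin parameters are illustrative:

```python
import numpy as np

def empirical_variogram(coords, values, nbins=15, maxlag=None):
    """Matheron estimator: gamma(h) is the average of 0.5*(z(u)-z(v))^2
    over all sample pairs whose lag ||u-v|| falls in the bin of h."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    lag = np.abs(coords[:, None] - coords[None, :])       # pairwise lags
    sq = 0.5 * (values[:, None] - values[None, :]) ** 2   # half squared diffs
    maxlag = lag.max() if maxlag is None else maxlag
    edges = np.linspace(0.0, maxlag, nbins + 1)
    hs, gammas = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (lag > lo) & (lag <= hi)
        if mask.any():
            hs.append(0.5 * (lo + hi))
            gammas.append(sq[mask].mean())
    return np.array(hs), np.array(gammas)
```

For an uncorrelated process, \(\gamma(h)\) hovers at the sill \(\sigma^2\) for every lag, so the implied correlation \(\sigma^2-\gamma(h)\) is near zero.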
parallel algorithms for efficient variogram estimation
- can be useful tools for fast diagnosis of the spatial correlation property
definition
- a geostatistical learning problem has the spatial correlation property when the variogram of any of the stochastic processes \((Z_{s_j})_{j=1,2,\dots,n_s}\) and \((Z_{t_j})_{j=1,2,\dots,n_t}\) defined over \(\mathcal{D}_s\) and \(\mathcal{D}_t\) has a non-negligible positive range (or correlation length)
variograms can be used to simulate spatial processes with a theoretical correlation structure
- consider the feature space of two independent spatial processes \(Z_1\) and \(Z_2\), simulated with direct (a.k.a. LU) Gaussian simulation
- as we increase the variogram range \(r\) in a spatial domain \(\mathcal{D}\) with 100x100 pixels, we observe that the distribution of features \(Pr(\mathcal{X})=Pr(Z_1,Z_2)\) is gradually deformed, from a standard Gaussian \((r=0)\) to an increasingly distorted distribution \((r=80)\)
- we illustrate the impact of spatial correlation for an interprocess correlation of \(\rho(Z_1,Z_2)= 0.9\)
- spatial correlations may have a different impact in the source and target domains, and can certainly affect the generalization error of learning models
- we assume that the variogram ranges of the source and target processes are equal, to facilitate the analysis of the results
- in practice, source and target processes may also have different spatial correlation, which is a type of shift that is not considered in classical transfer learning problems
generalization error of learning models
- an importance-weighted approximation of a related generalization error, based on pointwise stationarity assumptions and the use of an efficient importance-weighted cross-validation method for error estimation
- consider a geostatistical learning problem \(\mathcal{P}=\{(\mathcal{D}_s, Pr_{\mathcal{D}_s}, \mathcal{T}_s),(\mathcal{D}_t, Pr_{\mathcal{D}_t}, \mathcal{T}_t)\}\) with a single supervised spatial learning task \(\mathcal{T}_s=\mathcal{T}_t=\mathcal{T}\) (e.g. regression)
- assume that the response features \(\mathcal{Y}_u\) are created by a function \(f\) based on a set of explanatory features \(\mathcal{X}_u\), for each and every location \(u \in \mathcal{D}_s \cup \mathcal{D}_t\)
- our goal is to learn a model \(\{\mathcal{Y}_u\}_{u\in\mathcal{D}_t} \approx \hat{f}(\{\mathcal{X}_u\}_{u\in\mathcal{D}_t})\) over the target domain \(\mathcal{D}_t\) that approximates \(f\) in terms of expected risk, for some spatial supervised loss function \(\mathcal{L}\):
  \(\hat{f}=\arg\min_{g} \mathbb{E}_{Pr_{\mathcal{D}_t}}\left[\mathcal{L}\left(\{\mathcal{Y}_u\}_{u\in\mathcal{D}_t},\, g(\{\mathcal{X}_u\}_{u\in\mathcal{D}_t})\right)\right]\)
- spatial samples of the processes are drawn from the probability distribution of the target domain and rearranged into feature vectors \(x_u\) for every location \(u\) in the target domain
- the spatial loss function compares the spatial map of features from the sample with the map approximated by a candidate model \(g\)
- the model \(\hat{f}\) is the model that minimizes the expected loss, or risk, over the target domain
unlike the classical definition of generalization error
- the definition above for geostatistical learning problems relies on a spatial loss function and on spatial samples, like those produced via geostatistical simulation
- this generalization error is more appropriate for truly spatial learning models \(\hat{f}\) that use multiple locations in the spatial domain to make predictions
- here, however, only pointwise learning is considered, without targeting spatial learning models