taking the geospatial setting into account in performing statistical learning
geostatistical (transfer) learning problem
Intro
- classical learning theory cannot be applied straightforwardly to solve problems in geosciences, as the characteristics of these problems violate fundamental assumptions
- e.g. the assumptions used to derive estimates of the generalization (or prediction) error of learned models on unseen samples, which are crucial in practice
Leave-one-out (1974), also known as Cross-Validation
- method for assessing and selecting learning models
- based on the idea that to estimate the prediction error on an unseen sample, one only needs to hide a seen sample from the dataset and learn the model
k-fold cross validation (1975)
- a family of error estimation methods that split a dataset into non-overlapping "folds" for model evaluation
- a generalization of leave-one-out
- may introduce bias in the error estimates if the number of samples in the folds used for learning is much smaller than the original number of samples
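A minimal sketch of k-fold cross-validation in plain NumPy (the helper names `kfold_error`, `fit`, `predict` and the toy least-squares model are my own, not from the text):

```python
import numpy as np

def kfold_error(fit, predict, X, y, k=5, seed=0):
    """Estimate prediction error with k-fold cross-validation:
    split the dataset into k non-overlapping folds, hold out each
    fold once for evaluation, and learn the model on the rest."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        params = fit(X[train], y[train])
        errors.append(np.mean((predict(params, X[test]) - y[test]) ** 2))
    return float(np.mean(errors))

# toy model: ordinary least squares
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
```

Setting k equal to the number of samples recovers leave-one-out; as noted above, very small learning folds can bias the estimate.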
assumptions of methods above
- samples come from independent and identically distributed (i.i.d.) random variables
- spatial samples are not i.i.d., and spatial correlation needs to be modeled explicitly (with geostatistics theory)
- even if cross-validation still estimates the prediction error regardless of the i.i.d. assumption, the precision of the estimator can be degraded considerably with non-i.i.d. samples
h-block leave-one-out (1995)
- developed for time-series data
- based on the principle that stationary processes achieve a correlation length `h` after which the samples are no longer correlated
- the time-series data is split such that samples used for error evaluation are at least `h` steps distant from the samples used to learn the model
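The splitting rule can be sketched as follows (1D time indices; `h_block_splits` is an illustrative name, not from the text):

```python
import numpy as np

def h_block_splits(n, h):
    """h-block leave-one-out: for each held-out time index i, drop every
    sample closer than h steps to i from the training set, so training
    samples are at least h steps away from the evaluation sample."""
    for i in range(n):
        train = np.array([j for j in range(n) if abs(j - i) >= h])
        yield train, np.array([i])
```

Each of the n splits evaluates on a single sample while an `h`-wide block around it is excluded from learning.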
Spatial leave-one-out (2014)
- a generalization of h-block leave-one-out from time-series to spatial data, where blocks have multiple dimensions
Block cross-validation (2016)
- similar to k-fold cross-validation; a faster alternative to spatial leave-one-out
- creates folds using blocks of size equal to the spatial correlation length, and separates samples for error evaluation from samples used to learn the model
- introduces the concept of `dead zones`: regions discarded to avoid over-optimistic error estimates
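A 1D sketch of the idea, assuming scalar coordinates (the names, parameters, and the 1D simplification are mine; real implementations use multidimensional blocks):

```python
import numpy as np

def block_folds(coords, block_size, dead_zone):
    """Block cross-validation in 1D: folds are spatial blocks of width
    block_size (ideally the spatial correlation length); training samples
    closer than dead_zone to the test block are discarded ("dead zones")
    to avoid over-optimistic error estimates."""
    coords = np.asarray(coords, dtype=float)
    block_id = np.floor(coords / block_size).astype(int)
    for b in np.unique(block_id):
        test = np.flatnonzero(block_id == b)
        # distance of every sample to its nearest test sample
        dist = np.abs(np.subtract.outer(coords, coords[test])).min(axis=1)
        train = np.flatnonzero((block_id != b) & (dist > dead_zone))
        yield train, test
```

Unlike plain k-fold, nearby samples never end up on both sides of the split, which is what removes the optimism in the error estimate.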
2nd assumption for estimating generalization error in classical learning theory
- the distribution of unseen samples to which the model will be applied is equal to the distribution of samples over which the model was trained
- not very realistic for geosciences, which usually involve many variables with different variability processes
Transfer learning introduces methods more amenable to geosciences
- e.g. the covariate shift problem, where the samples on which the model is applied have a distribution of covariates that differs from the distribution of covariates over which the model was trained
Importance-weighted cross-validation (2007)
- under covariate shift, cross-validation is not unbiased
- importance weights can be considered for each sample to recover the unbiasedness property of the method
- the method is unbiased under covariate shift for supervised learning tasks (regression and classification)
- the importance weights used are ratios between the test/target probability density and the source/train probability density of the covariates
- density ratios are useful in a broader set of applications: two-sample tests, outlier detection, and distribution comparison
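A sketch under the strong assumption that both covariate densities are known Gaussians (in practice the density ratio must itself be estimated with dedicated density-ratio estimators); all helper names are illustrative:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def iwcv_error(X, y, w, fit, predict, k=5, seed=0):
    """Importance-weighted k-fold CV: each held-out squared error is
    multiplied by w(x) = p_target(x) / p_source(x), recovering an
    unbiased estimate of the target-domain error under covariate shift."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    weighted = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        params = fit(X[train], y[train])
        err = (predict(params, X[test]) - y[test]) ** 2
        weighted.extend(w[test] * err)
    return float(np.mean(weighted))

# source covariates ~ N(0,1), target covariates ~ N(1,1); the densities
# are known here only for illustration
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)
w = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 1.0)
y = 2.0 * x + 0.1 * rng.normal(size=200)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
err = iwcv_error(x[:, None], y, w, fit, predict, k=5)
```

Samples that are likely under the target density but rare under the source density receive large weights, so the estimate reflects target-domain performance.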
GEOSTATISTICAL LEARNING
definition
we define the elements of statistical learning in the geospatial setting
- consider a sample space \(\Omega\), a source spatial domain \(\mathcal{D}_{s} \subset \mathbb{R}^{d_{s}}\), and a target spatial domain \(\mathcal{D}_{t}\subset \mathbb{R}^{d_{t}}\), on which stochastic processes (spatial random variables) are defined:
  - \(Z_{s_{j}} : \mathcal{D}_{s}\times\Omega \rightarrow \mathbb{R}\), \(j = 1,2,\dots,n_{s}\) on the source domain
  - \(Z_{t_{j}} : \mathcal{D}_{t}\times\Omega \rightarrow \mathbb{R}\), \(j = 1,2,\dots,n_{t}\) on the target domain
practice example
- \((Z_{s_{j}})_{j=1,2,\dots,n_{s}}\) may represent a collection of processes observed remotely from a satellite on a 2D surface \(\mathcal{D}_{s} \subset \mathbb{R}^{2}\)
- whereas \((Z_{t_{j}})_{j=1,2,\dots,n_{t}}\) may represent a collection of processes occurring within the 3D subsurface of the earth \(\mathcal{D}_{t} \subset \mathbb{R}^3\)
source and target domains
- any process \(Z\) in these collections can be viewed in two distinct ways
- Geostatistical Theory: samples \(z(\cdot,\omega)\) of the process \(Z(u,\omega)\) are obtained by fixing \(\omega \in \Omega\); these samples are spatial maps that assign a real number to each location \(u\in\mathcal{D}\)
- Learning Theory: scalar samples \(z(u, \cdot)\) are obtained by fixing \(u \in \mathcal{D}\); for a collection of processes \((Z_{j})_{j=1,2,\dots,n}\) and a specific location \(u \in \mathcal{D}\), the scalar samples are ordered into a feature vector \(x_{u}=(z_{1},z_2,\dots,z_n)\); in this case \(X_{u} : \Omega \rightarrow \mathbb{R}^{n}\) denotes the corresponding random vector of features, such that \(x_{u} \sim X_{u}\)
Geostatistical Theory
- the joint probability distribution of features is \(Pr(\{X_{u}\}_{u\in\mathcal{D}})\)
- the feature vectors \(X_u\) and \(X_v\) for two different locations \(u \neq v\) are not independent: the closer the locations \(u,v\in\mathcal{D}\) in the spatial domain, the more similar are their features \(x_{u},x_{v}\in\mathbb{R}^{n}\) in the feature space
- given that only one realization \(z^{obs}=z(\cdot,\omega)\sim Z\) of the process is available at any given time, one must introduce stationarity assumptions inside \(\mathcal{D}\) to pool together scalar samples \(z(u,\cdot)\) from different locations \(u\in \mathcal{D}\) and be able to estimate the distribution
- regardless of the stationarity assumptions involved in the modeling, we can assume that inside \(\mathcal{D}\) the probability \(Pr_{\mathcal{D}}(X) = Pr(\{X_u\}_{u\in\mathcal{D}})\) is well defined
practice example
- assume the pointwise probability of features \(Pr_u(X)=Pr(X_u)\) is not a function of location, that is \(Pr_u(X)=Pr(X),\forall u \in \mathcal{D}\)
- under this assumption, samples from everywhere in \(\mathcal{D}\) are used to estimate \(Pr(X)=Pr(Z_1,Z_2,\dots,Z_n)\)
- with the additional assumption that the feature vectors \(X_u\) and \(X_v\) are independent, the joint distribution of features for all locations can be written as \(Pr_{\mathcal{D}}(X)=\prod_{u\in\mathcal{D}}Pr_u(X)\)
the assumption of spatial independence is rarely defensible
- pointwise stationarity often does not transfer from a source domain, where the model is learned, to a target domain, where the model is applied
- consequently, the joint distributions of features differ: \(Pr_{\mathcal{D}_s} \neq Pr_{\mathcal{D}_t}\)
spatial learning tasks
- similar to classical learning tasks, but can leverage properties of the underlying spatial domain
- classically, a learning task describes an action in terms of available features to produce new data
  - e.g. "predict feature \(Z_{j_0}\) from features \((Z_{j_1},Z_{j_2})\)"
  - e.g. "cluster the samples using features \((Z_{j_1},Z_{j_2},Z_{j_3})\)"
- spatially, a learning task \(T\) involves the spatial domain \(\mathcal{D}\) besides the features
  - e.g. Agriculture: the task of identifying crops from satellite images; locations that have the same crop type appear together, despite the presence of noise in the image layers
  - e.g. Mining: the task of segmenting a mineral deposit from drillhole samples using a set of features, assuming the segmentation result is a contiguous volume of rock, which is an additional constraint in terms of spatial coordinates
geostatistical learning definition
- let \(\mathcal{D}_{s}\) be a source spatial domain and \(\mathcal{D}_{t}\) be a target spatial domain
- let \(Pr_{\mathcal{D}_s}(X_s)\) and \(Pr_{\mathcal{D}_t}(X_t)\) be the joint distributions of features for all locations in these domains, and let \(T_s\) and \(T_t\) be two spatial learning tasks
- geostatistical learning consists of learning \(T_t\) over \(\mathcal{D}_t\) using the knowledge acquired while learning \(T_s\) over \(\mathcal{D}_s\), assuming that the observed spatial data in \(\mathcal{D}_s\) and \(\mathcal{D}_t\) are each a single spatial sample of \(Pr_{\mathcal{D}_s}(X_s)\) and \(Pr_{\mathcal{D}_t}(X_t)\), respectively
covariate shift
- assume that the two spatial domains are different, \(\mathcal{D}_{s} \neq \mathcal{D}_{t}\), but that they share a set of processes \((Z_1,Z_2,\dots,Z_n)\); additionally assume that pointwise stationarity holds
- let \(Z_0 = f(Z_1,Z_2,\dots,Z_n)\) be a new process obtained as a function of the shared processes, and assume that it has only been observed in \(\mathcal{D}_s\), via a measuring device and/or manual labeling
- that is, \(z_0^{obs}(\cdot,\omega)\sim Z_0\) is a spatial sample of the process \(Z_0\) over \(\mathcal{D}_s\)
- under these assumptions, \(X_s = X_t = X = (Z_1,Z_2,\dots,Z_n,Z_0)\), and the supervised learning task \(T_s = T_t = T\) of predicting the process \(Z_0\) is the same regardless of the location \(u \in \mathcal{D}_s \cup \mathcal{D}_t\)
- let \(\mathcal{X} = X_{1:n}\) be the explanatory features and \(\mathcal{Y}=X_{n+1}\) be the response feature
- for any \(u\in \mathcal{D}_s\) we can write \(Pr_u(\mathcal{X},\mathcal{Y}) = Pr_u(\mathcal{Y}|\mathcal{X})\,Pr_u(\mathcal{X})\), and likewise for any \(v \in \mathcal{D}_t\), \(Pr_v(\mathcal{X},\mathcal{Y}) = Pr_v(\mathcal{Y}|\mathcal{X})\,Pr_v(\mathcal{X})\)
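A toy construction (mine, not from the text) that exhibits exactly this property: the labeling function is shared, only \(Pr(\mathcal{X})\) changes between domains, and a model learned on the source degrades on the target:

```python
import numpy as np

rng = np.random.default_rng(42)
f = np.sin                             # shared labeling function: Y = f(X)

x_src = rng.normal(0.0, 1.0, 1000)     # source covariates, Pr_u(X)
x_tgt = rng.normal(2.5, 1.0, 1000)     # target covariates, Pr_v(X) != Pr_u(X)
y_src, y_tgt = f(x_src), f(x_tgt)      # Pr(Y|X) identical by construction

# learn a cubic polynomial on source samples only
coef = np.polyfit(x_src, y_src, deg=3)
mse_src = np.mean((np.polyval(coef, x_src) - y_src) ** 2)
mse_tgt = np.mean((np.polyval(coef, x_tgt) - y_tgt) ** 2)
```

The cubic fits well where source covariates are dense, but extrapolates poorly into the region the target covariates occupy, so the target error is much larger than the source error.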
covariate shift defined as follows
- a geostatistical learning problem has the covariate shift property when, for any \(u\in\mathcal{D}_s\) and any \(v \in \mathcal{D}_t\), the distributions \(Pr_u(\mathcal{X},\mathcal{Y})\) and \(Pr_v(\mathcal{X},\mathcal{Y})\) differ by \(Pr_u(\mathcal{X})\neq Pr_v(\mathcal{X})\), while \(Pr_u(\mathcal{Y}|\mathcal{X})=Pr_v(\mathcal{Y}|\mathcal{X})\) for each and every location
- this property is based on the idea that the underlying true function \(f\) that created the process \(\mathcal{Y}=f(\mathcal{X})\) is the same for all \(u\in \mathcal{D}_s\) and all \(v\in\mathcal{D}_t\)
- in this case the function is approximated by the conditional distribution \(Pr_u(\mathcal{Y}|\mathcal{X})=Pr_v(\mathcal{Y}|\mathcal{X})\) for each and every location
- due to the great variability in natural processes, there will be shifts in the distribution: a model that is learned using labels provided by experts on a source spatial domain, and validated with classical train-validation-test methodologies, often performs poorly on a target spatial domain where the labeling function is expected to be the same
spatial correlation
spatial dependence is often ignored
- the closer two locations \(u,v \in \mathcal{D}\) are in a spatial domain, the more similar are their features \(x_u, x_v \in \mathbb{R}^n\) in the feature space
- a tool to quantify this spatial dependence in a collection of samples is the variogram \(\gamma(h)\), which estimates, for each spatial lag \(h = \lVert u-v \rVert \in \mathbb{R}_0^+\), a correlation \(\sigma^2 - \gamma(h)\), where \(\sigma^2\) is the total sill of the samples
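A minimal empirical variogram estimator for scalar coordinates, following the classical averaged-squared-differences (Matheron) form; the bin parameters are illustrative:

```python
import numpy as np

def empirical_variogram(coords, values, nbins=15, maxlag=None):
    """Matheron estimator: gamma(h) is the average of 0.5*(z(u)-z(v))^2
    over all sample pairs whose lag ||u-v|| falls in the bin of h."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    lag = np.abs(coords[:, None] - coords[None, :])       # pairwise lags
    sq = 0.5 * (values[:, None] - values[None, :]) ** 2   # half squared diffs
    maxlag = lag.max() if maxlag is None else maxlag
    edges = np.linspace(0.0, maxlag, nbins + 1)
    hs, gammas = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (lag > lo) & (lag <= hi)
        if mask.any():
            hs.append(0.5 * (lo + hi))
            gammas.append(sq[mask].mean())
    return np.array(hs), np.array(gammas)
```

For an uncorrelated process, \(\gamma(h)\) hovers at the sill \(\sigma^2\) for every lag, so the implied correlation \(\sigma^2-\gamma(h)\) is near zero.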
parallel algorithms for efficient variogram estimation
- can be useful tools for fast diagnosis of the spatial correlation property
definition
- a geostatistical learning problem has the spatial correlation property when the variogram of any of the stochastic processes \((Z_{s_j})_{j=1,2,\dots,n_s}\) and \((Z_{t_j})_{j=1,2,\dots,n_t}\) defined over \(\mathcal{D}_s\) and \(\mathcal{D}_t\) has a non-negligible positive range (or correlation length)
variograms can be used to simulate spatial processes with a theoretical correlation structure
- consider the feature space of two independent spatial processes \(Z_1\) and \(Z_2\), simulated with direct (a.k.a. LU) Gaussian simulation
- as we increase the variogram range \(r\) in a spatial domain \(\mathcal{D}\) with 100x100 pixels, we observe that the distribution of features \(Pr(\mathcal{X})=Pr(Z_1,Z_2)\) is gradually deformed, from a standard Gaussian \((r=0)\) to an increasingly distorted distribution \((r=80)\)
- we illustrate the impact of spatial correlation for an interprocess correlation of \(\rho(Z_1,Z_2)= 0.9\)
- spatial correlations may have a different impact in the source and target domains, and can certainly affect the generalization error of learning models
- we assume that the variogram ranges of the source and target processes are equal, to facilitate the analysis of the results
- in practice, source and target processes may also have different spatial correlation, which is a type of shift that is not considered in classical transfer learning problems
generalization error of learning models
- an importance-weighted approximation of a related generalization error, based on pointwise stationarity assumptions and the use of an efficient importance-weighted cross-validation method for error estimation
- consider a geostatistical learning problem \(\mathcal{P}=\{(\mathcal{D}_s, Pr_{\mathcal{D}_s}, \mathcal{T}_s),(\mathcal{D}_t, Pr_{\mathcal{D}_t}, \mathcal{T}_t)\}\) with a single supervised spatial learning task \(\mathcal{T}_s=\mathcal{T}_t=\mathcal{T}\) (e.g. regression)
- assume that the response features \(\mathcal{Y}_u\) are created by a function \(f\) based on a set of explanatory features \(\mathcal{X}_u\), for each and every location \(u \in \mathcal{D}_s \cup \mathcal{D}_t\)
- our goal is to learn a model \(\{\mathcal{Y}_u\}_{u\in\mathcal{D}_t} \approx \hat{f}(\{\mathcal{X}_u\}_{u\in\mathcal{D}_t})\) over the target domain \(\mathcal{D}_t\) that approximates \(f\) in terms of expected risk, for some spatial supervised loss function \(\mathcal{L}\):
  \(\hat{f}=\arg\min_{g} \mathbb{E}_{Pr_{\mathcal{D}_t}}\left[\mathcal{L}\left(\{\mathcal{Y}_u\}_{u\in\mathcal{D}_t},\, g(\{\mathcal{X}_u\}_{u\in\mathcal{D}_t})\right)\right]\)
- spatial samples of the processes are drawn from the probability distribution of the target domain and rearranged into feature vectors \(x_u\) for every location \(u\) in the target domain
- the spatial loss function compares the spatial map of features from the sample with the map approximated by a candidate model \(g\)
- the model \(\hat{f}\) is the model that minimizes the expected loss, or risk, over the target domain
unlike the classical definition of generalization error
- the definition above for geostatistical learning problems relies on a spatial loss function and on spatial samples, like those produced via geostatistical simulation
- this generalization error is more appropriate for truly spatial learning models \(\hat{f}\) that use multiple locations in the spatial domain to make predictions
- here, however, only pointwise learning is considered, without targeting spatial learning models