\(\mathrm{IC} = f(\mathrm{SSE}) + \mathrm{penalty}\)
-
when fitting models, it is possible to increase the maximized likelihood
-
by adding parameters, but doing so may result in overfitting
-
BIC resolves this problem by introducing a penalty term
- for the number of parameters in the model
-
BIC (also called SBC, SIC, or SBIC, for the Schwarz information criterion) is a criterion for model selection
- among a finite set of models
the BIC is an increasing function of the error variance \(\sigma_e^2\)
- and an increasing function of \(k\)
BIC is merely a heuristic and not a transformed Bayes factor
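The point can be stated precisely (a standard large-sample relationship, added here for context): for two models, the BIC difference only approximates the log Bayes factor,

\[
\ln B_{12} \approx -\tfrac{1}{2}\left(\mathrm{BIC}_1 - \mathrm{BIC}_2\right),
\]

and the approximation error does not vanish for arbitrary fixed priors, which is why BIC values should be read as a heuristic rather than as exact transformed Bayes factors.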
-
BIC is formally defined as \(\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})\)
Where…
- \(\hat{L}\) = the maximized value of the likelihood function of the model \(M\)
-
\(\hat{L} = p(x \mid \hat{\theta}, M)\), where \(\hat{\theta}\) are the parameter values
- that maximize the likelihood function and \(x\) is the observed data
-
\(n\) = the number of data points in \(x\)
- the number of observations, or equivalently, the sample size
- \(k\) = the number of parameters estimated by the model
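As a minimal sketch (the data here are invented for illustration), the definition can be computed directly for an i.i.d. Gaussian model whose two parameters, the mean and variance, are fit by maximum likelihood:

```python
import math

def bic(log_likelihood, k, n):
    """BIC = k*ln(n) - 2*ln(L-hat); lower values indicate a better trade-off."""
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical observed data x
data = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1]
n = len(data)
mu = sum(v for v in data) / n                # ML estimate of the mean
var = sum((v - mu) ** 2 for v in data) / n   # ML estimate of the variance
# Maximized Gaussian log-likelihood ln(L-hat) at (mu, var)
log_lik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
print(bic(log_lik, k=2, n=n))                # k = 2 estimated parameters
```

Comparing candidate models then amounts to computing this value for each and preferring the smallest.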
BIC SUFFERS FROM TWO MAIN LIMITATIONS
- the above approximation is only valid for sample size \(n\) much larger than the number \(k\) of parameters in the model
- the BIC cannot handle complex collections of models, as in the variable selection or feature selection problem in high dimensions
Estimating the dimension of a model
how to choose the appropriate dimensionality of a model
-
that will fit a given set of observations
e.g. the choice of degree for a polynomial regression
- or the choice of order for a multi-step Markov chain
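As a concrete sketch of the polynomial-degree case (the data-generating curve, noise level, and variable names are hypothetical): for Gaussian errors the maximized log-likelihood depends on the data only through the residual sum of squares, so, up to a constant shared by all degrees, the BIC reduces to \(k\ln n + n\ln(\mathrm{SSE}/n)\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 60)
# Hypothetical data from a quadratic with small Gaussian noise
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=x.size)

def bic_for_degree(d):
    coeffs = np.polyfit(x, y, d)            # least-squares fit of degree d
    sse = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    n = x.size
    k = d + 2                               # d+1 coefficients plus the noise variance
    return k * np.log(n) + n * np.log(sse / n)  # BIC up to a shared constant

best = min(range(1, 8), key=bic_for_degree)
print(best)
```

Raising the degree past the true one keeps lowering the SSE slightly, but the \(k\ln n\) penalty dominates, so the criterion stops increasing the dimension.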
-
the maximum likelihood principle invariably leads to choosing the highest possible dimension
Akaike suggests…
-
for the problem of choosing among different models
- with different numbers of parameters
-
the suggestion amounts to maximizing the likelihood function separately
-
for each model \(j\), obtaining say \(M_{j}(X_1,...,X_n)\)
-
and then choosing the model for which \(\log M_j(X_1,...,X_n)-k_j\)
- is the largest, where \(k_j\) is the dimension of the model
Alternative
-
look for an appropriate modification of maximum likelihood for our case
-
by studying asymptotic behavior of Bayes estimators
-
under a special class of priors
-
these priors are not absolutely continuous
-
since they put positive probability on some lower-dimensional subspaces of the parameter space
- subspaces that correspond to the competing models
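Such a prior can be pictured (an illustrative formalization, not spelled out in the notes) as a mixture

\[
\pi = \sum_j \alpha_j \pi_j, \qquad \alpha_j > 0,\quad \sum_j \alpha_j = 1,
\]

where each \(\pi_j\) is a distribution concentrated on the lower-dimensional subspace corresponding to model \(j\); because those subspaces have Lebesgue measure zero, \(\pi\) is not absolutely continuous.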
-
in the large sample limit
-
the leading term of the Bayes estimator turns out to be the maximum likelihood estimator
-
the leading term depends on the prior only through its support
- while the second-order term does reflect singularities of the a priori distribution
Choose the model for which \(\log M_j(X_1,...,X_n)-\frac{1}{2}k_j\log n\) is the largest.
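Taking \(M_j(X_1,...,X_n)\) to denote the maximized likelihood \(\hat{L}_j\) of model \(j\), with natural logarithms throughout, this criterion is exactly \(-\tfrac{1}{2}\,\mathrm{BIC}_j\):

\[
\log M_j(X_1,...,X_n) - \tfrac{1}{2}\,k_j\log n
= -\tfrac{1}{2}\bigl(k_j \log n - 2\log \hat{L}_j\bigr)
= -\tfrac{1}{2}\,\mathrm{BIC}_j ,
\]

so maximizing Schwarz's criterion over \(j\) is equivalent to minimizing the BIC defined earlier.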
-
in the general parameter space there is no intrinsic linear structure
-
observations come from a Koopman-Darmois family
-
relative to some fixed measure on the sample space
- they possess a density of the form \(f(x,\theta)=\exp(\theta \cdot y(x) - b(\theta))\), where \(\theta\) ranges over the natural parameter space \(\Theta\), a convex subset of \(K\)-dimensional Euclidean space, and \(y\) is the \(K\)-dimensional sufficient statistic
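A standard concrete instance (not from the notes above, included for illustration): the unit-variance normal family with mean \(\theta\) is Koopman-Darmois, since

\[
\frac{1}{\sqrt{2\pi}}\,e^{-(x-\theta)^2/2}
= \exp\!\bigl(\theta x - \tfrac{1}{2}\theta^{2}\bigr)\cdot\frac{e^{-x^{2}/2}}{\sqrt{2\pi}},
\]

which has the required form with \(y(x)=x\), \(b(\theta)=\theta^{2}/2\), relative to the fixed measure \((2\pi)^{-1/2}e^{-x^{2}/2}\,dx\).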
-
the competing models are given by sets of the form \(m_j \cap \Theta\)
- where each \(m_j\) is a \(k_j\)-dimensional linear submanifold of the \(K\)-dimensional space