- in regression analysis, the distinction between errors and residuals is subtle and important (studentized residuals)
- given an unobservable function that relates the independent variable to the dependent variable (i.e. a line), the deviations of the dependent variable observations from this function are the unobservable errors
- if one runs a regression on some data, then the deviations of the dependent variable observations from the fitted function are the residuals
\(Y = X\beta + \varepsilon\)
Least Squares Est. \(\hat\beta = (X^{T}X)^{-1}X^{T}Y\)
\(Y \sim N(X\beta,\ \sigma^{2}_{\varepsilon}I)\)
\(E[\hat\beta] = (X^{T}X)^{-1}X^{T}X\beta = \beta\)
\(\operatorname{var}(\hat\beta) = (X^{T}X)^{-1}X^{T}(\sigma^{2}_{\varepsilon}I)X(X^{T}X)^{-1} = \sigma^{2}_{\varepsilon}(X^{T}X)^{-1}\)
\(\operatorname{var}(X_{i}\hat\beta) = \sigma^{2}_{\varepsilon}\,X_{i}(X^{T}X)^{-1}X_{i}^{T}\)
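As a minimal numerical check of the least-squares estimator above (synthetic data, made up purely for illustration; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3

# Design matrix: intercept column plus two random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Least-squares estimate: beta_hat = (X^T X)^{-1} X^T y,
# computed via a linear solve rather than an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With noise this small, `beta_hat` lands very close to `beta_true`, consistent with the unbiasedness derivation above.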
to compare residuals at different inputs, one needs to adjust the residuals by the expected variability of residuals
- this is called studentizing
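A sketch of studentizing, using internally studentized residuals \(r_i = e_i / (s\sqrt{1-h_{ii}})\) with the hat-matrix leverages \(h_{ii}\) (synthetic data, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Hat matrix H = X (X^T X)^{-1} X^T; its diagonal h_ii is the
# leverage of each observation (and trace(H) = p).
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Residual variance estimate, then scale each residual by its
# own expected variability: r_i = e_i / (s * sqrt(1 - h_ii)).
s2 = resid @ resid / (n - p)
studentized = resid / np.sqrt(s2 * (1.0 - h))
```

Dividing by \(\sqrt{1-h_{ii}}\) is what makes residuals at different inputs comparable: high-leverage points have smaller raw residual variance.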
Cook's Distance
common estimate of the influence of a data point
- when performing a least-squares regression analysis (e.g. ordinary least squares)
Cook's D can be used to…
- indicate influential data points that are particularly worth checking for validity
- indicate regions of the design space where it would be good to be able to obtain more data points
data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression
\({\displaystyle {\underset {n\times 1}{\mathbf {y} }}={\underset {n\times p}{\mathbf {X} }}\quad {\underset {p\times 1}{\boldsymbol {\beta }}}\quad +\quad {\underset {n\times 1}{\boldsymbol {\varepsilon }}}}\)
Cook's distance \(D_i\) of observation \(i\) (for \(i = 1, \dots, n\))
- is defined as the sum of all the changes in the regression model when observation \(i\) is removed from it
\({\displaystyle D_{i}={\frac {\sum _{j=1}^{n}\left({\widehat {y\,}}_{j}-{\widehat {y\,}}_{j(i)}\right)^{2}}{ps^{2}}}}\)
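The definition above can be computed directly by refitting with each observation left out (a brute-force sketch on synthetic data; numpy assumed — real diagnostics would use the algebraic shortcut or a library such as statsmodels):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def fit(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

beta_hat = fit(X, y)
y_hat = X @ beta_hat
s2 = (y - y_hat) @ (y - y_hat) / (n - p)

# Cook's distance from the definition: refit with observation i
# removed, then compare fitted values at all n design points.
D = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = fit(X[keep], y[keep])
    D[i] = ((y_hat - X @ beta_i) ** 2).sum() / (p * s2)
```

This agrees with the closed form \(D_i = e_i^2 h_{ii} / \bigl(p s^2 (1-h_{ii})^2\bigr)\), which avoids the \(n\) refits.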
Sensitivity Analysis: is the model down-weighted by leverage?
- remove or reweight a point and see how much the model changes
- large changes are indicators that a point might be important (not that it is important)
Collinearity
- it is almost always assumed there is some collinearity among the predictors
- if there is none, so much the better
VIF – Variance Inflation Factor
- computed by regressing each (standardized) predictor on all the others
- picks up on linear dependence among larger sets of variables within the data
- not just the pairwise correlation
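A minimal VIF sketch, using \(\mathrm{VIF}_j = 1/(1 - R_j^2)\) where \(R_j^2\) comes from regressing predictor \(j\) on the others (synthetic data, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # independent predictor
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the rest."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    target = X[:, j]
    fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
    r2 = 1 - ((target - fitted) ** 2).sum() / ((target - target.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(3)]
```

Here the two near-collinear predictors get large VIFs while the independent one stays near 1 — the sort of multi-variable dependence a pairwise correlation table can miss when more than two predictors are entangled.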
Pearson Correlation Coefficients
principal components have the mathematical advantage of being uncorrelated
- the eigenvalues come out of the correlation matrix
- the total variability in the system can be boiled down to the sum of the eigenvalues
- the portion of each eigenvalue relative to the total gives the share of variability explained
- for a correlation matrix, the eigenvalues sum to the number of dimensions, so each dimension contributes one unit of variance on average
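The eigenvalue bookkeeping above in a small sketch (synthetic data with two of three variables driven by a shared factor; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
z = rng.normal(size=n)
# Three variables; the first two share the latent factor z.
X = np.column_stack([z + 0.3 * rng.normal(size=n),
                     z + 0.3 * rng.normal(size=n),
                     rng.normal(size=n)])

# Eigenvalues of the correlation matrix. Their sum equals the
# number of variables, so eigenvalue / total is the proportion
# of total variability captured by each principal component.
corr = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]   # sorted descending
explained = eigvals / eigvals.sum()
```

The first component absorbs the shared variance of the two correlated variables, so its share of the total is well above the 1/3 it would get if all variables were independent.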
Factor Patterns
- scores for every factor of each row vector?
orthogonal rotation preserves the variance and correlation
- oblique rotation does not
perform EDA => check for collinearity => model selection => validate assumptions by plotting residuals => model revision => cross-validation
an influential observation is often an outlier, but not every outlier is influential (influence combines residual size and leverage)