Model Selection

for a GLM, choosing the best model amounts to choosing the best set of predictors

predictors may come directly from collected data
- or may be constructed from them

we presume all candidate predictors are assembled with the data at the outset
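
As a minimal sketch of constructing predictors from collected ones, assuming the RealEstate data used below and a strictly positive `sq_ft`: the data set name `realestate_aug` and the new variable names are hypothetical, chosen only for illustration.

```sas
/* hypothetical derived predictors; names are illustrative only */
data realestate_aug;
    set realestate;
    log_sq_ft = log(sq_ft);   /* log transform, assumes sq_ft > 0 */
    sq_ft_sq  = sq_ft**2;     /* quadratic term */
run;
```

New variables are appended after the existing ones, so a positional list such as `sq_ft--highway` would not pick them up; they would need to be named explicitly in the MODEL statement.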

Practice

For RealEstate data:

using any reasonable, potential predictor, find a "best" model for price
- use GLMSELECT to consider all criteria (stepwise)
- compare best subsets in REG on \(R_{adj}^2\), AIC, BIC, SBC

proc glmselect data=realestate;
     class ac highway quality; /*categorical vars*/
     model price =   sq_ft--highway /
        selection=stepwise(select=sl choose=cv) stats=(AIC AICC BIC SBC);
run;

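proc reg data=realestate;
     /* REG has no CLASS statement: categorical predictors must already be numerically coded */
     model price = sq_ft--highway /
        selection=adjrsq aic bic sbc;
     ods output subsetSelSummary=subsets;
run;
quit;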
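
To compare the candidate subsets on one criterion, the ODS output data set can be sorted and the top rows printed. This is a sketch only: it assumes the `subsets` data set stores the Schwarz criterion in a column named `SBC`, which should be confirmed with PROC CONTENTS before sorting. The name `subsets_by_sbc` is hypothetical.

```sas
/* check the actual column names first; SBC below is an assumption */
proc contents data=subsets varnum;
run;

proc sort data=subsets out=subsets_by_sbc;
    by SBC;                 /* smaller SBC is better */
run;

proc print data=subsets_by_sbc(obs=5);
run;
```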

In HPGENSELECT, use quality as the response and find the best model

proc hpgenselect data=realestate;
     model quality = price--year lot highway / dist=multinomial
                                                      link = logit;
     selection method=stepwise(slentry=0.2 slstay=0.2 choose=SBC);
run;
proc hpgenselect data=realestate;
     where quality in (1,2);
     model quality = price--year lot highway / dist=multinomial
                                                      link = logit;
     selection method=stepwise(slentry=0.2 slstay=0.2 choose=SBC);
run;
proc hpgenselect data=realestate;
     where quality in (2,3);
     model quality = price--year lot highway / dist=multinomial
                                                      link = logit;
     selection method=stepwise(slentry=0.2 slstay=0.2 choose=SBC);
run;
/** the last two runs split the response into adjacent category pairs (1 vs 2, 2 vs 3) **/


/** multicategory logit is often avoided during predictor selection **/

Wald Chi-Squared

classic approach to hypothesis testing

the Wald test has the advantage of requiring estimation of only the unrestricted model
- lowering the computational burden
a disadvantage is that it is not invariant to changes in the representation of the null hypothesis
the Wald test assesses constraints on statistical parameters based on the weighted distance between the unrestricted estimate and its hypothesized value under \(H_0\)
- where the weight is the precision of the estimate (the inverse of its variance)
  
  the larger the weighted distance, the less likely it is that the constraint is true
it has an asymptotic \(\chi^2\) distribution under \(H_0\)
- a fact that can be used to determine statistical significance
  - test on a single parameter
    
    \(W = \frac{(\hat\theta - \theta_0)^2}{var(\hat\theta)}\)
    
    the square root of the single-restriction Wald statistic can be understood as a \(t\)-ratio
    - however, it is not \(t\)-distributed except in the special case
      - of linear regression with normally distributed errors
        
        in general it follows an asymptotic \(z\)-distribution
        
        \(\sqrt{W} = \frac{\hat\theta - \theta_0}{se(\hat\theta)}\)
        
        where \(se(\hat\theta)\) is the standard error of the maximum likelihood estimate, the square root of the variance
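
The single-parameter test can be sketched numerically in a DATA step. The estimate and standard error below are made-up values for illustration, not results from the RealEstate data; the data set and variable names are likewise hypothetical.

```sas
/* hypothetical numbers, for illustration only */
data wald_single;
    theta_hat = 0.80;   /* assumed ML estimate */
    theta_0   = 0;      /* hypothesized value under H0 */
    se        = 0.32;   /* assumed standard error of theta_hat */

    W = ((theta_hat - theta_0) / se)**2;   /* Wald statistic, 1 df */
    z = (theta_hat - theta_0) / se;        /* signed square root of W */

    p_chisq = 1 - probchi(W, 1);           /* upper-tail chi-square p-value */
    p_z     = 2 * (1 - probnorm(abs(z)));  /* two-sided normal p-value */
run;

proc print data=wald_single noobs;
run;
```

The two p-values coincide, because the square of a standard normal variable has a \(\chi^2\) distribution with 1 degree of freedom.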

test(s) on multiple parameters
- can jointly test multiple hypotheses on single/multiple parameters
  - Let \(\hat\theta_n\) be our sample estimator of P parameters
    - \(\hat\theta_n\) is a \(P \times 1\) vector
- a test of Q hypotheses on the P parameters is expressed with a \(Q \times P\) matrix R and a \(Q \times 1\) vector r
  
  \(H_0 : R\theta = r\) vs. \(H_1 : R\theta \neq r\)
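
For reference, the standard form of the joint test statistic for these hypotheses is sketched below. It assumes R has full row rank Q, and it writes \(\widehat{\operatorname{var}}(\hat\theta_n)\) for the estimated \(P \times P\) covariance matrix of \(\hat\theta_n\), a notation not used elsewhere in these notes.

```latex
W = (R\hat\theta_n - r)^{\top}
    \left[ R \, \widehat{\operatorname{var}}(\hat\theta_n) \, R^{\top} \right]^{-1}
    (R\hat\theta_n - r)
    \;\xrightarrow{d}\; \chi^2_Q
    \quad \text{under } H_0

% Special case Q = 1: R is a row vector selecting one parameter and r = \theta_0,
% which recovers the single-parameter statistic
W = \frac{(\hat\theta - \theta_0)^2}{\operatorname{var}(\hat\theta)}
```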

Logit

in mathematics, the logit (logistic unit) function is the inverse of the standard logistic (sigmoid) function

\(\operatorname{logit}(p)=\log\left(\frac{p}{1-p}\right)\), for \(0 < p < 1\)
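
A short DATA step can illustrate the inverse relationship: applying the sigmoid \(1/(1+e^{-x})\) to the logit of p should return p. The probabilities and the data set name `logit_demo` are arbitrary choices for this sketch.

```sas
data logit_demo;
    do p = 0.1, 0.25, 0.5, 0.75, 0.9;
        x      = log(p / (1 - p));    /* logit: log-odds of p */
        p_back = 1 / (1 + exp(-x));   /* sigmoid applied to the logit */
        output;
    end;
run;

proc print data=logit_demo noobs;
run;
```

Up to floating-point rounding, `p_back` matches `p` in every row.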