SUMMARY
- use and limitations of the GLM for categorical responses
- the general form of the binary logistic regression model, including its formulation as a GzLM and its equivalent forms for modeling odds and probabilities
- interpreting parameter estimates in terms of odds ratios
- the form of the generalized logistic model
- its application to nominal responses and interpretation of parameter estimates
- the form of the cumulative logistic model
- its application to ordinal responses and interpretation of parameter estimates
- the nature of the proportional odds assumption and how to test its appropriateness
WARMUP
Categorical or Nominal
- a categorical (also called nominal) variable is one that has two or more categories without intrinsic ordering to the categories
- a purely nominal variable is one that simply allows you to assign categories, but you cannot clearly order the categories
Ordinal
- an ordinal variable is one that has two or more categories with a clear ordering of the categories
- if the categories are also equally spaced, then the variable would be an interval variable
GLM applied to Categorical Analysis
- when the response is categorical, the assumptions of the GLM would seem to render it inappropriate
- a categorical response does not meet the GLM assumption of a quantitative, normally distributed response
- Modeling Binary Categories with a GLM
- Using GLM Results to Classify Binary Outcomes
- Modeling Multiple Classes with a GLM
Logistic Regression Models
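GLM: \(y = x\beta + \epsilon\). If the response is categorical, use a generalized linear model (GzLM) instead: \(\gamma(y) = x\beta\), where \(\gamma\) is the link function. If \(\gamma(y) = y\) (the identity link), the model has the same linear form as the GLM.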
Classification results are summarized with six metrics: the Kappa score (Kappa), sensitivity (SE), specificity (SP), precision (PR), accuracy (ACC), and F1-score. With TP, TN, FP, and FN denoting the counts of true positives, true negatives, false positives, and false negatives, and \(N\) the total count, the six metrics' formulae are as follows:
\(K_{e} = \frac{(TN+FN)\times(TN+FP)+(TP+FP)\times(TP+FN)}{N \times N}\)
\(K_{o} = \frac{TP+TN}{N}\)
where \(K_{e}\) and \(K_{o}\) are computed from the data already collected. \(K_{o}\) is the observed proportion of agreement, and \(K_{e}\) is the proportion of agreement expected by chance, based on how often each category is assigned.
For \(K_{o}\) and \(K_{e}\), the Kappa score is defined as: \(Kappa = \frac{K_{o}-K_{e}}{1-K_{e}}\)
The sensitivity is defined as: \(SE = \frac{TP}{TP+FN}\)
The specificity is defined as: \(SP = \frac{TN}{TN+FP}\)
Precision is defined as: \(PR = \frac{TP}{TP+FP}\)
Accuracy is defined as: \(ACC = \frac{TP+TN}{TP+TN+FP+FN}\)
F1-score is defined as: \(F1 = \frac{2 \times PR \times SE}{PR+SE}\)
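The six metrics can be checked by hand on a small confusion matrix. The following is a minimal Python sketch, not part of the original notes: the function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the six metrics from 2x2 confusion-matrix counts (hypothetical helper)."""
    n = tp + fp + tn + fn
    k_o = (tp + tn) / n                                    # observed agreement
    k_e = ((tn + fn) * (tn + fp) + (tp + fp) * (tp + fn)) / (n * n)  # chance agreement
    se = tp / (tp + fn)                                    # sensitivity
    sp = tn / (tn + fp)                                    # specificity
    pr = tp / (tp + fp)                                    # precision
    return {
        "kappa": (k_o - k_e) / (1 - k_e),
        "sensitivity": se,
        "specificity": sp,
        "precision": pr,
        "accuracy": k_o,                                   # same ratio as (TP+TN)/N
        "f1": 2 * pr * se / (pr + se),
    }

# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy and \(K_{o}\) are the same ratio, which is why the sketch reuses one value for both.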
Receiver Operating Characteristic
- the true positive rate is also known as sensitivity or probability of detection
- the false positive rate is also known as the probability of false alarm and equals (1 - specificity)
- the ROC is also known as the relative operating characteristic curve
- because it compares two operating characteristics, the true-positive rate and the false-positive rate, as the criterion changes
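To see how the two rates move as the criterion changes, the curve can be traced by sweeping a threshold over predicted scores. This is a minimal Python sketch under stated assumptions: the helper names `roc_points` and `auc`, and the scores and labels, are hypothetical and not taken from the notes.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs as the threshold is lowered."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing classified positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up predicted probabilities and true classes.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
roc = roc_points(scores, labels)
print(roc)
print(auc(roc))
```

Each point is one choice of criterion, so the curve always runs from (0, 0) to (1, 1).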
Logistic Regression for Binary Response
binary outcomes
Bernoulli (p)
\(P(y)=p^{y}(1-p)^{1-y}\) \(y=0,1\)
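Written in exponential-family form, \(P(y) = \exp\{y \log\frac{p}{1-p} + \log(1-p)\}\). In the exponential argument, the function of the parameter that multiplies the variable \(y\) is \(\log\frac{p}{1-p}\). That factor is referred to as the natural parameter, and in this case it is the logit function (the inverse of the logistic function).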
generalized linear model
\(g(Y)=X\beta\)
\(g(p) = log(\frac{p}{1-p}) = X\beta\)
\(\frac{p}{1-p} = e^{X\beta}\), referred to as the odds of success
\(p = \frac{e^{X\beta}}{1+e^{X\beta}}\)
REMEMBER
\(\frac{x^{a}}{x^{b}} = x ^{a-b}\)
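The three equivalent forms (log-odds, odds, probability) can be checked numerically. This is a small Python sketch; the coefficient values and the predictor value are assumptions made up for the example.

```python
import math

def logit(p):
    """Link function: probability -> log-odds."""
    return math.log(p / (1 - p))

def inverse_logit(eta):
    """Inverse link: log-odds -> probability."""
    return math.exp(eta) / (1 + math.exp(eta))

# Hypothetical intercept, slope, and predictor value.
beta0, beta1, x = -1.0, 0.5, 3.0
eta = beta0 + beta1 * x      # linear predictor X*beta (log-odds)
odds = math.exp(eta)         # odds of success
p = inverse_logit(eta)       # probability of success

print(eta, odds, p)
print(logit(p))              # recovers eta up to rounding
```

By the exponent rule above, raising \(x\) by one unit multiplies the odds by \(e^{\beta_{1}}\), which is the odds ratio for that predictor.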
Fitting Binary Logistics Models in SAS
use either `PROC GENMOD` or `PROC LOGISTIC`
- either can be used to fit logistic regression models for binary responses
proc format;
value $type 'Sedan', 'Wagon' = 'Car'
'Truck', 'SUV' = 'Truck';
run;
proc logistic data=cars;
where type not in ('Sports', 'Hybrid');
model type = weight enginesize;
format type $type.;
run;
proc genmod data=cars;
where type not in ('Sports', 'Hybrid');
format type $type.;
model type = weight enginesize / dist=binomial link=logit;
run;
\(\delta_{1}, \delta_{2}, \delta_{3}\) for 3 categories:
\(\log(\frac{p}{1-p}) = \beta_{0} + \beta_{1}\delta_{1} + \beta_{2}\delta_{2} + \beta_{3}\delta_{3}\)
- \(\beta_{0}\) is the overall average
- \(\beta_{1}\delta_{1} + \beta_{2}\delta_{2} + \beta_{3}\delta_{3}\) are the ups and downs around that average
restricted to: \(\hat{\beta}_{1} + \hat{\beta}_{2} + \hat{\beta}_{3} = 0\) (PROC LOGISTIC does this by default)
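The sum-to-zero restriction means only two of the three category effects are free, and the third is implied. A minimal Python sketch of that bookkeeping follows; the function name and coefficient values are hypothetical.

```python
def effect_coded_log_odds(beta0, beta1, beta2):
    """Log-odds for each of 3 categories under the sum-to-zero restriction."""
    beta3 = -(beta1 + beta2)   # implied by beta1 + beta2 + beta3 = 0
    return [beta0 + effect for effect in (beta1, beta2, beta3)]

# Made-up estimates for illustration.
log_odds = effect_coded_log_odds(beta0=0.2, beta1=0.5, beta2=-0.1)
average = sum(log_odds) / len(log_odds)

print(log_odds)   # category-specific log-odds
print(average)    # equals beta0, the overall average
```

Because the effects cancel, the intercept is the unweighted mean of the category log-odds.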
Logistic Regression for Multi-Category Response
\(\log\frac{p_{i}}{p_{k}} = x\beta_{i}\)
- one category is (arbitrarily) chosen as a baseline
- and the log-ratios of the probabilities of the other categories to the baseline are modelled as linear in the predictors
- any other log-ratio can be constructed from this set, irrespective of which category is chosen as the baseline for the response
- models for other probability ratios are differences of the models from the above equation
if we extend the above equation and say \(\beta_{0k} = \beta_{1k} = \beta_{2k} = ... = \beta_{pk} = 0\), then \(p_{i} = \frac{e^{x\beta_{i}}}{\sum_{j=1}^{k} e^{x\beta_{j}}}\)
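The step from baseline log-ratios to category probabilities can be sketched in a few lines of Python. The function name and the linear-predictor values are assumptions for illustration; the baseline category is placed last with its linear predictor fixed at 0.

```python
import math

def baseline_logit_probs(linear_predictors):
    """Category probabilities from x*beta_i of the non-baseline categories.

    The baseline category is appended last with linear predictor 0.
    """
    etas = list(linear_predictors) + [0.0]
    denominator = sum(math.exp(eta) for eta in etas)
    return [math.exp(eta) / denominator for eta in etas]

# Hypothetical log-ratios versus the baseline: log(2) and log(3).
probs = baseline_logit_probs([math.log(2.0), math.log(3.0)])
print(probs)

# Any other log-ratio is a difference of the baseline models.
print(math.log(probs[0] / probs[1]), math.log(2.0) - math.log(3.0))
```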
- if the response is ordinal, we have other choices to make
Cumulative Logit -> \(k\) response categories, \(Y=1,2,3,...,k\) is an ordering/ranking
\(\log(\frac{P(y \leq j)}{P(y > j)}) = x \beta\) for \(j=1,2,...,k-1\)
OR \(\log(\frac{P(y < j)}{P(y \geq j)}) = x \beta\) for the reverse inequalities, \(j=2,3,...,k\)
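Cumulative logits give cumulative probabilities, so individual category probabilities come from differencing. This is a minimal Python sketch using the \(P(y \leq j)\) form; the function name and the logit values are made up for the example.

```python
import math

def cumulative_logit_probs(cumulative_logits):
    """Category probabilities from log(P(y<=j)/P(y>j)), j = 1..k-1.

    The logits must be nondecreasing in j so the differences are nonnegative.
    """
    cumulative = [math.exp(eta) / (1 + math.exp(eta)) for eta in cumulative_logits]
    cumulative.append(1.0)                      # P(y <= k) = 1
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs

# Hypothetical cumulative logits for k = 3 ordered categories.
probs = cumulative_logit_probs([math.log(0.25), math.log(1.0)])
print(probs)
```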
another commonly used link function for the ordinal case is the Adjacent Categories link
- similar to the baseline link
\(\log(\frac{P_{j}}{P_{j+1}}) = x \beta\) for \(j=1,2,...,k-1\)
Fitting Multi-Category Logistic Models in SAS
proc logistic data=cars;
model origin = horsepower weight mpg_city msrp length / link=glogit;
/* the default link for multinomial responses is the cumulative logit,
which is inappropriate for nominal variables;
link=glogit chooses the generalized (baseline) logit link */
ods exclude ModelInfo Nobs ConvergenceStatus GlobalTests FitStatistics;
run;
Using PROC LOGISTIC to model blood pressure status
proc logistic data=heart;
/* takes bp_status as a multinomial case since response has more than two levels; link option requests cumulative logit link */
model bp_status = ageatstart weight / link=logit;
ods select ModelInfo ResponseProfile ParameterEstimates OddsRatio;
run;
for an ordinal response, LINK=ALOGIT can be used instead; it requests the adjacent-category logit link
Most information criteria (like AIC, SBC, …) are of the form model error + penalty
- the penalty is for complexity (more predictors means a larger penalty)
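The error-plus-penalty form is easy to see in the usual definitions: AIC adds 2 per parameter to -2 log L, while SBC adds log(n) per parameter. This is a small Python sketch; the log-likelihood, parameter count, and sample size are made-up values, not results from any model in these notes.

```python
import math

def aic(log_likelihood, n_params):
    """-2 log L (model error) plus 2 per estimated parameter."""
    return -2.0 * log_likelihood + 2.0 * n_params

def sbc(log_likelihood, n_params, n_obs):
    """-2 log L (model error) plus log(n) per estimated parameter."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical fit: log-likelihood -100 with 3 parameters on 200 observations.
aic_value = aic(-100.0, 3)
sbc_value = sbc(-100.0, 3, 200)
print(aic_value, sbc_value)
```

Once n exceeds about 8 observations, log(n) is larger than 2, so SBC penalizes each extra parameter more heavily than AIC and tends to prefer smaller models.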
Title "No Proportional Odds";
proc logistic data=sashelp.heart;
format chol_status $CholReOrder.;
model chol_status = AgeAtStart Weight / link=logit unequalslopes;
ods select FitStatistics;
run;
Title "Proportional Odds for Weight Only";
proc logistic data=sashelp.heart;
format chol_status $CholReOrder.;
model chol_status = AgeAtStart Weight / link=logit unequalslopes=AgeAtStart;
ods select FitStatistics;
run;
Title "Proportional Odds for Age Only";
proc logistic data=sashelp.heart;
format chol_status $CholReOrder.;
model chol_status = AgeAtStart Weight / link=logit unequalslopes=Weight;
ods select FitStatistics;
run;
- The UNEQUALSLOPES option lets the parameter for every predictor differ across all of the cumulative logit models, which gives the same structure of the linear component as in the generalized logit.
- Specific effects can be named with UNEQUALSLOPES=; if more than one effect is desired, they must be listed inside a set of parentheses. With UNEQUALSLOPES=AgeAtStart, the age variable is not presumed to follow proportional odds. It is also possible to specify EQUALSLOPES, but this is for models where proportional odds are not assumed and you want to force this condition.
AIC favors proportional odds for age only (not weight)
SBC favors proportional odds for both predictors (the original model)
proc format;
value poverty
10 - high = 'high poverty'
other = 'low poverty';
run;
proc logistic data=cdi descending;
format poverty poverty.;
class region / param=glm;
model poverty = region ba_bs over65;
lsmeans region / diff adjust=tukey exp cl;
run;
proc logistic data=cdi descending;
format poverty poverty.;
class region / param=glm;
model poverty = region ba_bs over65;
lsmeans region / exp at (ba_bs over65) = (12 10);
run;
- the EXP option translates log-odds and differences of log-odds into odds and odds ratios
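The exponentiation step can be reproduced by hand: a difference of two log-odds becomes an odds ratio, and exponentiating the limits of the interval gives limits for the odds ratio. This is a minimal Python sketch with a plain (unadjusted) Wald interval. The function name, the estimate, and the standard error are hypothetical, and a Tukey-adjusted interval would use a different critical value.

```python
import math

def odds_ratio_with_ci(log_odds_difference, std_err, z=1.959964):
    """Exponentiate a log-odds difference and its Wald limits (z = 1.959964 for 95%)."""
    estimate = math.exp(log_odds_difference)
    lower = math.exp(log_odds_difference - z * std_err)
    upper = math.exp(log_odds_difference + z * std_err)
    return estimate, lower, upper

# Made-up difference between two groups on the log-odds scale.
estimate, lower, upper = odds_ratio_with_ci(math.log(2.0), 0.25)
print(estimate, lower, upper)
```

An interval on the odds-ratio scale that excludes 1 corresponds to a log-odds interval that excludes 0.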