ANOVA and Logistic Regression

Assumptions:

  • independent observations

  • normality of populations -> a sufficiently large sample size makes this assumption less important

  • equal variance for pooled \(t\)

    Independent observations

    Dependent (differences)

    \(t = \frac{\bar{x}_{\delta} - \mu_{\delta}}{s_{\delta}/\sqrt{n}}\)

    F statistic (folded F test for equal variances)

    \(F=\frac{max(s_{1}^{2}, s_{2}^{2})}{min(s_{1}^{2}, s_{2}^{2})}\)
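The folded F can be computed directly from the two sample variances; a minimal sketch in a DATA step (the variance values below are made-up illustrations, not from the course — PROC TTEST also prints this test automatically in its "Equality of Variances" table):

```sas
/* Folded F by hand: larger sample variance over the smaller one. */
/* s1_sq and s2_sq are hypothetical illustrative values.          */
data folded_f;
     s1_sq = 4.2;                                /* variance of group 1 */
     s2_sq = 2.1;                                /* variance of group 2 */
     F = max(s1_sq, s2_sq) / min(s1_sq, s2_sq);  /* always >= 1         */
     put F=;                                     /* writes F to the log */
run;
```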

Practice

Using PROC TTEST to compare groups

  • 30 selected students to receive tutoring
    • 15 received new type of training during tutorials
      • other 15 received standard tutoring
  • 2 students moved away before completing the study
    • scores on a standardized German grammar test were recorded immediately before the 12-week tutorials and again 12 weeks later, at the end of the trial

      • using PROC TTEST, analyze `stat1.german` dataset
        • assess whether the treatment group improved more than the control group
      • do the 2 groups seem to be approximately normally distributed? No; not close enough to normal to justify the pooled t-test
      • does the new teaching technique seem to result in significantly different scores compared with the standard technique? The p-value for the pooled t-test of the difference between the two means shows that the two groups are not statistically significantly different: there is not strong enough evidence to say the new technique differs from the standard one
proc univariate data=stat1.german;
     class group;
     var change;
     qqplot change / normal(mu=est sigma=est);
run;

proc ttest data=stat1.german;
     class group;
     var change;
run;

QUIZ

  • sample from a population should be…

    • representative
  • predictive modeling predicts future values of a response variable

    • based on existing values of predictor variables
      • you assess prediction's accuracy using a holdout or validation dataset
        • and the model usually has many variables and a large sample size
  • the standard error measures the variability associated with the sample mean

    • the variability of \(\bar{x}\) is measured by the standard error, \(s/\sqrt{n}\)
  • for a 95% confidence interval (15.02, 15.04) for the population mean, if the sample mean is 15.03 ounces…

    • 95% confidence level means that 95% of a theoretically infinite number of intervals
      • would contain the true population mean, but 5% would not
        • for any given sample, the calculated confidence interval might or might not
          • contain the value of the true population mean
  • Power is…

    • the probability that you correctly reject the null hypothesis
  • the location and spread of a normal distribution depend on the value of…

    • the mean \(\mu\) and the standard deviation \(\sigma\)
  • a bank manager noticed that the percentage of processed loans containing errors increased above 1%; a significance test is conducted to test his concern: \(H_{0}\): loan error rate \(\leq 0.01\), \(H_{a}\): loan error rate \(> 0.01\)

    • Type I error (rejecting the null hypothesis when it is true)
      • Type II error (failing to reject the null hypothesis when it is false)
  • To reject a null hypothesis using a Student's \(t\) statistic, the \(t\) statistic should be far from zero and have a small corresponding p-value

  • The confidence bounds can be changed using the ALPHA= option in PROC TTEST

    • 99% confidence == ALPHA=0.01
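For example, the 99% limits from the last bullet could be requested like this (a sketch reusing the `stat1.german` dataset and variables from the practice above):

```sas
/* ALPHA=0.01 gives 99% confidence limits in the PROC TTEST output. */
proc ttest data=stat1.german alpha=0.01;
     class group;
     var change;
run;
```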

ANOVA and Regression

GLM: \(Y = X\beta\)

1 quantitative predictor = simple linear regression (bivariate regression)

multiple quantitative predictors = multiple regression

x has 1 categorical predictor = 1-way ANOVA

x has 2 categorical predictors = 2-way ANOVA

mix of categorical and quantitative predictors = analysis of covariance (ANCOVA)

residual = \(y_{i} - \hat{y}_{i}\) SSError = \(\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}\)

  • minimize this for the least-squares

ANOVA: \(y_{ij} = \mu + \tau_{i} + \varepsilon_{ij}\)

  • Check \(y\) for normality in each group
    • \(\varepsilon\) assumed normal
/* 2 sample t-test is also a GLM/ANOVA question */
proc ttest data=german;
     class group;
     var change;
run;
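The same two-group comparison can be recast as a one-way ANOVA; a sketch with PROC GLM on the same dataset (the F test for `group` equals the pooled \(t\) statistic squared):

```sas
/* Two-sample comparison as a GLM/one-way ANOVA. */
proc glm data=german;
     class group;             /* categorical predictor       */
     model change = group;    /* y = mu + tau_i + epsilon_ij */
run;
quit;
```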



"post-hoc" after the intial tests…

Dunnett's Method

  • Lower and Upper decision limits

P( at least 1 Type I error ) \(\leq \alpha\)
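Dunnett's method compares each treatment mean with a control mean while keeping the familywise Type I error rate at or below \(\alpha\); a sketch with hypothetical dataset and variable names (`trial`, `dose`, `response`, control level `'Placebo'`):

```sas
/* Dunnett's adjustment: each dose level vs. the control level.   */
proc glm data=trial;
     class dose;
     model response = dose;
     /* PDIFF=CONTROL names the control level; ADJUST=DUNNETT     */
     /* produces the lower/upper decision limits.                 */
     lsmeans dose / pdiff=control('Placebo') adjust=dunnett;
run;
quit;
```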

Using Correlation to measure relationships between continuous variables

  • By definition:
    • 2 continuous variables
      • are correlated if there is a linear association between them. Remember! It is possible to have strong associations that are… nonlinear in nature!
  • We use correlation statistics
    • which measure the degree, or strength of linear association between two variables

You can compute the Pearson correlation coefficient \(r\)

  • the closer to -1, the stronger the negative linear relationship between 2 variables

  • equal to 0 means no linear relationship exists between 2 variables (uncorrelated)

  • the closer to 1, the stronger the positive linear relationship between 2 variables

  • Let's consider Hypothesis Testing

    • to determine whether the relationship between two variables
      • is statistically different from zero
    • the population parameter that represents a correlation is \(\rho\)
      • and the correlation coefficient \(r\) is the sample statistic
        • that estimates \(\rho\)
    • the null hypothesis for a test of a correlation coefficient is: \(H_{0}:\rho=0\)
    • alternative hypothesis for a test of a correlation coefficient is: \(H_{a}:\rho \neq 0\) rejecting the null hypothesis suggests the true population correlation is…
      • statistically different from zero
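PROC CORR reports both \(r\) and the p-value for \(H_{0}:\rho=0\); a minimal sketch with hypothetical dataset and variable names:

```sas
/* Pearson correlation r and its p-value for H0: rho = 0.        */
/* Dataset and variables are hypothetical illustrations.         */
proc corr data=fitness pearson;
     var oxygen_consumption runtime;
run;
```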

REMEMBER! a p-value does NOT measure the magnitude of that association

  • To determine the strength of the association between 2 variables
    • focus on \(r\) , the sample correlation
      • to see if it is meaningfully large; be cautious about the effect of sample size: very large sample sizes result in small p-values
        • you would almost always reject the hypothesis that \(\rho\) is equal to zero
          • even if it is too small to matter for practical purposes
        • the proper sample size varies across industries

VERY IMPORTANT: CORRELATION DOES NOT IMPLY CAUSATION

Quadrant:                               \(I\)  \(II\)  \(III\)  \(IV\)
\(x_{i}-\bar{x}\):                      +      -       -        +
\(y_{i}-\bar{y}\):                      +      +       -        -
\((x_{i}-\bar{x})(y_{i}-\bar{y})\):     +      -       +        -

Covariance: \(s_{xy} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1}\)
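The sample covariance (with the \(n-1\) divisor) can be printed alongside the correlations via the COV option; a sketch with hypothetical dataset and variable names:

```sas
/* COV adds the sample covariance matrix to the PROC CORR output. */
/* Dataset and variables are hypothetical illustrations.          */
proc corr data=fitness cov;
     var x y;
run;
```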

Simple Linear Regression

\(y = \beta_{0} + \beta_{1}x + \varepsilon\), where \(x\) is quantitative

simple linear regression vs. baseline model

  • the baseline model predicts \(\bar{y}\) without using \(x\); testing the regression against the baseline is testing \(H_{0}:\beta_{1}=0\)

Practice: using PROC REG to fit a simple linear regression model


proc reg data=bodyfat2;
     model pctbodyfat2=weight;
run;
quit;   /* PROC REG is interactive; QUIT ends it */

Quiz

  • You can examine Levene's test for homogeneity to more formally test
    • the assumption of equal variances
  • Dunnett's method is a multiple comparison of each treatment mean against a control
    • the Tukey method compares all possible pairs of means
  • the model sums of squares (SSM), in one-way ANOVA
    • is described as the variability between the groups
  • \(\beta_{1}\) represents the slope parameter
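Levene's test from the quiz above can be requested through the HOVTEST= option on the MEANS statement; a sketch with a hypothetical one-way layout (`experiment`, `group`, `score`):

```sas
/* Levene's test of homogeneity of variances across groups.       */
/* Dataset and variables are hypothetical illustrations.          */
proc glm data=experiment;
     class group;
     model score = group;
     means group / hovtest=levene;   /* formal equal-variance test */
run;
quit;
```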

Two-Way ANOVA

You use a one-way ANOVA to determine

  • whether there are significant differences between the means
    • of two or more populations across different levels of a categorical variable

what if you have two categorical variables each with multiple levels?

  • using one-way ANOVA separately misses the possible interactions between the predictor variables

Two-Way ANOVA analyzes the effect of each predictor individually

  • and tests for interactions between them

    an ANOVA with more than one predictor variable is called n-way ANOVA

    • where \(n\) represents the number of categorical predictor variables
      • or factors included in the model

\(Y_{ijk} = \mu + \alpha_{i} + \beta_{j} + (\alpha\beta)_{ij} + \varepsilon_{ijk}\)

\(Y_{ijk} = \beta_{0} + \alpha_{i} h_{i} + \beta_{j} s_{j} + \delta_{ij} (h_{i}s_{j}) + \varepsilon_{ijk}\), with constraints \(\alpha_{m} = 0\), \(\beta_{n} = 0\), \(\delta_{mj} = 0\)

the two factors can be combined into a single factor with \(m \times n\) levels: \(Y_{ijk} = \beta_{0} + \beta_{ij}(h_{i}s_{j}) + \varepsilon_{ijk}\)
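The two-way model above, with main effects and their interaction, can be sketched in PROC GLM (hypothetical dataset `drugtrial` with factors `h` and `s` and response `y`):

```sas
/* Two-way ANOVA: both main effects plus the h*s interaction.     */
/* Dataset and variables are hypothetical illustrations.          */
proc glm data=drugtrial;
     class h s;
     model y = h s h*s;      /* mu + alpha_i + beta_j + (ab)_ij   */
     lsmeans h*s / slice=h;  /* probe the interaction by slices   */
run;
quit;
```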