Lecture Goals

  • Perform comparisons of multiple groups
    • Compare means/proportions
  • Examine notion of variable (in)dependence
    • Test for independence of categorical variables
  • Readings
    • ISRS: ch. 3.4, 4.4

Comparing Several Means

  • LFS data: are average earnings the same for different levels of education?

Education Levels

educ value   description
0            0 to 8 years
1            Some secondary
2            Gr 11 to 13
3            Some post secondary
4            Post secondary certificate or diploma
5            University: bachelors degree
6            University: graduate degree

Comparing Multiple Means

  • Test equality of \(m\) group means: \(H_0: \mu_1 = \mu_2 = \cdots = \mu_m\)

  • Could test all pairs (\(\mu_1=\mu_2\), \(\mu_2=\mu_3\), etc.) and reject \(H_0\) if any one is rejected, but running many tests inflates the overall Type I error rate
  • Instead, test for equality of all means simultaneously
    • Use ANalysis Of VAriance (ANOVA)
    • Test statistic based on inter/intra-group variance
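
To see why testing every pair separately is risky, here is a rough back-of-the-envelope sketch. It assumes (unrealistically) that the pairwise tests are independent and that each is run at level 0.05; the 7 groups match the education levels in the LFS example.

```r
# Assumption: independent pairwise tests, each at significance level alpha
m       <- 7                # number of education levels
alpha   <- 0.05
n_pairs <- choose(m, 2)     # number of distinct pairs of groups

# Chance of at least one false rejection when H0 is true for every pair
fwer <- 1 - (1 - alpha)^n_pairs
n_pairs          # 21
round(fwer, 2)   # 0.66
```

Under these assumptions, roughly two in three studies would reject \(H_0\) even when all means are equal, which motivates a single simultaneous test.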

ANOVA

  • Idea: compare variance (i.e. average squared distance) of observations from the common mean to that from the individual group means
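
The idea can be sketched in base R by computing the classical F statistic by hand. The data below are simulated and purely hypothetical (three groups of 20 observations); the names `ss_between`, `ss_within` and `f_stat` are chosen for illustration.

```r
set.seed(1)
# Hypothetical data: 3 groups of 20 observations with different means
g <- factor(rep(c("a", "b", "c"), each = 20))
y <- rnorm(60, mean = c(10, 12, 11)[as.integer(g)])

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
n_j         <- tapply(y, g, length)
m <- nlevels(g)   # number of groups
n <- length(y)    # total sample size

# Between-group variability: group means around the common mean
ss_between <- sum(n_j * (group_means - grand_mean)^2)
# Within-group variability: observations around their own group mean
ss_within  <- sum((y - group_means[as.integer(g)])^2)

# Ratio of between-group to within-group variance
f_stat <- (ss_between / (m - 1)) / (ss_within / (n - m))
f_stat

# Cross-check against base R's one-way ANOVA table
anova(lm(y ~ g))[["F value"]][1]
```

A large ratio means the group means are spread out more than the within-group noise would suggest, which is evidence against \(H_0\).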

Permutation Test for ANOVA

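library(dplyr)  # provides %>% and mutate()
library(coin)
# Kruskal-Wallis (rank-based) test of hrlyearn across educ levels;
# the null distribution is approximated by random permutations
lfs %>% 
  mutate( educ = factor(educ) ) %>% 
  kruskal_test( hrlyearn ~ educ , data = ., 
                distribution = "approx" )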
## 
##  Approximative Kruskal-Wallis Test
## 
## data:  hrlyearn by educ (0, 1, 2, 3, 4, 5, 6)
## chi-squared = 5350.5, p-value < 2.2e-16
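
The coin call above uses the rank-based Kruskal-Wallis statistic. As a rough illustration of the permutation mechanics themselves, the following base-R sketch permutes the classical F statistic instead. The data are simulated and hypothetical, and `f_stat()` is a helper defined here, not a library function.

```r
set.seed(2)
# Hypothetical data standing in for earnings in 3 groups
g <- factor(rep(1:3, each = 15))
y <- rnorm(45, mean = c(20, 20, 23)[as.integer(g)], sd = 3)

# Helper: F statistic from a one-way ANOVA
f_stat <- function(y, g) anova(lm(y ~ g))[["F value"]][1]

f_obs <- f_stat(y, g)

# Shuffling y breaks any link between outcome and group, mimicking H0
f_perm <- replicate(999, f_stat(sample(y), g))

# p-value: share of shuffled statistics at least as large as the observed one
p_value <- (1 + sum(f_perm >= f_obs)) / (1 + length(f_perm))
p_value
```

Adding 1 to numerator and denominator counts the observed arrangement as one of the permutations, so the estimated p-value is never exactly zero.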

Comparing Multiple Proportions

  • Is unemployment rate the same for different levels of education?

  • Contingency table contains frequencies of combinations of two categorical variables

lfs %>% 
  filter(lfsstat != 4 ) %>% 
  mutate( educ = factor(educ), empl = factor(lfsstat != 3)) %>% 
  xtabs( ~ empl + educ, data = .) 
##        educ
## empl        0     1     2     3     4     5     6
##   FALSE    33   251   502   200   574   683   296
##   TRUE    459  1816  6122  2237 10500 11457  5905

Stacked Barplot

lfs %>% filter(lfsstat != 4 ) %>% 
  mutate( educ = factor(educ), empl = (lfsstat != 3)) %>% 
  ggplot( aes(x = educ, fill = empl)) + geom_bar()

Contingency Tables

  • prop.table() for relative proportions
  • addmargins() for table totals
  • Apply across rows or columns with margin = 1 (rows) or margin = 2 (columns)
lfs %>% filter(lfsstat != 4 ) %>% 
  mutate( educ = factor(educ), empl = factor(lfsstat != 3)) %>% 
  xtabs( ~ empl + educ, data = .) %>% 
  prop.table( margin = 2 ) %>% 
  addmargins( 1 ) %>%  round(4)
##        educ
## empl         0      1      2      3      4      5      6
##   FALSE 0.0671 0.1214 0.0758 0.0821 0.0518 0.0563 0.0477
##   TRUE  0.9329 0.8786 0.9242 0.9179 0.9482 0.9437 0.9523
##   Sum   1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000

Normalized Barplot

lfs %>% filter(lfsstat != 4 ) %>% 
  mutate( educ = factor(educ), empl = (lfsstat != 3)) %>% 
  ggplot( aes(x = educ, fill = empl)) + 
  geom_bar(position = "fill")

Mosaic Plot

library(ggmosaic)
lfs %>% filter(lfsstat != 4 ) %>% 
  mutate( educ = factor(educ), empl = (lfsstat != 3)) %>% 
  ggplot() + geom_mosaic(aes(x = product(educ), fill = empl)) + xlab("educ")

Comparing Several Proportions

  • Test equality of \(m\) group proportions: \(H_0: p_1 = p_2 = \cdots = p_m\)

  • Under \(H_0\), it does not matter which group you are in
    • Proportions are independent of group
  • Equivalent to testing independence of two categorical variables (outcome and group)

Independence Test - Categorical Variables

  • Chi-square (\(\chi^2\)) test statistic measures the distance between observed contingency table counts and the counts expected under independence
    • Sampling distribution under \(H_0\) obtained by simulation (sample observations according to independent proportions)
library(coin)
lfs %>% filter(lfsstat != 4 ) %>% 
  mutate( educ = factor(educ), empl = factor(lfsstat != 3)) %>% 
  chisq_test( empl ~ educ, data= ., distribution = "approx")
## 
##  Approximative Pearson Chi-Squared Test
## 
## data:  empl by educ (0, 1, 2, 3, 4, 5, 6)
## chi-squared = 212.93, p-value < 2.2e-16
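
To make the statistic concrete, this sketch recomputes it by hand in base R from the empl-by-educ counts in the contingency table shown earlier. The object names (`obs`, `expected`, `chi_sq`) are illustrative, and base R's `chisq.test()` is used only as a cross-check; it is not the coin routine used above.

```r
# Counts copied from the empl-by-educ contingency table
obs <- rbind(
  "FALSE" = c(33, 251, 502, 200, 574, 683, 296),
  "TRUE"  = c(459, 1816, 6122, 2237, 10500, 11457, 5905)
)
colnames(obs) <- 0:6

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-square: squared observed-minus-expected gaps, scaled by expected
chi_sq <- sum((obs - expected)^2 / expected)
round(chi_sq, 2)

# Cross-check against base R's Pearson test
unname(chisq.test(obs, correct = FALSE)$statistic)

# Base R can also simulate the null distribution (tables with fixed margins)
chisq.test(obs, simulate.p.value = TRUE, B = 2000)$p.value
```

The hand-computed value should land close to the 212.93 reported by `chisq_test()`.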

Example

  • Chi-square test can be extended to multi-category variables
    • E.g. Are education and marital status related?
lfs %>% 
  mutate( educ = factor(educ), marstat = factor(marstat)) %>% 
  chisq_test( marstat ~ educ, data= ., distribution = "approx")
## 
##  Approximative Pearson Chi-Squared Test
## 
## data:  marstat by educ (0, 1, 2, 3, 4, 5, 6)
## chi-squared = 10696, p-value < 2.2e-16

Education vs Marital Status

General Independence Test

  • coin::independence_test() provides general framework for comparing two or more groups
    • Formula-based argument Y ~ X
    • Compare Y values across levels of factor X
  • \(Y\) variable type determines quantity to compare
    • Means for numeric \(Y\)
    • Proportions for factor \(Y\)
  • Can use independence_test() for tests of equality of 2+ means/proportions
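
A minimal sketch of both uses, assuming the coin package is installed. The data frame `d` and its columns `x`, `y` and `z` are simulated and hypothetical; they are not part of the LFS data.

```r
library(coin)
set.seed(3)

# Hypothetical data: 3-level group x, numeric outcome y, binary outcome z
d <- data.frame(x = factor(rep(c("a", "b", "c"), each = 30)))
d$y <- rnorm(90, mean = c(10, 10, 12)[as.integer(d$x)])
d$z <- factor(rbinom(90, 1, prob = c(0.2, 0.2, 0.5)[as.integer(d$x)]))

# Numeric Y: compares means of y across levels of x
t_means <- independence_test(y ~ x, data = d, distribution = "approx")

# Factor Y: compares proportions of z across levels of x
t_props <- independence_test(z ~ x, data = d, distribution = "approx")

pvalue(t_means)
pvalue(t_props)
```

The same formula interface is used in both calls; only the type of the left-hand variable changes what is being compared.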