Lecture Goals

  • Understand fundamentals of statistical estimation
    • Sampling distribution
    • Point & interval estimates
  • Apply resampling methods for estimation
    • Bootstrap confidence intervals
  • Readings

Estimation

  • Interested in estimating value of parameter based on sample

  • Specific sample gives one out of all possible statistic values

Estimation

  • In reality, don't know sampling distribution, i.e. how statistic values are dispersed
  • Can only control two things
    • Sampling method: randomness prevents bias
    • Sample size: larger \(n\) reduces sampling variability, so estimates are more precise
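
The effect of sample size can be illustrated with a small simulation. This is only a sketch: the population, its 6% rate, and the names `population`, `sample_props`, `props_small` and `props_large` are made up for illustration and are not LFS data.

```r
set.seed(1)
# Hypothetical population: 50,000 labour-force members, about 6% unemployed (made-up value)
population = rbinom(50000, size = 1, prob = 0.06)
p_true = mean(population)

# Draw many simple random samples of size n; record the sample proportion of each
sample_props = function(n, reps = 1000) {
  replicate(reps, mean(sample(population, size = n)))
}

props_small = sample_props(n = 100)
props_large = sample_props(n = 1600)

# Spread of the statistic across samples (smaller for the larger n)
sd_small = sd(props_small)
sd_large = sd(props_large)
c(truth = p_true, sd_n100 = sd_small, sd_n1600 = sd_large)
```

Each element of `props_small` or `props_large` is one possible statistic value; a real study observes only one of them.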

Labour Force Survey Data

  • Labour Force Survey (LFS): monthly survey providing crucial employment information
    • Used to calculate unemployment rates
  • Access LFS microdata (individual responses) through UofT's CHASS Data Centre

  • File LFS_Toronto.csv contains 2018 LFS data for Toronto's Census Metropolitan Area (CMA)

library(tidyverse)  # provides read_csv(), %>% and the dplyr verbs used below
lfs = read_csv('./data/LFS_Toronto.csv')

Example

  • Estimate Toronto unemployment rate: \(\frac{\textrm{# unemployed}}{\textrm{# labour force}}\)
  • Employment status given by variable lfsstat
    • 1, 2: employed (working/on leave)
    • 3: unemployed
    • 4: Not in labour force
lfs %>% summarise( UNEMPL = sum(lfsstat == 3) / sum(lfsstat != 4) )
## # A tibble: 1 x 1
##   UNEMPL
##    <dbl>
## 1 0.0619

Estimation

  • Statistic gives single value, a.k.a. point estimate

  • Point estimates don't convey information about accuracy
    • How close to parameter do we expect statistic to be?
    • Need information on sampling distribution/variability
  • Two ways to assess sampling distribution
    • Analytical: uses Probability Theory
    • Resampling: uses sampling from the sample
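
A rough sketch of the two approaches for a sample proportion, using simulated data rather than the LFS file. The analytical line uses the standard formula \(\sqrt{\hat{p}(1-\hat{p})/n}\); the variable names and the 6% rate are assumptions made for illustration.

```r
set.seed(5)
x = rbinom(2000, size = 1, prob = 0.06)  # hypothetical sample (simulated, 1 = unemployed)
n = length(x)
p_hat = mean(x)

# Analytical: standard error of a sample proportion from probability theory
se_analytical = sqrt(p_hat * (1 - p_hat) / n)

# Resampling: spread of the statistic over resamples drawn from the sample itself
boot = replicate(1000, mean(sample(x, size = n, replace = TRUE)))
se_resampling = sd(boot)

c(analytical = se_analytical, resampling = se_resampling)
```

The two values should be close for a simple statistic like this; resampling is most useful when no convenient formula is available.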

Bootstrap

  • Bootstrap method: resample the original simple random sample (SRS), where
    • Each resample has same size as original sample
    • Each resample is randomly selected with replacement
  • Calculate statistic for each bootstrap sample (i.e. resample), and treat them as values from sampling distribution
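
The steps above can be sketched in base R without any packages. The data are simulated, and `x`, `B` and `boot_props` are illustrative names, not part of the lecture's code.

```r
set.seed(2)
# Hypothetical sample: 1 = unemployed, 0 = employed (simulated, not LFS data)
x = rbinom(2000, size = 1, prob = 0.06)
n = length(x)
B = 500   # number of bootstrap resamples

boot_props = replicate(B, {
  resample = sample(x, size = n, replace = TRUE)  # same size as original, with replacement
  mean(resample)                                  # statistic for this resample
})

# boot_props is treated as a set of values from the sampling distribution
summary(boot_props)
```

Sampling with replacement is what makes the resamples differ from one another; without it, every resample of size `n` would just be the original sample reordered.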

Example

  • Use infer package to bootstrap data-frames
    • specify() selects variable(s)
    • generate() resamples data
    • calculate() calculates statistic
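library(infer)
# keep labour-force members only, flag the unemployed, then bootstrap the proportion
lfs_boot = lfs %>% filter( lfsstat %in% 1:3 ) %>%
  mutate( unemployed = (lfsstat == 3) ) %>%
  specify( response = unemployed, success = "TRUE" ) %>%
  generate( reps = 500, type = "bootstrap" ) %>%   # 500 resamples, with replacement
  calculate( stat = "prop" ) %>% rename( UNEMPL = stat )
# cache the bootstrap proportions (save() writes R's binary data format)
save(lfs_boot, file = "./data/lfsboot.R")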

Confidence Intervals

  • Confidence Interval (CI): interval computed from sample in such a way that, across all possible samples, a specified proportion of the intervals contain the parameter
  • Confidence level: proportion of samples whose interval contains parameter
    • Controls CI width (higher level gives wider interval); typically set at 95%
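
The meaning of the confidence level can be checked by simulation when the parameter is known. The sketch below draws many samples from a made-up population with rate `p_true`, builds a 95% bootstrap percentile interval from each, and counts how often the interval contains `p_true`. All names and values are assumptions for illustration.

```r
set.seed(3)
p_true = 0.06     # hypothetical parameter (made-up value)
n = 1000          # sample size
n_samples = 200   # number of independent samples
B = 200           # bootstrap resamples per sample

covers = replicate(n_samples, {
  x = rbinom(n, size = 1, prob = p_true)                          # one sample
  boot = replicate(B, mean(sample(x, size = n, replace = TRUE)))  # bootstrap statistics
  ci = quantile(boot, c(.025, .975))                              # 95% percentile interval
  ci[1] <= p_true && p_true <= ci[2]                              # contains the parameter?
})

mean(covers)  # proportion of intervals containing p_true; expected to be near 0.95
```

With only 200 samples the observed proportion will fluctuate by a few percentage points around the nominal level.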

Example

  • 95% CI for Toronto unemployment rate
(CI = lfs_boot %>% summarise( lower = quantile(UNEMPL, .025),
                              upper = quantile(UNEMPL, .975)))
## # A tibble: 1 x 2
##    lower  upper
##    <dbl>  <dbl>
## 1 0.0595 0.0645

CIs and Standard Errors

  • Many sampling distributions are approximately symmetric with single peak around mean

  • In such cases, common to construct CI as: \(\textrm{point estimate }\pm\textrm{ margin of error}\)
    • Margin of error (CI half-width) reflects estimation accuracy
  • For 95% confidence level, margin of error is approximately twice the standard error (SE)
    • SE estimated by standard deviation of bootstrap statistics, which measures their typical distance from their mean
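
Written out, under notation assumed here (not from the slides): \(\hat{\theta}\) is the point estimate, \(\hat{\theta}^*_b\) is the statistic from bootstrap resample \(b\), and \(B\) is the number of resamples. The \(B-1\) denominator matches what R's `sd()` computes.

```latex
\[
\bar{\theta}^* = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^*_b ,
\qquad
\mathrm{SE} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^*_b - \bar{\theta}^*\right)^2}
\]
\[
\text{95\% CI} \approx \hat{\theta} \pm 2\,\mathrm{SE}
\]
```

The factor 2 is a rounding of 1.96, the value that leaves 2.5% in each tail of a normal distribution.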

Example

  • Compare margin of error of 95% CI for Toronto unemployment rate with twice the bootstrap SE
# margin of error
(ME = (CI$upper - CI$lower)/2)
##       97.5% 
## 0.002455526

# standard error
(SE = sd( lfs_boot$UNEMPL ))
## [1] 0.001206671
2*SE
## [1] 0.002413342
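
As a sketch of how the two constructions compare as intervals, the code below builds both the SE-based and the percentile interval from the same bootstrap values. It uses simulated data, and the names `est`, `ci_se` and `ci_pct` are illustrative.

```r
set.seed(4)
x = rbinom(5000, size = 1, prob = 0.06)   # hypothetical sample (simulated)
boot = replicate(1000, mean(sample(x, replace = TRUE)))  # bootstrap proportions

est = mean(x)     # point estimate
SE = sd(boot)     # bootstrap standard error
ci_se = c(lower = est - 2*SE, upper = est + 2*SE)   # point estimate +/- margin of error
ci_pct = unname(quantile(boot, c(.025, .975)))      # percentile interval
rbind(se_based = ci_se, percentile = ci_pct)
```

When the bootstrap distribution is roughly symmetric and single-peaked, the two rows should nearly agree; a noticeable difference suggests skewness.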
