L10 - Estimation

Lecture Goals

Understand fundamentals of statistical estimation
- Sampling distribution
- Point & interval estimates
Apply resampling methods for estimation
- Bootstrap confidence intervals
Readings
- ISRS: ch. 4.5
- ModernDive: ch. 9

Estimation

Interested in estimating value of parameter based on sample
Specific sample gives one out of all possible statistic values

Estimation

In reality, don't know sampling distribution, i.e. how statistic values are dispersed

Can only control two things
- Sampling method: randomness prevents bias
- Sample size: larger \(n\) reduces sampling variability, improving accuracy
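
The effect of sample size can be seen by simulation. The sketch below is illustrative only: it assumes a made-up population with a 6% unemployment proportion (`population`, `sample_props` and all numbers are hypothetical, not LFS data), draws many simple random samples, and compares the spread of the sample proportions for two sample sizes.

```r
set.seed(1)
# Hypothetical population: 1 = unemployed, 0 = employed (about 6% unemployed)
population <- rbinom(20000, size = 1, prob = 0.06)

# Draw `reps` simple random samples of size n and record each sample proportion
sample_props <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, size = n)))
}

props_small <- sample_props(n = 100)
props_large <- sample_props(n = 1000)

# Both sets of statistics centre near the population proportion,
# but the larger sample size gives a tighter spread
mean(population)
c(mean(props_small), mean(props_large))
c(sd(props_small), sd(props_large))
```

In practice only one sample is available, so this spread cannot be observed directly.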

Labour Force Survey Data

Labour Force Survey (LFS): monthly survey providing crucial employment information
- Used to calculate unemployment rates
Access LFS microdata (individual responses) through UofT's CHASS Data Centre
File LFS_Toronto.csv contains 2018 LFS data for Toronto's Census Metropolitan Area (CMA)

library(tidyverse)  # provides read_csv(), %>% and the dplyr verbs used below
lfs = read_csv('./data/LFS_Toronto.csv')

Example

Estimate Toronto unemployment rate: \(\frac{\textrm{# unemployed}}{\textrm{# labour force}}\)
Employment status given by variable lfsstat
- 1, 2: employed (working/on leave)
- 3: unemployed
- 4: not in labour force

lfs %>% summarise( UNEMPL = sum(lfsstat == 3) / sum(lfsstat != 4) )

## # A tibble: 1 x 1
##   UNEMPL
##    <dbl>
## 1 0.0619
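
To make the numerator and denominator explicit, here is the same calculation on a tiny made-up vector of status codes (hypothetical values, not LFS records). Code 4 is excluded from the denominator because those respondents are not in the labour force.

```r
# Hypothetical lfsstat codes for ten respondents
lfsstat <- c(1, 1, 2, 3, 4, 4, 1, 3, 1, 2)

unemployed   <- sum(lfsstat == 3)   # respondents with code 3
labour_force <- sum(lfsstat != 4)   # respondents with codes 1, 2 or 3

unemployed / labour_force
## [1] 0.25
```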

Estimation

Statistic gives single value, a.k.a. point estimate
Point estimates don't convey information about accuracy
- How close to parameter do we expect statistic to be?
- Need information on sampling distribution/variability
Two ways to assess sampling distribution
- Analytical: uses Probability Theory
- Resampling: uses sampling from the sample

Bootstrap

Bootstrap method: resample original simple random sample (SRS), where
- Each resample has same size as original sample
- Each resample is randomly selected with replacement

Calculate statistic for each bootstrap sample (i.e. resample), and treat resulting values as approximate draws from sampling distribution
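
The two rules above can be sketched in base R with `sample(..., replace = TRUE)`. This is a minimal illustration on simulated data: the sample `x`, the helper `boot_stat` and the number of resamples are assumptions chosen for the example, not part of the LFS analysis.

```r
set.seed(2)
# Hypothetical sample of 500 labour-force members: TRUE = unemployed
x <- rbinom(500, size = 1, prob = 0.06) == 1
n <- length(x)

# One bootstrap statistic: resample n values with replacement,
# then compute the proportion unemployed in the resample
boot_stat <- function() mean(sample(x, size = n, replace = TRUE))

# Repeat to build up an approximate sampling distribution
boot_props <- replicate(1000, boot_stat())

mean(x)           # point estimate from the original sample
mean(boot_props)  # bootstrap statistics centre near the point estimate
sd(boot_props)    # their spread approximates the sampling variability
```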

Example

Use infer package to bootstrap data-frames
- specify() selects variable(s)
- generate() resamples data
- calculate() calculates statistic

library(infer)
lfs_boot = lfs %>% filter( lfsstat %in% 1:3) %>%
  mutate( unemployed = (lfsstat == 3) ) %>%
  specify( response = unemployed, success = "TRUE" ) %>%
  generate( reps = 500, type = "bootstrap" ) %>%
  calculate( stat = "prop" ) %>% rename( UNEMPL = stat )
# save() writes R's binary data format, regardless of the .R file extension
save(lfs_boot, file = "./data/lfsboot.R")

Confidence Intervals

Confidence Interval (CI): interval computed from sample by a procedure that captures parameter for specified proportion of all possible samples
Confidence level: proportion of all possible samples whose interval contains parameter
- Controls CI width; typically set at 95%
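
The confidence level describes the procedure over repeated samples, which can be checked by simulation when the parameter is known. The sketch below assumes a hypothetical true proportion `p_true` and builds a 95% percentile bootstrap interval for each of many simulated samples. As a shortcut, resampling `n` binary values with replacement is replaced by the equivalent binomial draw.

```r
set.seed(3)
p_true <- 0.06   # hypothetical parameter, known only because we simulate
n      <- 1000   # size of each sample
n_samp <- 400    # number of independent samples

covers <- replicate(n_samp, {
  p_hat <- rbinom(1, n, p_true) / n          # proportion in one simulated sample
  # Bootstrap proportions: resampling n binary values with replacement
  # from that sample is equivalent to Binomial(n, p_hat) counts
  boot <- rbinom(500, n, p_hat) / n
  ci   <- quantile(boot, c(0.025, 0.975))    # 95% percentile interval
  ci[1] <= p_true && p_true <= ci[2]         # does this interval contain p_true?
})

mean(covers)   # share of intervals containing p_true; expected to be near 0.95
```

Any single interval either contains the parameter or not; the 95% refers to the long-run share of intervals that do.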

Example

95% CI for Toronto unemployment rate

(CI = lfs_boot %>% summarise( lower = quantile(UNEMPL, .025),
                              upper = quantile(UNEMPL, .975)))

## # A tibble: 1 x 2
##    lower  upper
##    <dbl>  <dbl>
## 1 0.0595 0.0645

CIs and Standard Errors

For large samples, many sampling distributions are approximately symmetric with single peak around mean
In such cases, common to construct CI as: \(\textrm{point estimate }\pm\textrm{ margin of error}\)
- Margin of error (CI half-width) reflects estimation accuracy
For 95% confidence level, margin of error is approximately twice the standard error (SE)
- SE estimated by standard deviation of bootstrap statistics, which measures their typical distance from their mean
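
The "twice the SE" rule comes from the normal distribution, assuming the sampling distribution is approximately normal. A short sketch of the multiplier for different confidence levels (`z_mult` is a hypothetical helper name):

```r
# Multiplier z such that the central area under a normal curve
# equals the chosen confidence level
z_mult <- function(level) qnorm(1 - (1 - level) / 2)

z_mult(0.95)   # about 1.96, hence "approximately twice the SE"
z_mult(0.90)   # smaller multiplier, narrower interval
z_mult(0.99)   # larger multiplier, wider interval
```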

Example

95% CI for Toronto unemployment rate

# margin of error
(ME = (CI$upper - CI$lower)/2)
##       97.5% 
## 0.002455526

# standard error
(SE = sd( lfs_boot$UNEMPL ))
## [1] 0.001206671
2*SE
## [1] 0.002413342