Lecture Goals

  • Understand fundamental concepts of statistical inference
    • Population vs Sample
    • Parameter vs Statistic
    • Sampling Variability & Bias
  • Apply basic sampling strategies
    • Simple and Stratified Random Sampling
  • Readings

Why Statistics?

  • Goal: answer questions about a population
    • Population: collection of all objects of interest (actual or notional)
    • E.g. What is average age of people living in Canada?
  • Precise answer for entire population requires census
    • E.g. Ask everyone living in Canada (cost ~$700 million!)
    • Typically either impracticable or impossible
  • Alternatively sample the population

Sampling

  • Sample is subset of population that helps answer question "approximately"

  • Answer quality depends critically on how sample is collected

Parameters & Statistics

  • Answers typically expressed as statements about summary measure of interest
    • Parameter: summary of entire population
    • (Sample) Statistic: summary of sample

Notation

  • Continuous Variables
    Description      Parameter        Statistic
    Mean/Avg         \(\mu\)          \(\hat{\mu}\) or \(\bar{X}\)
    Std Deviation    \(\sigma\)       \(S\)
    Median           \(\mu_{1/2}\)    \(M\)
  • Discrete Variables
    Description      Parameter        Statistic
    Proportion       \(p\)            \(\hat{p}\)
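
For reference, the sample statistics listed above are commonly computed from observations \(X_1, \dots, X_n\) as shown here. The \(n-1\) divisor in \(S\) is the usual convention and is an assumption, since the slides do not state which divisor is used.

\[
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2}, \qquad
\hat{p} = \frac{\#\{i : X_i \text{ has the attribute of interest}\}}{n}
\]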

Example

  • Dinesafe data: interested in average number of inspections per establishment
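library(dplyr)  # provides %>%, group_by(), distinct(), summarise()
# One row per establishment: count its distinct inspections
# (assumes the `dinesafe` data frame is already loaded)
pop <- dinesafe %>%
  group_by(ESTABLISHMENT_ID) %>%
  distinct(INSPECTION_ID) %>%
  summarise( N_INSPECTIONS = n() )
# Parameter: mean number of inspections over the whole population
pop %>% summarise( mean(N_INSPECTIONS) )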
## # A tibble: 1 x 1
##   `mean(N_INSPECTIONS)`
##                   <dbl>
## 1                  3.41
# Statistic: the same mean, computed from a random sample of 100 establishments
pop %>% sample_n(100) %>% summarise( mean(N_INSPECTIONS) )
## # A tibble: 1 x 1
##   `mean(N_INSPECTIONS)`
##                   <dbl>
## 1                  3.38

Variability

  • Statistic value varies with different samples
    • Sampling variability is extent to which statistics diverge from their mean
  • Sampling variability can be controlled by the sample size (\(n\))
    • Larger \(n\) \(\rightarrow\) lower variability (higher precision)

Example

  • Distribution of statistics (avg # of inspections) of different sample sizes
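
The effect of sample size on sampling variability can be illustrated with a small simulation. This is a minimal sketch in base R, assuming a synthetic population of inspection counts (the Poisson counts, the seed, and the helper `sample_means` are hypothetical and are not the Dinesafe data).

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: synthetic inspection counts, NOT the Dinesafe data
population <- rpois(10000, lambda = 3) + 1

# Draw `reps` simple random samples of size n and return their means
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(population, n)))
}

# Standard deviation of the sample mean for increasing sample sizes
sizes <- c(10, 100, 1000)
spread <- sapply(sizes, function(n) sd(sample_means(n)))
names(spread) <- sizes
print(round(spread, 3))
```

The printed spreads should shrink as \(n\) grows, which is the pattern the distribution plot is meant to show.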

Bias

  • Statistic changes with different samples, so how do we pick our sample?

  • Regularities in sampling can lead to bias, i.e. systematic deviation of statistic from parameter
    • E.g. collect data by email only
  • To avoid selection bias & improve representativeness, most sampling methods involve randomness

Simple Random Sampling

  • Sampling frame, i.e. list of available objects for sampling
    • Ideally covers entire population
  • Simple Random Sample (SRS): objects drawn at random so that every object (and every possible sample of size \(n\)) has equal probability of selection
    • Avoids bias when no other information is available
    • Average of all possible SRS averages equals the population average (i.e. the sample mean is unbiased)
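
The last point can be checked by brute force on a toy population. The sketch below assumes a hypothetical population of five values and enumerates every possible sample of size 2 (the values and variable names are illustrative only).

```r
# Hypothetical toy population of five values (illustrative only)
population <- c(2, 3, 5, 7, 13)

# Enumerate every possible sample of size 2 (without replacement);
# under SRS each of these samples is equally likely
all_samples <- combn(population, 2)  # one column per sample
srs_means <- colMeans(all_samples)

pop_mean <- mean(population)   # population average (the parameter)
avg_of_means <- mean(srs_means)  # average of all SRS averages

print(c(parameter = pop_mean, average_of_srs_means = avg_of_means))
```

Both numbers agree, even though individual sample means differ from the parameter.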

Example

  • Distribution of statistics from different sampling frames (SRS, \(n=100\))

Potential Problems

  • Randomness alone is not enough for representativeness
    • Must pay attention to sampling details

Two common sources of bias are:

  • Participation or Non-response bias: respondents are not representative of entire population

  • Coverage bias: sampling frame does not align well with population
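
Coverage bias can be illustrated by comparing SRS from a complete frame with SRS from an incomplete one. This is a minimal sketch in base R under loud assumptions: the population is synthetic, and the rule that establishments with more inspections are more likely to appear in the frame is invented purely for illustration.

```r
set.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical population: synthetic inspection counts, NOT the Dinesafe data
N <- 10000
n_inspections <- rpois(N, lambda = 3) + 1

# Assumption: establishments with more inspections are more likely to be listed
in_frame <- runif(N) < pmin(1, n_inspections / 6)

full_frame <- n_inspections               # frame covers the entire population
partial_frame <- n_inspections[in_frame]  # frame misses part of the population

# Means of `reps` simple random samples of size n drawn from a given frame
srs_means <- function(frame, n = 100, reps = 2000) {
  replicate(reps, mean(sample(frame, n)))
}

result <- c(
  parameter = mean(n_inspections),
  full_frame = mean(srs_means(full_frame)),
  partial_frame = mean(srs_means(partial_frame))
)
print(round(result, 2))
```

Random sampling within the partial frame is still random, yet its averages centre away from the parameter, because the frame itself does not align with the population.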

Stratified Sampling

  • Population often divided into groups, called strata

  • Stratified Sampling combines SRS from every stratum to ensure representation
    • Sample sizes proportional to strata sizes

Example

## # A tibble: 3 x 3
##   MINIMUM_INSPECTIONS_PERYEAR `mean(N_INSPECTIONS)` `n()`
##                         <int>                 <dbl> <int>
## 1                           1                  1.66  3887
## 2                           2                  3.45  8775
## 3                           3                  5.21  3629
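
Proportional allocation can be sketched as follows, using the three strata sizes from the output above. Everything else is an assumption for illustration: the inspection counts are synthetic Poisson draws, and the names `strata_sizes`, `alloc`, and `strat_sample` are hypothetical.

```r
set.seed(3)  # fixed seed so the sketch is reproducible

# Strata sizes taken from the table above; the counts themselves are synthetic
strata_sizes <- c(`1` = 3887, `2` = 8775, `3` = 3629)
pop <- data.frame(
  stratum = rep(names(strata_sizes), times = strata_sizes),
  n_inspections = rpois(sum(strata_sizes),
                        lambda = rep(c(1, 3, 5), times = strata_sizes))
)

# Proportional allocation: each stratum's share of the sample matches its
# share of the population (rounding need not sum to n in general; here it does)
n <- 100
alloc <- round(n * strata_sizes / sum(strata_sizes))

# Draw an SRS of the allocated size within every stratum, then combine
rows_by_stratum <- split(seq_len(nrow(pop)), pop$stratum)
picked <- unlist(lapply(names(alloc), function(s) {
  sample(rows_by_stratum[[s]], alloc[[s]])
}))
strat_sample <- pop[picked, ]

print(table(strat_sample$stratum))
print(mean(strat_sample$n_inspections))
```

With \(n = 100\), the allocation works out to 24, 54, and 22 establishments from the three strata, so every stratum is guaranteed to be represented in proportion to its size.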

Observational vs Experimental Studies

  • Observational studies collect data by observing what happens (no intervention)
    • E.g. survey sampling, polls, etc.
  • Experiments collect data after manipulating aspects of the process
    • E.g. drug testing
  • Observational studies are used for descriptions (what is happening), whereas experiments are used for decisions (what should be done)