L9 - Statistical Sampling

Lecture Goals

Understand fundamental concepts of statistical inference
- Population vs Sample
- Parameter vs Statistic
- Sampling Variability & Bias
Apply basic sampling strategies
- Simple and Stratified Random Sampling
Readings
- ISRS: ch. 1.3-1.4

Why Statistics?

Goal: answer questions about a population
- Population: collection of all objects of interest (actual or notional)
- E.g. What is average age of people living in Canada?
Precise answer for entire population requires census
- E.g. Ask everyone living in Canada (cost ~$700mil!)
- Typically either impracticable or impossible
Alternatively sample the population

Sampling

Sample is subset of population that helps answer question "approximately"
Answer quality depends critically on how sample is collected

Parameters & Statistics

Answers typically expressed as statements about summary measure of interest
- Parameter: summary of entire population
- (Sample) Statistic: summary of sample

Notation

Continuous Variables

Description	Parameter	Statistic
Mean/Avg	$\mu$	$\hat{\mu}$ or $\bar{X}$
Std Deviation	$\sigma$	$S$
Median	$\mu_{1/2}$	$M$

Discrete Variables

Description	Parameter	Statistic
Proportion	$p$	$\hat{p}$
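
For reference, the statistics in the two tables above are conventionally computed from a sample $X_1, \dots, X_n$ as sketched below. The slides do not spell these formulas out: the $n-1$ divisor in $S$ is the usual convention and is assumed here, and the proportion assumes the variable is coded so that $X_i = 1$ marks the category of interest. $M$ is the middle value of the ordered sample.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}, \qquad
\hat{p} = \frac{\#\{\, i : X_i = 1 \,\}}{n}
```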

Example

Dinesafe data: interested in average number of inspections per establishment

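library(dplyr)  # provides %>%, group_by(), distinct(), summarise(), sample_n()
# Population: one row per establishment with its number of distinct inspections
pop = dinesafe %>% 
  group_by(ESTABLISHMENT_ID) %>% 
  distinct(INSPECTION_ID) %>% 
  summarise( N_INSPECTIONS = n() ) 
# Parameter: mean number of inspections over all establishments
pop %>% summarise( mean(N_INSPECTIONS) ) 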
## # A tibble: 1 x 1
##   `mean(N_INSPECTIONS)`
##                   <dbl>
## 1                  3.41
# Statistic: the same mean, computed on a random sample of 100 establishments
pop %>% sample_n(100) %>% summarise( mean(N_INSPECTIONS) )
## # A tibble: 1 x 1
##   `mean(N_INSPECTIONS)`
##                   <dbl>
## 1                  3.38

Variability

Statistic value varies with different samples
- Sampling variability is extent to which statistics diverge from their mean
Sampling variability can be controlled by the sample size ($n$)
- Larger $n$ $\rightarrow$ lower variability (higher precision)
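
The effect of $n$ can be made concrete for the sample mean. As a sketch, assume the $n$ observations are drawn independently from a population with standard deviation $\sigma$ (sampling with replacement, or a population much larger than the sample); the notation $\mathrm{SD}(\bar{X})$ is introduced here for illustration. Under that assumption the spread of the sample mean shrinks with the square root of the sample size, so quadrupling $n$ halves the variability:

```latex
\mathrm{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}
```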

Example

Distribution of statistics (avg # of inspections) of different sample sizes
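
A distribution like this can be approximated by simulation. The sketch below is a minimal illustration, not the code behind the slide: it uses a hypothetical population of Poisson-distributed inspection counts instead of the Dinesafe data, and the names `pop_counts`, `sample_means` and `spread` are made up for this example. It draws many simple random samples at each size and reports how much the sample means spread out.

```r
# Hypothetical population: 10,000 establishments with Poisson-distributed
# inspection counts (an assumption for illustration, not the Dinesafe data).
set.seed(1)
pop_counts <- rpois(10000, lambda = 3.4)

# Draw `reps` simple random samples of size n (without replacement)
# and return the mean of each sample.
sample_means <- function(n, reps = 2000) {
  replicate(reps, mean(sample(pop_counts, n)))
}

# Standard deviation of the sample means for each sample size
sizes <- c(10, 100, 1000)
spread <- sapply(sizes, function(n) sd(sample_means(n)))
names(spread) <- paste0("n=", sizes)

print(round(spread, 3))
```

The printed spread should shrink as $n$ grows; calling `hist(sample_means(100))` gives a visual version of the same idea.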

Bias

Statistic changes with different samples, so how do we pick our sample?
Regularities in sampling can lead to bias, i.e. systematic deviation of statistic from parameter
- E.g. collect data by email only
To avoid selection bias & improve representativeness, most sampling methods involve randomness

Simple Random Sampling

Sampling frame, i.e. list of available objects for sampling
- Ideally covers entire population
Simple Random Sample (SRS): objects drawn at random from the frame so that every object has the same probability of selection (every possible sample of size $n$ is equally likely)
- Avoids bias when no other information is available
- Average of all SRS averages equals population average (i.e. the sample mean is unbiased)
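
The last bullet can be written out as a short derivation. As a sketch, assume a frame of $N$ objects with values $x_1, \dots, x_N$ that covers the whole population, and SRS without replacement of size $n$; the symbols $N$, $x_i$ and $\mathcal{S}$ (the set of all possible samples) are introduced here for illustration. Each object appears in exactly $\binom{N-1}{n-1}$ of the $\binom{N}{n}$ possible samples, so averaging the sample means over all samples recovers $\mu$:

```latex
\frac{1}{\binom{N}{n}} \sum_{s \in \mathcal{S}} \bar{X}_s
  = \frac{1}{\binom{N}{n}} \sum_{s \in \mathcal{S}} \frac{1}{n} \sum_{i \in s} x_i
  = \frac{1}{\binom{N}{n}} \cdot \frac{1}{n} \binom{N-1}{n-1} \sum_{i=1}^{N} x_i
  = \frac{1}{N} \sum_{i=1}^{N} x_i
  = \mu
```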

Example

Distribution of statistics from different sampling frames (SRS, $n=100$)
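
The effect of the sampling frame can be sketched with a small simulation. This is an illustration under stated assumptions, not the code behind the slide: the population is a hypothetical set of Poisson-distributed counts, and the "partial frame" (only establishments with at least 3 inspections are listed) is an invented example of a frame that misses part of the population. Both frames are sampled by SRS with $n=100$.

```r
set.seed(2)
# Hypothetical population of inspection counts (illustrative only).
pop_counts <- rpois(10000, lambda = 3.4)

# Two hypothetical sampling frames:
#  - full frame: every establishment can be sampled
#  - partial frame: only establishments with at least 3 inspections are listed
full_frame    <- pop_counts
partial_frame <- pop_counts[pop_counts >= 3]

# Average of many SRS means (n = 100) drawn from a given frame.
avg_of_srs_means <- function(frame, n = 100, reps = 2000) {
  mean(replicate(reps, mean(sample(frame, n))))
}

param       <- mean(pop_counts)
full_est    <- avg_of_srs_means(full_frame)
partial_est <- avg_of_srs_means(partial_frame)

print(c(parameter = param, full_frame = full_est, partial_frame = partial_est))
```

With the full frame the average of the SRS means should sit very close to the parameter; with the partial frame it stays systematically too high no matter how many samples are drawn, because the randomness only operates inside the frame.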

Potential Problems

Randomness alone is not enough for representativeness
- Must pay attention to sampling details

Two common sources of bias are:

Participation or Non-response bias: respondents are not representative of entire population
Coverage bias: sampling frame does not align well with population
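
Non-response bias can arise even when the initial sample is a proper SRS. The sketch below assumes a hypothetical population of Poisson-distributed counts and an invented response rule (establishments with more inspections are less likely to respond); the rule and the names `contacted`, `p_respond` and `respondents` are assumptions made purely for illustration.

```r
set.seed(3)
# Hypothetical population of inspection counts (illustrative only).
pop_counts <- rpois(10000, lambda = 3.4)

# Step 1: a proper SRS of 2000 establishments is contacted.
contacted <- sample(pop_counts, 2000)

# Step 2 (assumption): response probability drops by 0.15 per inspection,
# from 0.9 down to a floor of 0.1.
p_respond   <- pmax(0.1, 0.9 - 0.15 * contacted)
responded   <- runif(length(contacted)) < p_respond
respondents <- contacted[responded]

print(c(parameter   = mean(pop_counts),
        contacted   = mean(contacted),
        respondents = mean(respondents)))
```

The mean over everyone contacted should be close to the parameter, while the mean over respondents is pulled down, since the respondents are no longer representative of the population.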

Stratified Sampling

Population often divided into groups, called strata
Stratified Sampling combines an SRS from every stratum to ensure representation
- Sample sizes proportional to strata sizes (proportional allocation)

Example

## # A tibble: 3 x 3
##   MINIMUM_INSPECTIONS_PERYEAR `mean(N_INSPECTIONS)` `n()`
##                         <int>                 <dbl> <int>
## 1                           1                  1.66  3887
## 2                           2                  3.45  8775
## 3                           3                  5.21  3629
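
A proportional stratified sample for these strata can be sketched as follows. The strata sizes and means are copied from the table above; everything else is an assumption for illustration: the total sample size `n = 100`, the variable names, and the hypothetical population of Poisson counts that stands in for the real data. The code is written in base R so that it runs on its own.

```r
# Strata sizes and means taken from the table above
# (strata are the values 1, 2, 3 of MINIMUM_INSPECTIONS_PERYEAR).
strata_size <- c(`1` = 3887, `2` = 8775, `3` = 3629)
strata_mean <- c(`1` = 1.66, `2` = 3.45, `3` = 5.21)

# Proportional allocation of a total sample of n = 100 (n is an assumption).
n <- 100
alloc <- round(n * strata_size / sum(strata_size))
print(alloc)

# The size-weighted average of the strata means gives the overall mean.
overall <- sum(strata_mean * strata_size) / sum(strata_size)
print(round(overall, 2))

# Hypothetical population: Poisson counts around each stratum mean
# (an illustrative stand-in for the real data).
set.seed(4)
pop <- data.frame(
  stratum = rep(names(strata_size), times = strata_size),
  stringsAsFactors = FALSE
)
pop$count <- rpois(nrow(pop), lambda = strata_mean[pop$stratum])

# Stratified sample: an SRS of alloc[[s]] rows within each stratum s.
idx <- unlist(lapply(names(alloc), function(s) {
  sample(which(pop$stratum == s), alloc[[s]])
}))
strat_sample <- pop[idx, ]

print(table(strat_sample$stratum))
print(mean(strat_sample$count))
```

The allocation comes out as 24, 54 and 22 establishments for the three strata, and the weighted average of the strata means reproduces the population mean of 3.41 reported in the earlier example. Because every stratum is sampled in proportion to its size, no stratum can be over- or under-represented by chance.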

Observational vs Experimental Studies

Observational studies collect data by observing what happens (no intervention)
- E.g. survey sampling, polls, etc.
Experiments collect data after manipulating aspects of the process
- E.g. drug testing
Observational studies are used for descriptions (what is happening), whereas experiments are used for decisions (what should be done)
