Lecture Goals

  • Understand concept of causality
    • Distinguish causation from association
    • Beware of confounding in observational studies
  • Use control variables to limit confounding
  • Use experiments to assess causal relations

  • Readings

Association & Causation

  • Association/Dependence describes any statistical relationship between variables
    • E.g. if \(X\) is large, then \(Y\) will likely be large, & vice-versa
    • Association is symmetric (\(X \leftrightarrow Y\))
  • Causation/Causality describes cause-effect relationship between variables
    • E.g. \(X\) is (partially) responsible for changes in \(Y\)
    • Causation is asymmetric (\(X \rightarrow Y\))

Association vs Causation

  • Type of relation determines its practical use

  • Associations used for prediction
    • Just observe event & try to predict its outcome (\(Y\)) based on available information (\(X\))
  • Causal relationships used for intervention
    • Can actually manipulate (\(X\)) in order to change event's outcome (\(Y\))

Regression

  • Regression model describes "relationship" between response (\(Y\)) and explanatory (\(X\)) variables

  • Plug-in nature of regression function (\(X\rightarrow f(X)=\hat{Y}\)) suggests that \(X\) causes \(Y\)

  • Causal interpretation is generally incorrect
    • Unless data come from experiment, i.e. generated by intervention

Example

  • Consider hypothetical causal model with variables
    • \(X\): ice cream sales
    • \(Y\): area burned by wildfires
    • \(Z\): average temperature

and (linear) causal relationships

\[X = a + b Z + \textrm{err}_X, \quad Y = c + d Z + \textrm{err}_Y\]

Example

  • Simulate data from causal model (no intervention)
sim = tibble( Z = 30 + 10 * rnorm(100),
              X = 20 + 3 * Z + 20*rnorm(100), 
              Y = 50 + 1 * Z + 10*rnorm(100))
  • rnorm() generates (pseudo-)random numbers from a standard normal distribution
tibble(x=rnorm(1000)) %>% ggplot(aes(x=x)) + geom_density()

Example

  • Simulated data scatterplots
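A sketch of how such scatterplots could be produced, assuming ggplot2 is loaded and `sim` comes from the previous slide:

```r
library(ggplot2)

# X vs Y: spurious association induced by the common cause Z
ggplot(sim, aes(x = X, y = Y)) + geom_point() + geom_smooth(method = "lm")

# Z vs X and Z vs Y: the genuine causal relationships
ggplot(sim, aes(x = Z, y = X)) + geom_point() + geom_smooth(method = "lm")
ggplot(sim, aes(x = Z, y = Y)) + geom_point() + geom_smooth(method = "lm")
```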

Association

  • Variables \(X,Y\) are associated, by virtue of both being caused by \(Z\)
    • Relationship can be described by regression
lm( Y ~ X, data = sim) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   60.1      3.71       16.2  1.75e-29
## 2 X              0.199    0.0322      6.19 1.44e- 8
  • Perfectly OK to use model for prediction:
    • Predict wildfire severity based on ice cream sales
  • Not OK to use model for intervention
    • Cannot prevent wildfires by banning ice cream
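Prediction from the fitted association is just a matter of plugging in a new \(X\); a minimal sketch, assuming the `sim` data and tidyverse loaded as above:

```r
library(tibble)

# Fit the (non-causal) association between sales and wildfire area
fit <- lm(Y ~ X, data = sim)

# Predict wildfire area from a hypothetical new value of ice cream sales
predict(fit, newdata = tibble(X = 100))
```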

Causation

  • In an experiment, we can manipulate variables
    • Intervention in causal variable propagated to affected variables
    • Intervention in non-causal variable is inconsequential, and breaks up spurious associations

Example

  • Experiment: control causal variable \(Z\)
sim %>% 
  mutate( Z = seq( from = min(Z), to = max(Z), length.out = 100), 
          X = 20 + 3 * Z + 20*rnorm(100),
          Y = 50 + 1 * Z + 10*rnorm(100) ) %>% 
  lm(Y ~ Z, data = .) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    50.0     2.43        20.6 2.51e-37
## 2 Z               1.03    0.0755      13.7 1.98e-24

Example (cont'd)

  • Experiment: control non-causal variable \(X\)
sim %>% 
  mutate( X = seq(min(X), max(X), length.out = 100))  %>% 
  lm(Y ~ X, data = .) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  77.3       3.39       22.8  5.84e-41
## 2 X             0.0422    0.0280      1.50 1.36e- 1

Confounding Variables

  • Confounder: variable that influences both dependent & independent variables, causing spurious (i.e. non-causal) association
    • E.g. variable \(Z\) considered previously
  • Confounding prevents us from confidently drawing causal inferences from observational data
    • Confounding does not affect predictions
  • Effect of known confounders can be controlled by including them in the model
    • Controlling does not address potential unknown confounders

Example

  • Include confounder (\(Z\)) without intervention
sim %>% 
  mutate( X = 20 + 3 * Z + 20*rnorm(100) )  %>% 
  lm(Y ~ X + Z, data = .) %>% tidy()
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  56.0       3.48      16.1   4.21e-29
## 2 X            -0.0374    0.0538    -0.695 4.89e- 1
## 3 Z             0.983     0.198      4.96  3.05e- 6

Simpson's Paradox

  • Extreme version of confounding, where association between \(X\) & \(Y\) is reversed when additional variable \(Z\) is considered
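A hypothetical simulation of the reversal (model and coefficients invented for illustration): marginally the slope of \(Y\) on \(X\) is positive, but within each level of \(Z\) it is negative.

```r
library(tibble)
library(broom)

set.seed(1)
# Z drives X upward and Y strongly upward; X itself lowers Y
sim2 <- tibble( Z = rep(c(0, 10), each = 50),
                X = 2 * Z + rnorm(100),
                Y = 5 * Z - 1 * X + rnorm(100) )

tidy(lm(Y ~ X, data = sim2))      # slope of X is positive (spurious)
tidy(lm(Y ~ X + Z, data = sim2))  # slope of X is negative (causal sign)
```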

Example

  • 1973 UC Berkeley admissions data
    • Apparent gender bias in acceptance rate

Example (cont'd)

  • Gender bias reverses when considering different departments (confounder)
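The acceptance rates can be computed from R's built-in UCBAdmissions contingency table (the 1973 data); a sketch assuming dplyr is loaded:

```r
library(dplyr)

ucb <- as_tibble(UCBAdmissions)

# Aggregate acceptance rate by gender: males appear favored
ucb %>% group_by(Gender) %>%
  summarise(rate = sum(n[Admit == "Admitted"]) / sum(n))

# Within each department the apparent bias largely disappears
ucb %>% group_by(Dept, Gender) %>%
  summarise(rate = sum(n[Admit == "Admitted"]) / sum(n), .groups = "drop")
```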

Randomized Controlled Trials

  • Gold standard of scientific evidence is Randomized Controlled Trial (RCT)

  • Control variables that can be manipulated
    • Variables of interest are set to desired treatment, e.g. drug or placebo
    • Nuisance variables are kept constant or distributed evenly (blocking), e.g. same gender ratio for each treatment
  • Randomize for variables that cannot be manipulated
    • Randomly assign subjects to treatments, to account for variables that cannot be controlled (prevent bias)
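Random assignment is easy to implement; a minimal sketch with 100 hypothetical subjects split evenly between two treatments:

```r
library(tibble)
library(dplyr)

set.seed(42)
# Shuffle a balanced vector of treatment labels across subjects;
# randomization balances unmeasured variables across groups on average
subjects <- tibble( id = 1:100,
                    treatment = sample(rep(c("drug", "placebo"), each = 50)) )

count(subjects, treatment)
```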

Statistical Skepticism