L16 - Causality & Experiments

Lecture Goals

Understand concept of causality
- Distinguish causation from association
- Beware of confounding in observational studies
Use control variables to limit confounding
Use experiments to assess causal relations
Readings
- ISRS: ch. 1.3-5

Association & Causation

Association/Dependence describes any statistical relationship between variables
- E.g. if \(X\) is large, then \(Y\) will likely be large, & vice-versa
- Association is symmetric (\(X \leftrightarrow Y\))
Causation/Causality describes cause-effect relationship between variables
- E.g. \(X\) is (partially) responsible for changes in \(Y\)
- Causation is asymmetric (\(X \rightarrow Y\))
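
A minimal sketch of the symmetry of association, using simulated data. The variable names, seed and coefficient are illustrative assumptions, not part of the lecture. The correlation coefficient is the same whichever variable is listed first, so on its own it says nothing about direction.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- rnorm(200)
y <- 2 * x + rnorm(200)          # by construction, x drives y

r_xy <- cor(x, y)
r_yx <- cor(y, x)
isTRUE(all.equal(r_xy, r_yx))    # same value in both directions
```

The data were generated asymmetrically (x causes y), yet the measured association is symmetric.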

Association vs Causation

Type of relation determines its practical use
Associations used for prediction
- Just observe event & try to predict its outcome (\(Y\)) based on available information (\(X\))
Causal relationships used for intervention
- Can actually manipulate (\(X\)) in order to change event's outcome (\(Y\))

Regression

Regression model describes "relationship" between response (\(Y\)) and explanatory (\(X\)) variables
Plug-in nature of regression function (\(X\rightarrow f(X)=\hat{Y}\)) suggests that \(X\) causes \(Y\)
Causal interpretation is generally incorrect
- Unless data come from experiment, i.e. generated by intervention
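
A small sketch of why the plug-in form is misleading, under assumed simulated data (the names `z`, `x`, `y` and the seed are hypothetical). A regression can be fitted in either direction and fits equally well both ways, so the fit cannot tell which variable, if any, is the cause.

```r
set.seed(2)                      # arbitrary seed
z <- rnorm(500)                  # hypothetical common cause
x <- z + rnorm(500)
y <- z + rnorm(500)              # x and y never influence each other

fit_yx <- lm(y ~ x)              # "x explains y"
fit_xy <- lm(x ~ y)              # "y explains x"

r2_yx <- summary(fit_yx)$r.squared
r2_xy <- summary(fit_xy)$r.squared
c(r2_yx, r2_xy)                  # identical: both equal cor(x, y)^2
```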

Example

Consider hypothetical causal model with variables
- \(X\): ice cream sales
- \(Y\): wildfire area
- \(Z\): average temperature

and (linear) causal relationships

\[X = a + b Z + \textrm{err}_X, \quad Y = c + d Z + \textrm{err}_Y\]

Example

Simulate data from causal model (no intervention)

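library(tidyverse)  # tibble(), ggplot(), %>%
library(broom)      # tidy(), used with lm() below

sim = tibble( Z = 30 + 10 * rnorm(100),          # average temperature (common cause)
              X = 20 + 3 * Z + 20*rnorm(100),    # ice cream sales, driven by Z
              Y = 50 + 1 * Z + 10*rnorm(100))    # wildfire area, driven by Z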

rnorm() generates normally distributed random numbers (standard Normal by default)

tibble(x=rnorm(1000)) %>% ggplot(aes(x=x)) + geom_density()
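
As a sketch of how the scaled and shifted draws in the simulation relate to `rnorm()`'s own arguments: multiplying standard Normal draws by 10 and adding 30 gives the same values as asking for `mean = 30, sd = 10` directly. The seed is arbitrary and only there to make the two calls comparable.

```r
set.seed(42)
z_shifted <- 30 + 10 * rnorm(100)            # form used in the simulation

set.seed(42)                                 # same seed, same underlying draws
z_direct <- rnorm(100, mean = 30, sd = 10)   # equivalent parameterised call

isTRUE(all.equal(z_shifted, z_direct))
```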

Example

Simulated data scatterplots

Association

Variables \(X,Y\) are associated, by virtue of being caused by \(Z\)
- Relationship can be described by regression

lm( Y ~ X, data = sim) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   60.1      3.71       16.2  1.75e-29
## 2 X              0.199    0.0322      6.19 1.44e- 8

Perfectly OK to use model for prediction:
- Predict wildfire severity based on ice cream sales
Not OK to use model for intervention
- Cannot prevent wildfires by banning ice cream
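
The fitted slope can be checked against the simulation parameters. A sketch of the calculation, assuming (as in the simulated data) that the error terms are independent of \(Z\) and of each other; the symbol \(\beta_{Y \sim X}\) for the regression slope is introduced here for illustration:

\[\beta_{Y \sim X} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} = \frac{b\,d\,\mathrm{Var}(Z)}{b^2\,\mathrm{Var}(Z) + \mathrm{Var}(\textrm{err}_X)} = \frac{3 \cdot 1 \cdot 10^2}{3^2 \cdot 10^2 + 20^2} = \frac{300}{1300} \approx 0.23\]

The estimate 0.199 (std. error 0.032) is consistent with this value. The slope is non-zero only because both \(X\) and \(Y\) depend on \(Z\), not because \(X\) has any effect on \(Y\).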

Causation

In an experiment, we can manipulate variables
- Intervention in causal variable propagated to affected variables
- Intervention in non-causal variable is inconsequential, and breaks up spurious associations

Example

Experiment: control causal variable \(Z\)

sim %>% 
  mutate( Z = seq( from = min(Z), to = max(Z), length.out = 100), 
          X = 20 + 3 * Z + 20*rnorm(100),
          Y = 50 + 1 * Z + 10*rnorm(100) ) %>% 
  lm(Y ~ Z, data = .) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    50.0     2.43        20.6 2.51e-37
## 2 Z               1.03    0.0755      13.7 1.98e-24

Example (cont'd)

Experiment: control non-causal variable \(X\)

sim %>% 
  mutate( X = seq(min(X), max(X), length.out = 100))  %>% 
  lm(Y ~ X, data = .) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  77.3       3.39       22.8  5.84e-41
## 2 X             0.0422    0.0280      1.50 1.36e- 1

Confounding Variables

Confounder: variable that influences both dependent & independent variables, causing spurious (i.e. non-causal) association
- E.g. variable \(Z\) considered previously
Confounding prevents us from confidently drawing causal inferences from observational data
- Confounding does not affect predictions, as long as they are made under the same observational conditions (no intervention)
Effect of known confounders can be controlled by including them in the model
- Controlling does not address potential unknown confounders

Example

Include confounder (\(Z\)) without intervention

sim %>% 
  mutate( X = 20 + 3 * Z + 20*rnorm(100) )  %>% 
  lm(Y ~ X + Z, data = .) %>% tidy()
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  56.0       3.48      16.1   4.21e-29
## 2 X            -0.0374    0.0538    -0.695 4.89e- 1
## 3 Z             0.983     0.198      4.96  3.05e- 6

Simpson's Paradox

Extreme version of confounding, where association between \(X\) & \(Y\) is reversed when additional variable \(Z\) is considered
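
A minimal simulated sketch of such a reversal. The group variable `z`, the coefficients and the seed are hypothetical choices made for illustration. Within each group `x` lowers `y`, but the group with larger `x` also has much larger `y`, so the pooled slope comes out positive.

```r
set.seed(3)                                 # arbitrary seed
n <- 200
z <- rep(c(0, 1), each = n)                 # hypothetical binary confounder (group)
x <- 5 * z + rnorm(2 * n)                   # group 1 has larger x ...
y <- 10 * z - 1 * x + rnorm(2 * n)          # ... and larger y, but x lowers y within groups

slope_pooled   <- coef(lm(y ~ x))["x"]      # ignores z
slope_adjusted <- coef(lm(y ~ x + z))["x"]  # accounts for z
c(pooled = unname(slope_pooled), adjusted = unname(slope_adjusted))
```

The pooled slope is positive and the adjusted slope is negative: the sign of the association flips once `z` is included.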

Example

1973 UC Berkeley admissions data
- Apparent gender bias in acceptance rate

Example (cont'd)

Apparent gender bias reverses when acceptance rates are compared within departments (confounder)
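
One way to inspect this in R is the built-in `UCBAdmissions` table, which holds the 1973 figures for the six largest departments only (so it may not match every number shown in the lecture plots). A sketch using base R:

```r
# UCBAdmissions: built-in 3-way table (Admit x Gender x Dept)
overall <- margin.table(UCBAdmissions, c(1, 2))          # collapse over departments
overall_rate <- prop.table(overall, 2)["Admitted", ]     # admission rate by gender

# Admission rate for each gender within each department
by_dept <- prop.table(UCBAdmissions, c(2, 3))["Admitted", , ]

round(overall_rate, 2)
round(by_dept, 2)
```

In this subset the aggregate rate is higher for men, while women have the higher rate in four of the six departments.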

Randomized Controlled Trials

Gold standard of scientific evidence is Randomized Controlled Trial (RCT)
Control variables that can be manipulated
- Variables of interest are assigned to desired treatment, e.g. drug or placebo
- Nuisance variables are kept constant or distributed evenly (blocking), e.g. same gender ratio for each treatment
Randomize for variables that cannot be manipulated
- Randomly assign subjects to treatments, to account for variables that cannot be controlled (prevent bias)
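
A sketch of blocked random assignment in base R. The subject pool, its size, the treatment labels and the seed are all hypothetical. Treatments are shuffled separately within each gender block, so every block gets the same drug/placebo split while the assignment of individual subjects stays random.

```r
set.seed(4)                                   # arbitrary seed
# Hypothetical subject pool; gender is the nuisance variable used for blocking
subjects <- data.frame(id = 1:40,
                       gender = rep(c("F", "M"), times = c(24, 16)))

subjects$treatment <- NA_character_
for (g in unique(subjects$gender)) {
  rows <- which(subjects$gender == g)
  # Evenly split labels, then shuffle them within the block
  labels <- rep(c("drug", "placebo"), length.out = length(rows))
  subjects$treatment[rows] <- sample(labels)
}

tab <- table(subjects$gender, subjects$treatment)
tab                                           # 12/12 for F, 8/8 for M
```

Randomizing without blocks (a single `sample()` over all subjects) would also prevent bias on average, but could leave the gender ratio unequal across treatments in any one trial.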

Statistical Skepticism

(https://imgs.xkcd.com/comics/correlation.png)