L22 - Model Selection

Lecture Goals

Understand bias-variance trade-off in modeling
Compare models based on predictive performance
Perform variable selection
Readings
- ISLR ch 7.1-3

Model Selection

Model Selection: the process of choosing an appropriate model to improve predictive performance
- Different from model fitting/learning (i.e. estimating parameters)
Involves controlling model complexity
- Regularization controls flexibility
- Variable selection controls information

Model Selection

Model too simple \(\rightarrow\) Bias
- Cannot accurately describe relationship
Model too complex \(\rightarrow\) Variance
- Cannot accurately estimate relationship

Predictive Performance

In-sample performance (MSE/\(R^2\)) never gets worse as complexity increases
- Want model to perform well out-of-sample
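
The gap between in-sample and out-of-sample performance can be illustrated with a small simulation. The sketch below does not use the LFS data: the data-generating process (a sine curve plus noise), the seed, and the names sim, sim_train, sim_test and mse are all assumptions made for illustration.

```r
# Simulated illustration (not the LFS data): train vs. test MSE by polynomial degree
set.seed(1)                                    # arbitrary seed, for reproducibility only
n = 200
x = runif(n, -2, 2)
sim = data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.5))  # assumed true relationship + noise

train_id  = sample(n, n / 2)                   # random 50/50 split
sim_train = sim[train_id, ]
sim_test  = sim[-train_id, ]

# Mean squared prediction error of a fitted model on a data set
mse = function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

degrees = 1:10                                 # model complexity = polynomial degree
results = t(sapply(degrees, function(d) {
  fit = lm(y ~ poly(x, d), data = sim_train)   # fit on training data only
  c(degree = d, train_mse = mse(fit, sim_train), test_mse = mse(fit, sim_test))
}))
results = as.data.frame(results)
print(results)
```

Because the polynomial models are nested, train_mse cannot increase with the degree; test_mse carries no such guarantee and would typically stop improving once the extra flexibility only fits noise.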

Bias-Variance Trade-Off

Another perspective

Regularization

Regularization: penalize fit according to model complexity
- E.g. select model that minimizes:
  \[\text{(training error) + (complexity penalty)}\]
Optimal penalty is unknown, needs to be estimated
Choose penalty based on out-of-sample performance
- Theoretical: Adjusted \(R^2\), various criteria (AIC/BIC)
- Empirical: Validation set, Cross-Validation
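
Both ways of choosing the penalty can be sketched on simulated data. The example below is only an illustration under stated assumptions: the data-generating process, the seed, the choice of 5 folds, and the names sim, fold and criteria are hypothetical and not part of the lecture's LFS example. It computes adjusted \(R^2\), AIC and BIC (theoretical) alongside a k-fold cross-validated MSE (empirical) for polynomial models of increasing degree.

```r
# Simulated illustration: theoretical criteria vs. k-fold cross-validation
set.seed(2)                                   # arbitrary seed
n = 200
x = runif(n, -2, 2)
sim = data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.5))

degrees = 1:8
k = 5                                         # number of CV folds (assumed)
fold = sample(rep(1:k, length.out = n))       # random fold label for each row

criteria = data.frame(degree = degrees, adj_r2 = NA_real_, aic = NA_real_,
                      bic = NA_real_, cv_mse = NA_real_)
for (d in degrees) {
  # Theoretical criteria: computed from one fit on all the data
  fit = lm(y ~ poly(x, d), data = sim)
  criteria$adj_r2[d] = summary(fit)$adj.r.squared
  criteria$aic[d]    = AIC(fit)
  criteria$bic[d]    = BIC(fit)

  # Empirical criterion: each fold is predicted by a model fit on the other folds
  sq_err = numeric(n)
  for (j in 1:k) {
    held_out = fold == j
    fit_j = lm(y ~ poly(x, d), data = sim[!held_out, ])
    sq_err[held_out] = (sim$y[held_out] - predict(fit_j, newdata = sim[held_out, ]))^2
  }
  criteria$cv_mse[d] = mean(sq_err)
}
print(criteria)

# Degree preferred by each criterion (they need not agree)
criteria$degree[which.min(criteria$bic)]
criteria$degree[which.min(criteria$cv_mse)]
```

AIC and BIC differ only in how heavily they penalize each parameter (2 versus log(n)), so BIC tends to favour simpler models as n grows.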

Example

Regression tree for hourly earnings (LFS data)
- In-/out-of-sample performance without regularization

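library(dplyr)   # %>%, sample_frac(), setdiff(), summarise(), pull()
library(rpart)   # regression trees
library(modelr)  # add_predictions()

# Random 70/30 train/test split
train = lfs %>% sample_frac(.70)
test = setdiff( lfs, train)

# Grow the tree with no complexity penalty (cp = 0)
tree_out = rpart( hrlyearn ~ immig + age_6 + sex + marstat + 
                    educ + naics_21 + noc_10 + noc_40, data = train,
                  control = rpart.control( minsplit = 20, cp = 0))

# In-sample error: SD of residuals on the training data
train %>% add_predictions( tree_out) %>% 
  summarise( sd(hrlyearn - pred)) %>% pull()
## [1] 5.355669
# Out-of-sample error: SD of residuals on the test data
test %>% add_predictions( tree_out) %>% 
  summarise( sd(hrlyearn - pred)) %>% pull()
## [1] 8.535006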

Example

In-/out-of-sample performance with regularization

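# Cross-validated error and its standard error for each candidate penalty (cp)
xerror = tree_out$cptable[,"xerror"]
xstd = tree_out$cptable[,"xstd"]
# One-standard-error-style rule: the smallest tree whose CV error,
# less its standard error, is below the minimum CV error
cp_ind = min( which( xerror - xstd < min(xerror) ) )
cp_opt = tree_out$cptable[ cp_ind, "CP"]

# Prune the full tree back to the selected penalty
tree_reg = prune( tree_out, cp = cp_opt )

# In-sample error gets worse...
train %>% add_predictions( tree_reg) %>% 
  summarise( sd(hrlyearn - pred)) %>% pull()
## [1] 5.82472
# ...but out-of-sample error improves
test %>% add_predictions( tree_reg) %>% 
  summarise( sd(hrlyearn - pred)) %>% pull()
## [1] 8.344568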

Example

Cross-validated error from rpart()

plotcp(tree_out)
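
The plot is drawn from the tree's cptable, which can also be inspected directly. The sketch below uses the built-in iris data rather than LFS, so that it runs on its own; the names tree_iris, cp_tab, cp_1se and tree_1se are hypothetical. It also applies one common textbook formulation of the one-standard-error rule (minimum CV error plus its standard error as the threshold), which is an assumption here and differs slightly from the expression used in the pruning example above.

```r
library(rpart)

set.seed(3)                       # CV folds inside rpart() are random
# Regression tree on built-in data, grown without a complexity penalty
tree_iris = rpart(Sepal.Length ~ ., data = iris,
                  control = rpart.control(minsplit = 20, cp = 0))

# One row per candidate subtree; xerror and xstd are the CV error and its SE
cp_tab = as.data.frame(tree_iris$cptable)
print(cp_tab)

# Textbook 1-SE rule (assumed variant): the simplest tree whose CV error is
# within one standard error of the best CV error
best      = which.min(cp_tab$xerror)
threshold = cp_tab$xerror[best] + cp_tab$xstd[best]
cp_1se    = cp_tab$CP[min(which(cp_tab$xerror <= threshold))]

tree_1se = prune(tree_iris, cp = cp_1se)
nrow(tree_iris$frame)             # number of nodes before pruning
nrow(tree_1se$frame)              # number of nodes after pruning
```

Rows of cptable are ordered from the simplest tree to the most complex, which is why taking the first qualifying row gives the smallest acceptable tree.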

Variable Selection

Adding explanatory variables never hurts in-sample performance
- Similar to increasing model complexity
- However, too many variables can harm predictions
Variable Selection chooses best subset of variables
- Optimize w.r.t. out-of-sample performance
The step() function in R performs stepwise variable selection (by AIC, by default)
- Input full model (all possible variables)
- Output "optimal" trimmed down model
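
A minimal sketch of step() on built-in data, assuming the mtcars data set with an artificial noise column in place of the LFS data; the names cars_, full, step_aic and step_bic are hypothetical. It also shows the k argument of step(), which sets the penalty per parameter: the default k = 2 corresponds to AIC, and k = log(n) corresponds to BIC.

```r
# Illustration on built-in data (not LFS): stepwise selection with AIC and BIC
set.seed(4)                                              # arbitrary seed
cars_ = transform(mtcars, noise = rnorm(nrow(mtcars)))   # add a pure-noise regressor

# Full model: all candidate variables, including the noise column
full = lm(mpg ~ wt + hp + qsec + noise, data = cars_)

step_aic = step(full, trace = 0)                         # default penalty k = 2 (AIC)
step_bic = step(full, trace = 0, k = log(nrow(cars_)))   # k = log(n) gives BIC

formula(step_aic)   # variables kept under AIC
formula(step_bic)   # variables kept under BIC (never a weaker penalty when n > 7)
step_bic$anova      # sequence of removed variables
```

Whether the noise column is dropped depends on the random draw, so the selected formulas should be read from the output rather than assumed.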

Example

Add randomly generated X, Z & fit linear model

library(broom)   # glance()

# Add two pure-noise variables, unrelated to earnings
lfs_ = lfs %>% mutate( X = rnorm(nrow(lfs)), Z = rnorm(nrow(lfs)))
# Random 50/50 train/test split
train_ = lfs_ %>% sample_frac(.50); test_ = setdiff( lfs_, train_)
# Linear model including the noise variables
lm_out = lm( hrlyearn ~ immig + age_6 + sex + marstat + 
  educ + noc_10 + X + Z, data = train_)
lm_out %>% glance() %>% pull(r.squared)
## [1] 0.4035515
# In-sample error
train_ %>% add_predictions( lm_out) %>% 
  summarise( sd(hrlyearn - pred)) %>% pull()
## [1] 7.345609
# Out-of-sample error
test_ %>% add_predictions( lm_out ) %>% 
  summarise( sd( hrlyearn - pred)) %>% pull()
## [1] 7.373807

Example

Run through step() to remove "redundant" variables

# Stepwise selection starting from the full model; trace = 0 suppresses the log
lm_step = step(lm_out, trace = 0)
lm_step$anova # see removed variables
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1      NA       NA      3767   204878.4 15208.20
## 2  - X  1 41.00885      3768   204919.4 15206.96
## 3  - Z  1 65.31969      3769   204984.7 15206.17
# R^2 of the full model (lm_out), shown for comparison
lm_out %>% glance() %>% pull(r.squared)
## [1] 0.4035515
train_ %>% add_predictions( lm_step ) %>% 
  summarise( sd(hrlyearn - pred)) %>% pull()
## [1] 7.347515
test_ %>% add_predictions( lm_step ) %>% 
  summarise( sd( hrlyearn - pred)) %>% pull()
## [1] 7.370567