- Understand in-/out-of-sample performance
- Measure performance in train & test sets
- Prevent overfitting by
- Penalizing model complexity
- Using (cross) validation
- Readings
- ISLR ch. 5.1
Estimate out-of-sample performance by running model on unused data
```r
train = wdbc %>% sample_frac(.8)
test  = wdbc %>% setdiff(train)

rpart_out = rpart(diagnosis ~ . - id, data = train)

train %>%
  add_predictions(rpart_out, type = "class") %>%
  summarise(accuracy = mean(pred == diagnosis)) %>%
  pull()
## [1] 0.9692308

test %>%
  add_predictions(rpart_out, type = "class") %>%
  summarise(accuracy = mean(pred == diagnosis)) %>%
  pull()
## [1] 0.9473684
```
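The same accuracies can be computed without `modelr`; this is a base-R sketch of the pipeline above, assuming the `train`, `test`, and `rpart_out` objects just defined:

```r
# predict() with type = "class" returns the predicted diagnosis labels
pred_train = predict(rpart_out, newdata = train, type = "class")
mean(pred_train == train$diagnosis)   # in-sample accuracy

pred_test = predict(rpart_out, newdata = test, type = "class")
mean(pred_test == test$diagnosis)     # out-of-sample accuracy
```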
```r
big_tree = rpart(
  diagnosis ~ . - id,
  data = train,
  control = rpart.control(minsplit = 1, cp = 0)
)
rpart.plot(big_tree)
```

Setting `minsplit = 1` and `cp = 0` turns off the complexity controls of `rpart()`, so the tree keeps splitting until it fits the training data perfectly.
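A quick check illustrates why this fully grown tree is overfit; this sketch reuses `train` and `big_tree` from above, and its (near-)perfect in-sample accuracy is a symptom of memorizing the training data, not of a good model:

```r
# In-sample accuracy of the unpruned tree
train %>%
  add_predictions(big_tree, type = "class") %>%
  summarise(accuracy = mean(pred == diagnosis)) %>%
  pull()
```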
```r
big_tree$cptable
##             CP nsplit  rel error    xerror       xstd
## 1  0.814371257      0 1.00000000 1.0000000 0.06156478
## 2  0.041916168      1 0.18562874 0.2634731 0.03775071
## 3  0.017964072      3 0.10179641 0.1976048 0.03312768
## 4  0.011976048      5 0.06586826 0.1676647 0.03069522
## 5  0.008982036      6 0.05389222 0.1497006 0.02910597
## 6  0.005988024      8 0.03592814 0.1377246 0.02798231
## 7  0.003992016     12 0.01197605 0.1137725 0.02555041
## 8  0.000000000     15 0.00000000 0.1257485 0.02679985
```
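The `rpart` package can also plot this table directly; `plotcp()` shows the cross-validated error against tree size, with a horizontal line drawn one standard error above the minimum:

```r
# Cross-validated error vs. tree complexity for the unpruned tree
plotcp(big_tree)
```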
Cross-validation, reported in the `xerror` column of the CP table, gives another estimate of out-of-sample performance
```r
as_tibble(big_tree$cptable) %>%
  mutate(pick = (xerror < min(xerror) + xstd))
## # A tibble: 8 x 6
##         CP nsplit `rel error` xerror   xstd pick
##      <dbl>  <dbl>       <dbl>  <dbl>  <dbl> <lgl>
## 1 0.814         0      1      1      0.0616 FALSE
## 2 0.0419        1      0.186  0.263  0.0378 FALSE
## 3 0.0180        3      0.102  0.198  0.0331 FALSE
## 4 0.0120        5      0.0659 0.168  0.0307 FALSE
## 5 0.00898       6      0.0539 0.150  0.0291 FALSE
## 6 0.00599       8      0.0359 0.138  0.0280 TRUE
## 7 0.00399      12      0.0120 0.114  0.0256 TRUE
## 8 0            15      0      0.126  0.0268 TRUE
```
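The selection can also be done programmatically. This is a sketch of the common one-standard-error rule, which (unlike the row-wise comparison above) uses the `xstd` of the minimum-`xerror` row as the cutoff; `cp_tab`, `cp_pick`, and `pruned` are illustrative names:

```r
cp_tab  = as_tibble(big_tree$cptable)
best    = which.min(cp_tab$xerror)
cutoff  = cp_tab$xerror[best] + cp_tab$xstd[best]

# simplest tree whose cross-validated error is within one SE of the minimum
cp_pick = cp_tab$CP[which(cp_tab$xerror < cutoff)[1]]
pruned  = prune(big_tree, cp = cp_pick)
```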
```r
final_tree = prune(big_tree, cp = .006)

test %>%
  add_predictions(final_tree, type = "class") %>%
  mutate(
    pred      = fct_relevel(pred, "M"),
    diagnosis = fct_relevel(diagnosis, "M")
  ) %>%
  xtabs(~ pred + diagnosis, data = .) %>%
  prop.table()
##     diagnosis
## pred          M          B
##    M 0.35964912 0.01754386
##    B 0.03508772 0.58771930
```
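The confusion table supports more than accuracy. This sketch, assuming the `test` and `final_tree` objects above (and `conf` as an illustrative name), computes sensitivity and specificity from the raw counts:

```r
conf = test %>%
  add_predictions(final_tree, type = "class") %>%
  xtabs(~ pred + diagnosis, data = .)

sensitivity = conf["M", "M"] / sum(conf[, "M"])  # P(predict M | truly M)
specificity = conf["B", "B"] / sum(conf[, "B"])  # P(predict B | truly B)
```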
1. Split the data into a training set (or training + validation sets) and a test set
2. Fit models of varying complexity on the training set
3. Choose the optimal model complexity using regularization or cross-validation
4. Estimate the out-of-sample performance of the final model on the test set
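The cross-validation in step 3 can also be written by hand. This is a minimal 5-fold sketch in base R, assuming the full `wdbc` data frame from earlier; in the workflow above it would be run on the training set only, keeping the test set untouched:

```r
set.seed(1)
k = 5
folds = sample(rep(1:k, length.out = nrow(wdbc)))

cv_acc = sapply(1:k, function(i) {
  # fit on all folds except i, evaluate on fold i
  fit  = rpart(diagnosis ~ . - id, data = wdbc[folds != i, ])
  pred = predict(fit, newdata = wdbc[folds == i, ], type = "class")
  mean(pred == wdbc$diagnosis[folds == i])
})

mean(cv_acc)  # cross-validated accuracy estimate
```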