Goal: Practice measuring out-of-sample performance and avoiding overfitting.
The Indian Liver Patient (ILP) data contain 583 records of 10 features based on various medical tests, plus the variable patient indicating whether the individual is a liver patient.
library(tidyverse)

# read the data and encode the response as a factor
ilp = read_csv("data/ilp.csv") %>%
  mutate( patient = factor(patient) )
glimpse(ilp)
Observations: 583
Variables: 11
$ Age     <dbl> 65, 62, 62, 58, 72, 46, 26, 29, 17, 55, 57, 72...
$ Gender  <chr> "Female", "Male", "Male", "Male", "Male", "Mal...
$ TB      <dbl> 0.7, 10.9, 7.3, 1.0, 3.9, 1.8, 0.9, 0.9, 0.9, ...
$ DB      <dbl> 0.1, 5.5, 4.1, 0.4, 2.0, 0.7, 0.2, 0.3, 0.3, 0...
$ Alkphos <dbl> 187, 699, 490, 182, 195, 208, 154, 202, 202, 2...
$ Sgpt    <dbl> 16, 64, 60, 14, 27, 19, 16, 14, 22, 53, 51, 31...
$ Sgot    <dbl> 18, 100, 68, 20, 59, 14, 12, 11, 19, 58, 59, 5...
$ TP      <dbl> 6.8, 7.5, 7.0, 6.8, 7.3, 7.6, 7.0, 6.7, 7.4, 6...
$ ALB     <dbl> 3.3, 3.2, 3.3, 3.4, 2.4, 4.4, 3.5, 3.6, 4.1, 3...
$ `A/G`   <dbl> 0.90, 0.74, 0.89, 1.00, 0.40, 1.30, 1.00, 1.10...
$ patient <fct> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE...
Split the data into training and test sets (80%-20%). Fit a classification tree (with default parameters) and report its training and test set accuracy.
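One possible workflow for this step is sketched below; the index-based split and the seed are choices for illustration, not requirements of the assignment.

library(rpart)

set.seed(1)                                            # any seed; results will vary slightly
train_idx = sample( nrow(ilp), size = 0.8 * nrow(ilp) )
ilp_train = ilp[ train_idx, ]
ilp_test  = ilp[ -train_idx, ]

tree_fit = rpart( patient ~ ., data = ilp_train )      # default parameters

# accuracy = proportion of correct class predictions
acc = function(model, data) {
  mean( predict(model, data, type = "class") == data$patient )
}
acc(tree_fit, ilp_train)   # training accuracy
acc(tree_fit, ilp_test)    # test accuracy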
Select an optimal tree model using cross-validation. Use the control options rpart.control(minsplit = 1, cp = 0) to grow the full tree, and then prune it back to the subtree with the minimum cross-validation error. Report your new model's training and test set accuracy.
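A minimal pruning sketch, reusing the split and the acc() helper from the previous sketch; looking up the cp value with the smallest xerror in cptable is the standard rpart way to find the cross-validation minimum.

# grow the full tree with the control options given above
full_tree = rpart( patient ~ ., data = ilp_train,
                   control = rpart.control(minsplit = 1, cp = 0) )

# cptable stores the cross-validation error (xerror) for each candidate cp value
best_cp  = full_tree$cptable[ which.min( full_tree$cptable[, "xerror"] ), "CP" ]
opt_tree = prune( full_tree, cp = best_cp )

acc(opt_tree, ilp_train)   # training accuracy of the pruned tree
acc(opt_tree, ilp_test)    # test accuracy of the pruned tree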
Create a 95% confidence interval for the test accuracy of your optimal tree, using 1000 bootstrap samples (you can use the infer package). Does the optimal model improve significantly over the first one?
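One way to bootstrap the test accuracy with infer is to treat each test observation's correct/incorrect indicator as the response; the object names below follow the earlier sketches.

library(infer)

test_correct = ilp_test %>%
  mutate( correct = factor( predict(opt_tree, ilp_test, type = "class") == patient ) )

boot_acc = test_correct %>%
  specify( response = correct, success = "TRUE" ) %>%
  generate( reps = 1000, type = "bootstrap" ) %>%
  calculate( stat = "prop" )

get_confidence_interval( boot_acc, level = 0.95 )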
Plot the optimal tree model; does it help predict liver disease? Justify your answer.
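For the plot, the rpart.plot package (assumed installed) gives a readable tree; base plot()/text() also works.

library(rpart.plot)
rpart.plot(opt_tree)        # or: plot(opt_tree); text(opt_tree)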
We will use a simulation experiment to explore overfitting. Consider \(N=1000\) observations with a random (equiprobable) binary response variable (Y) and \(p=250\) unrelated random features (V1:V250).
N = 1000; p = 250
set.seed(123)

# N x p matrix of independent standard normal features, plus a coin-flip response Y
toy = as_tibble( matrix( rnorm(N*p), ncol = p ) ) %>%
  mutate( Y = sample( c(0,1), size = N, replace = TRUE ) )
What should be the out-of-sample performance of any classifier applied to this data? Explain.
Split the data into training and test sets (80%-20%). Fit a logistic regression model using all 250 feature variables, and report its training and test accuracy. Do you believe there is overfitting?
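A possible sketch for the split and the full 250-feature logistic fit; glm warnings about fitted probabilities of 0 or 1 are expected here and are themselves a hint about overfitting.

set.seed(1)
train_idx = sample( nrow(toy), size = 0.8 * nrow(toy) )
toy_train = toy[ train_idx, ]
toy_test  = toy[ -train_idx, ]

logit_all = glm( Y ~ ., data = toy_train, family = binomial )

# accuracy of a logistic model at the 0.5 probability threshold
acc_glm = function(model, data) {
  pred = as.numeric( predict(model, data, type = "response") > 0.5 )
  mean( pred == data$Y )
}
acc_glm(logit_all, toy_train)   # training accuracy
acc_glm(logit_all, toy_test)    # test accuracy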
Repeat the previous part using only the first 50 feature variables. Do you see any difference? Explain why you think that is. (Hint: select the first 50 columns to feed into the model.)
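The same fit restricted to V1:V50; a select() call is one way to do the restriction mentioned in the hint.

logit_50 = glm( Y ~ ., data = toy_train %>% select(V1:V50, Y), family = binomial )
acc_glm(logit_50, toy_train)   # training accuracy
acc_glm(logit_50, toy_test)    # test accuracy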
Use cross-validation to estimate the out-of-sample error of the model with all 250 features. Split the training data into 5 non-overlapping folds; for each fold, fit the model on the remaining folds and compute its error on the held-out fold; finally, report the average cross-validation error over all folds.
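A manual 5-fold cross-validation sketch under the reading above (fit on four folds, score on the held-out fold); the random fold assignment and the seed are illustrative choices.

K = 5
set.seed(2)
fold_id = sample( rep(1:K, length.out = nrow(toy_train)) )

cv_err = map_dbl(1:K, function(k) {
  fit  = glm( Y ~ ., data = toy_train[ fold_id != k, ], family = binomial )
  pred = as.numeric( predict(fit, toy_train[ fold_id == k, ], type = "response") > 0.5 )
  mean( pred != toy_train$Y[ fold_id == k ] )   # misclassification error on held-out fold
})
mean(cv_err)   # average 5-fold cross-validation error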
Simulate \(N=5000\) observations (rather than 1000) and fit the model with all 250 variables again. Report the training and test error (80%-20% split). What do you observe?
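A sketch of the larger simulation, reusing the same recipe and the acc_glm() helper from above; error here means 1 minus accuracy.

N = 5000
set.seed(123)
toy_big = as_tibble( matrix( rnorm(N*p), ncol = p ) ) %>%
  mutate( Y = sample( c(0,1), size = N, replace = TRUE ) )

train_idx = sample( nrow(toy_big), size = 0.8 * nrow(toy_big) )
big_train = toy_big[ train_idx, ]
big_test  = toy_big[ -train_idx, ]

logit_big = glm( Y ~ ., data = big_train, family = binomial )
1 - acc_glm(logit_big, big_train)   # training error
1 - acc_glm(logit_big, big_test)    # test error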