Goal: Practice regularization and variable selection for regression.


We will use the 2018 fuel consumption data, with `MAKE` and `CLASS` as factor variables whose least frequent levels have been lumped together. The data is split into a training set (70%) and a test set.

```{r, results='hide'}
fcr = read_csv("../data/2018 Fuel Consumption Ratings.csv") %>% 
  filter( FUEL %in% c("X","Z") ) %>% 
  mutate( 
    MAKE = factor(MAKE) %>% fct_lump(20) %>% fct_relevel("Other"),
    CLASS = factor(CLASS) %>% fct_lump(10) %>% fct_relevel("Other") ) %>%  
  select( COMB, MAKE, CLASS, ENGINE_SIZE, CYLINDERS )

my_seed = 123
set.seed(my_seed)
train = fcr %>% sample_frac(.7)
test = setdiff(fcr, train)
```
1. Fit a regression tree for `COMB ~ ENGINE_SIZE + CYLINDERS + MAKE + CLASS` with control parameters `rpart.control( minsplit = 10, cp = 0 )`. Report the standard deviation of the model errors for the train and test sets.

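One way to set this up (a sketch; `add_predictions()` comes from the modelr package, and the model name `tree_out` matches the code given in part 3):

```{r}
library(rpart)    # for rpart() / rpart.control()
library(modelr)   # for add_predictions()

# Grow a deep tree: cp = 0 disables the complexity-based stopping rule
tree_out = rpart( COMB ~ ENGINE_SIZE + CYLINDERS + MAKE + CLASS, data = train,
                  control = rpart.control( minsplit = 10, cp = 0 ) )

# Standard deviation of in-sample and out-of-sample prediction errors
train %>% add_predictions( tree_out ) %>% summarise( sd(COMB - pred) )
test %>% add_predictions( tree_out ) %>% summarise( sd(COMB - pred) )
```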
2. You will use *regularization* based on the cross-validation results for the model you fit. Choose the complexity penalty (`cp` parameter) for the simplest model with a cross-validated error (`xerror`) within one standard deviation (`xstd`) of the minimum error. Prune your model according to the `cp` value you chose, and report the standard deviation of prediction errors for the train and test sets. Describe what happens to the in-sample and the out-of-sample performance.

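A sketch of the one-standard-deviation selection rule, assuming the tree from part 1 is stored in `tree_out`:

```{r}
# cptable rows are ordered from simplest (fewest splits) to most complex
cpt = as_tibble( tree_out$cptable )

# Threshold: minimum xerror plus one xstd (taken at the minimizing row)
thresh = min( cpt$xerror ) + cpt$xstd[ which.min( cpt$xerror ) ]

# Simplest model within one SD of the minimum: the first qualifying row
cp_1se = cpt %>% filter( xerror <= thresh ) %>% slice(1) %>% pull(CP)

tree_pruned = prune( tree_out, cp = cp_1se )

train %>% add_predictions( tree_pruned ) %>% summarise( sd(COMB - pred) )
test %>% add_predictions( tree_pruned ) %>% summarise( sd(COMB - pred) )
```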
3. The `cptable` of your model from part 1 contains columns with the number of splits (`nsplit`), the relative training error (`rel error`), and the cross-validated relative error (`xerror`). The following piece of code adds an extra column (`test_error`) with the relative *out-of-sample* error, calculated based on the test set. Recreate the plot below showing the training, cross-validated, and test error versus the tree model's complexity (number of splits).

```{r}
as_tibble( tree_out$cptable ) %>% 
  group_by(CP) %>% 
  mutate( test_error = test %>%
            add_predictions( prune(tree_out, CP) ) %>%
            summarise( var(COMB - pred) ) %>% 
            pull() / var( train$COMB ) )
```

![](img/rel_x_error.PNG)
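One way to recreate the plot (a sketch, assuming the augmented table above was assigned to a variable named `cp_errors`) is to pivot the three error columns into long format:

```{r}
cp_errors %>% 
  ungroup() %>% 
  # note the backticks: the cptable column name "rel error" contains a space
  pivot_longer( c(`rel error`, xerror, test_error), 
                names_to = "type", values_to = "error" ) %>% 
  ggplot( aes( x = nsplit, y = error, color = type ) ) + 
  geom_line() + geom_point() + 
  labs( x = "number of splits", y = "relative error" )
```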

The following code adds 30 randomly generated \(X\) variables to the fuel consumption data, and splits it into a new training and test set.

```{r}
N = nrow(fcr); p = 30
random = data.frame( replicate( p, rnorm(N) ) )
fcr_rnd = fcr %>% 
  select(COMB, MAKE, CLASS, CYLINDERS, ENGINE_SIZE) %>% 
  bind_cols( random )

set.seed(my_seed)
train_rnd = fcr_rnd %>% sample_frac(.70)
test_rnd = setdiff( fcr_rnd, train_rnd )
```
4. Fit a *linear regression* model for `COMB` versus `MAKE`, `CLASS`, `CYLINDERS`, `ENGINE_SIZE`, and *all* of the randomly generated \(X\) variables to the training data; report the training and test error standard deviation.

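A sketch (the name `lm_full` is our choice; since `fcr_rnd` contains only `COMB`, the four real predictors, and the random columns, the `.` shorthand picks up all of them):

```{r}
# Regress COMB on every other column of train_rnd
lm_full = lm( COMB ~ ., data = train_rnd )

train_rnd %>% add_predictions( lm_full ) %>% summarise( sd(COMB - pred) )
test_rnd %>% add_predictions( lm_full ) %>% summarise( sd(COMB - pred) )
```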
5. Perform *variable selection* by running `step()` on your previous model, and report the resulting training and test error standard deviation. Did variable selection discard all of the irrelevant (random) variables?

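A sketch, assuming the full model from the previous part is stored in `lm_full` (a name we chose); `trace = 0` suppresses the step-by-step output:

```{r}
# AIC-based stepwise selection among the terms of the full model
lm_step = step( lm_full, trace = 0 )

train_rnd %>% add_predictions( lm_step ) %>% summarise( sd(COMB - pred) )
test_rnd %>% add_predictions( lm_step ) %>% summarise( sd(COMB - pred) )

# Inspect which terms survived; any X1, X2, ... terms are random noise
names( coef(lm_step) )
```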
6. Fit the simple linear regression model `COMB ~ MAKE + CLASS + CYLINDERS + ENGINE_SIZE`, and report the training and test error standard deviation.
