Goal: Practice multiclass classification and resampling methods (random forests, undersampling).
MNIST Data
We will use a subset of the MNIST database of handwritten digits. The goal is to create a model that can accurately recognise (predict) the digit corresponding to each picture. The data consists of the actual digit (\(Y \in \{0,\ldots,9\}\)) and 784 pixel intensities, from the 28-by-28 digit images. A sample of the data is visualized below (you only need to work with the mnist data-frame; the remaining code is for plotting, adapted from here).
mnist = read_csv("../data/mnist.csv") %>%
mutate( digit = factor(digit))
pixels_gathered = mnist %>%
mutate( image = row_number() ) %>%
gather( pixel, value, -digit, -image) %>%
tidyr::extract(pixel, "pixel", "(\\d+)", convert = TRUE) %>%
mutate(pixel = pixel - 2,
x = pixel %% 28,
y = 28 - pixel %/% 28)
pixels_gathered %>% filter(image <= 3) %>%
ggplot(aes(x, y, fill = value)) + geom_tile() +
facet_wrap(~ image + digit, labeller = "label_both")
Split the mnist data into training and test sets (75%-25%). Fit a full tree to the training data, and use cross-validation to select the simplest model within one standard deviation of the lowest error. Report the confusion matrix and accuracy of the resulting model on the test data.
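A sketch of one possible approach with rpart (the seed, the split via sample(), and the train/test names are assumptions, not part of the worksheet):

```r
library(rpart)

# Assumed 75%-25% split of the mnist data frame
set.seed(1)
n = nrow(mnist)
train_idx = sample(n, size = floor(0.75 * n))
train = mnist[train_idx, ]
test  = mnist[-train_idx, ]

# Grow a full tree: cp = 0 removes the complexity penalty,
# and xval = 10 stores 10-fold cross-validation errors in cptable
full_tree = rpart(digit ~ ., data = train, method = "class",
                  control = rpart.control(cp = 0, xval = 10))

# One-standard-deviation rule: simplest tree whose CV error
# is within one SE of the minimum CV error
cp_tab  = full_tree$cptable
min_row = which.min(cp_tab[, "xerror"])
thresh  = cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
best_cp = cp_tab[which(cp_tab[, "xerror"] <= thresh)[1], "CP"]
pruned  = prune(full_tree, cp = best_cp)

# Confusion matrix and accuracy on the test data
pred = predict(pruned, newdata = test, type = "class")
table(predicted = pred, actual = test$digit)
mean(pred == test$digit)
```

Because cptable is ordered from simplest to most complex, taking the first row whose xerror is below the threshold gives the simplest qualifying tree.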
Now fit a random forest with 250 trees to the training data, and report the confusion matrix and accuracy on the test data. How does the model compare to the previous one?
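A possible sketch with the randomForest package (assuming the train/test objects from the previous part):

```r
library(randomForest)

# Random forest with 250 trees on the training data
set.seed(1)
rf = randomForest(digit ~ ., data = train, ntree = 250)

# Confusion matrix and accuracy on the test data
rf_pred = predict(rf, newdata = test)
rf_conf = table(predicted = rf_pred, actual = test$digit)
rf_conf
mean(rf_pred == test$digit)
```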
Based on your confusion matrix from the previous part, which digit seems the most difficult to classify correctly? For that digit, what other digit does it get mistaken with most often?
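The hardest digit can be read off the confusion matrix programmatically; a sketch assuming an rf_conf table from the previous part, with predicted digits in rows and actual digits in columns:

```r
# Per-digit recall: correct predictions over total actual occurrences
per_digit = diag(rf_conf) / colSums(rf_conf)
hardest = names(which.min(per_digit))

# For the hardest digit, find the most frequent wrong prediction
errs = rf_conf[, hardest]
errs[hardest] = 0
names(which.max(errs))
```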
Credit Card Data
These data come from a Taiwanese credit card company (from UCI’s ML data repo). The goal is to predict which clients will default on their credit card payments, based on their profile and past payments.
ccd = read_csv("../data/credit_card_default.csv") %>%
mutate( default = factor( default ))
Find the proportion of classes (default 0 or 1) in the data. Split the data into training and test sets (80%-20% respectively), and report the proportion of classes within each set.
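One way to sketch this (the seed and the ccd_train/ccd_test names are assumptions):

```r
# Class proportions in the full data
prop.table(table(ccd$default))

# Assumed 80%-20% train-test split
set.seed(1)
idx = sample(nrow(ccd), size = floor(0.8 * nrow(ccd)))
ccd_train = ccd[idx, ]
ccd_test  = ccd[-idx, ]

# Proportions within each set (should be close to the overall ones)
prop.table(table(ccd_train$default))
prop.table(table(ccd_test$default))
```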
Fit a classification tree to the training data, using rpart() with controls cp = .001 and minsplit = 1. Apply the model to the test set and report the confusion matrix. Calculate the test set accuracy, Precision, Recall, and F-measure (where positive is default).
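A sketch of the fit and the metrics, assuming the ccd_train/ccd_test split from the previous part and treating default = 1 as the positive class:

```r
library(rpart)

fit = rpart(default ~ ., data = ccd_train, method = "class",
            control = rpart.control(cp = .001, minsplit = 1))

pred = predict(fit, newdata = ccd_test, type = "class")
conf = table(predicted = pred, actual = ccd_test$default)
conf

# Metrics with "1" (default) as the positive class
TP = conf["1", "1"]; FP = conf["1", "0"]
FN = conf["0", "1"]; TN = conf["0", "0"]
accuracy  = (TP + TN) / sum(conf)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
F1        = 2 * precision * recall / (precision + recall)
```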
Create a balanced version of the training set using undersampling. Keep all the minority class observations in the training set, and randomly sample an equal number of majority class observations; your resulting training set should be smaller but perfectly balanced. Fit a classification tree to this new training set and report the same measures (confusion matrix, accuracy, precision, recall, and F1 measure) for the same test set.
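A sketch of the undersampling step with dplyr (assumes tidyverse is loaded, that class "0" / no default is the majority class, and that ccd_train and ccd_test exist from the earlier parts):

```r
library(rpart)

# Keep all minority-class rows; sample an equal number of majority rows
set.seed(1)
minority = filter(ccd_train, default == "1")
majority = ccd_train %>% filter(default == "0") %>%
  slice_sample(n = nrow(minority))
balanced_train = bind_rows(minority, majority)

table(balanced_train$default)  # perfectly balanced

# Refit the tree with the same controls and evaluate on the same test set
fit_bal = rpart(default ~ ., data = balanced_train, method = "class",
                control = rpart.control(cp = .001, minsplit = 1))
pred_bal = predict(fit_bal, newdata = ccd_test, type = "class")
table(predicted = pred_bal, actual = ccd_test$default)
```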
How do the recalls and precisions of the original and undersampled models compare? Which model would you use if you were a bank manager choosing which clients to give credit to? Justify your answer.