Lecture Goals

  • Perform binary classification with thresholding
  • Measure classifier performance using
    • Confusion matrix
    • ROC curves
  • Readings:

Wisconsin Diagnostic Breast Cancer (WDBC) Data

  • Fine-Needle Aspiration (FNA) data on 569 patients
  • FNA biopsy withdraws small amount of tissue/fluid from suspicious area
    • Biopsy sample checked for cancer cells

WDBC Data

  • Calculate tumor cell features based on FNA images

WDBC Features

  • For each cell, calculate
    1. radius (mean distance from center to perimeter)
    2. texture (standard deviation of gray-scale values)
    3. perimeter
    4. area
    5. smoothness (local variation in radius lengths)
    6. compactness (perimeter^2 / area - 1.0; see the sketch after this list)
    7. concavity (severity of concave portions of contour)
    8. concave points (number of concave portions of contour)
    9. symmetry
    10. fractal dimension
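
To build intuition for the compactness feature: by the isoperimetric inequality, a circle attains the minimum possible value, and more irregular contours score higher. A quick check in R (the radius value is arbitrary, since the ratio is scale-free):

r <- 1
(2 * pi * r)^2 / (pi * r^2) - 1.0  # 4*pi - 1 ~= 11.57, the minimum over all shapes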

WDBC Features

  • For each image (group of cells) report
    1. mean (.m)
    2. standard deviation (.se)
    3. worst value (.w)
  • In total, 3 x 10 = 30 features (see the sketch below)
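
As a quick illustration of the naming convention, the full set of 30 column names can be generated by crossing the 10 feature names with the 3 summary suffixes (base R; the spellings follow the glimpse() output below):

features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "conc.points", "symetry", "fract.dim")
# Cross the 10 features with the 3 summary suffixes
as.vector(outer(features, c("m", "se", "w"), paste, sep = "."))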

WDBC Data

wdbc = read_csv("data/wdbc.csv")
glimpse(wdbc)
## Observations: 569
## Variables: 32
## $ id             <dbl> 842302, 842517, 84300903, 84348301, 84358402, 8...
## $ diagnosis      <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M...
## $ radius.m       <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450,...
## $ radius.se      <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.98...
## $ radius.w       <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, 1...
## $ texture.m      <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, 1...
## $ texture.se     <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0....
## $ texture.w      <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0....
## $ perimeter.m    <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0....
## $ perimeter.se   <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0....
## $ perimeter.w    <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087,...
## $ area.m         <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0....
## $ area.se        <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345,...
## $ area.w         <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902,...
## $ smoothness.m   <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.180...
## $ smoothness.se  <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.9...
## $ smoothness.w   <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.01149...
## $ compactness.m  <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.02461...
## $ compactness.se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0....
## $ compactness.w  <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.01885...
## $ concavity.m    <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0....
## $ concavity.se   <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.00511...
## $ concavity.w    <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.88...
## $ conc.points.m  <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.66...
## $ conc.points.se <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40, ...
## $ conc.points.w  <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, 1...
## $ symetry.m      <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791,...
## $ symetry.se     <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249,...
## $ symetry.w      <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0....
## $ fract.dim.m    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0....
## $ fract.dim.se   <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985,...
## $ fract.dim.w    <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0....

Classification

  • Create system that predicts label \(Y\) based on feature variables \(X_1, \ldots, X_p\)
    • Called binary classification when \(Y\) takes only 2 values
  • Find function \(f(\cdot)\) such that \(f(X_1, \ldots, X_p) = \hat{Y} \approx Y\)

  • Several methods to arrive at \(f(\cdot)\) (a minimal threshold classifier is sketched below)
    • some work better than others on certain problems
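
As a concrete instance of \(f(\cdot)\), here is a minimal sketch of a threshold classifier written as an R function; the cutoff of 3 anticipates the smoothness.m threshold used on the next slides and is otherwise arbitrary:

# Predict malignant ("M") when the feature exceeds the cutoff, benign ("B") otherwise
f <- function(x, cutoff = 3) {
  ifelse(x > cutoff, "M", "B")
}
f(c(2.2, 8.6))  # "B" "M"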

Threshold Classification

  • Mean cell smoothness is indicative of cancer
# Histograms of smoothness.m by diagnosis, with a candidate cutoff at 3
wdbc %>% ggplot(aes(x = smoothness.m, fill = diagnosis)) + 
  geom_histogram(position = "dodge", bins = 30) + 
  geom_vline(xintercept = 3) + scale_x_log10()

Classification Accuracy

  • Accuracy: proportion of correct predictions
    • Does not differentiate between classes (treats all classes as equally important)
  • Compare to the naive majority classifier
    • For WDBC, always predict benign ("B")
# Accuracy of the threshold classifier vs. the naive majority classifier
wdbc %>% mutate( naive = "B",
  predicted = ifelse( smoothness.m > 3, "M", "B")) %>%  
  summarise( acc.pred = mean(predicted == diagnosis),
             acc.naiv = mean(naive == diagnosis) )
## # A tibble: 1 x 2
##   acc.pred acc.naiv
##      <dbl>    <dbl>
## 1    0.805    0.627
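
As a sanity check, the naive classifier's accuracy is just the proportion of benign cases in the data (357 of 569, per the class totals in the confusion matrix below):

mean(wdbc$diagnosis == "B")  # 357/569 ~= 0.627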

Confusion Matrix

  • Confusion matrix is a more nuanced performance measure
    • Assumes the class of interest is labeled positive

                       Actual Positive            Actual Negative            Sum
    Predict Positive   True Positive (\(TP\))     False Positive (\(FP\))    \(PP = TP + FP\)
    Predict Negative   False Negative (\(FN\))    True Negative (\(TN\))     \(PN = FN + TN\)
    Sum                \(P = TP + FN\)            \(N = FP + TN\)


  • What would the Type I and Type II errors from hypothesis testing correspond to here? (A false positive plays the role of a Type I error; a false negative, a Type II error.)

Confusion Matrix

Example

wdbc %>%  
  mutate( predicted = ifelse( smoothness.m > 3, "M", "B") ) %>% 
  # Put the positive class ("M") first in both factors
  mutate( predicted = fct_relevel(predicted, "M"),
          diagnosis = fct_relevel(diagnosis, "M") ) %>% 
  xtabs( ~ predicted + diagnosis, data = .) %>% addmargins()
##          diagnosis
## predicted   M   B Sum
##       M   140  39 179
##       B    72 318 390
##       Sum 212 357 569

Classifier Performance

  • Sensitivity/Recall/True Positive Rate (TPR): \(TP/P\)
  • Precision/Positive Predictive Value (PPV): \(TP/PP\)
  • Specificity/True Negative Rate (TNR): \(TN/N\)
  • False Positive Rate (FPR): \(FP/N = 1 - TNR\)

  • F1-measure: \(2\frac{Sens. \times Prec.}{Sens. + Prec.}\)
    • Closer to 1 is better
    • All of these are computed for the threshold classifier below
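
Plugging in the counts from the confusion matrix above (\(TP = 140\), \(FP = 39\), \(FN = 72\), \(TN = 318\)):

TP <- 140; FP <- 39; FN <- 72; TN <- 318
TPR <- TP / (TP + FN)         # sensitivity: 140/212 ~= 0.660
PPV <- TP / (TP + FP)         # precision:   140/179 ~= 0.782
TNR <- TN / (TN + FP)         # specificity: 318/357 ~= 0.891
FPR <- FP / (FP + TN)         # 39/357 ~= 0.109 = 1 - TNR
2 * TPR * PPV / (TPR + PPV)   # F1 ~= 0.716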

ROC curve

  • ROC curve: plot of TPR against FPR for all possible thresholds (configurations) of a binary classifier

Example

library(pROC)
# ROC curve for the smoothness.m threshold classifier
ROC_out = roc(diagnosis ~ smoothness.m,  data = wdbc)
ggroc(ROC_out)  
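
Conceptually, the ROC curve sweeps the cutoff across the observed feature values and records one (FPR, TPR) pair per cutoff. A minimal sketch of that computation, assuming the tidyverse is loaded and treating "M" as the positive class (note that ggroc() instead puts specificity on the x-axis, running from 1 to 0):

thresholds <- sort(unique(wdbc$smoothness.m))
roc_pts <- map_dfr(thresholds, function(t) {
  pred <- ifelse(wdbc$smoothness.m > t, "M", "B")
  tibble(threshold = t,
         TPR = mean(pred[wdbc$diagnosis == "M"] == "M"),  # sensitivity at cutoff t
         FPR = mean(pred[wdbc$diagnosis == "B"] == "M"))  # 1 - specificity at cutoff t
})
ggplot(roc_pts, aes(x = FPR, y = TPR)) + geom_step()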

ROC curve

  • Classifiers whose ROC curve lies above and to the left are better
    • Compare classifiers irrespective of threshold
    • Difficult to compare curves that cross
  • Area under the curve (AUC) is used as a proxy for classifier performance
auc(diagnosis ~ smoothness.m,  data = wdbc) # auc(ROC_out)
## Area under the curve: 0.8764
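
As a sanity check, roughly the same number can be recovered from the manual roc_pts computed earlier by applying the trapezoidal rule; this is a sketch, and small discrepancies from pROC's value are expected at the step boundaries:

roc_pts %>%
  arrange(FPR, TPR) %>%
  summarise(auc = sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2))  # close to 0.8764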