STAT 220
K-nearest neighbors (KNN) is a supervised machine learning algorithm, i.e., it requires labeled data for training
We need to tell the algorithm the exact number of neighbors (K) we want to consider
Training: fitting a model with certain hyperparameters on a particular subset of the dataset
Testing: evaluating the model on a different subset of the dataset to obtain a final, unbiased estimate of the model’s performance
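With tidymodels, the random split can be sketched via rsample's `initial_split()`. The `fire` tibble below is a toy stand-in for the real data (which has 61 rows); the object names are illustrative:

```r
library(tidymodels)

# Toy stand-in for the dataset (hypothetical values in the documented ranges)
set.seed(2024)  # make the random split reproducible
fire <- tibble(
  temperature = runif(60, 22, 42),
  isi         = runif(60, 0, 18.5),
  classes     = factor(rep(c("fire", "not fire"), each = 30))
)

fire_split <- initial_split(fire, prop = 0.75,  # 75% training, 25% testing
                            strata = classes)   # keep class balance in both pieces
fire_train <- training(fire_split)
fire_test  <- testing(fire_split)
```

Stratifying on `classes` keeps the proportion of fire and not-fire days similar in the two pieces, so the test-set estimate is not distorted by an unlucky split.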

 
A machine learning workflow (the “black box”) containing model specification and preprocessing recipe/formula
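In tidymodels, that black box can be sketched as a `workflow()` bundling a preprocessing recipe with a parsnip model specification. The toy training tibble and object names here are illustrative:

```r
library(tidymodels)

# Toy stand-in for the training set (hypothetical values)
fire_train <- tibble(
  temperature = c(29, 26, 26, 28, 31, 34),
  isi         = c(1, 0.3, 4.8, 0.4, 0.7, 9.2),
  classes     = factor(c("not fire", "not fire", "fire",
                         "not fire", "not fire", "fire"))
)

# Preprocessing recipe: scale, then center, every numeric predictor
fire_recipe <- recipe(classes ~ temperature + isi, data = fire_train) |>
  step_scale(all_numeric_predictors()) |>
  step_center(all_numeric_predictors())

# Model specification: KNN with K = 5 and unweighted ("rectangular") voting
knn_spec <- nearest_neighbor(neighbors = 5, weight_func = "rectangular") |>
  set_engine("kknn") |>
  set_mode("classification")

# The "black box": recipe + model bundled into one workflow
fire_workflow <- workflow() |>
  add_recipe(fire_recipe) |>
  add_model(knn_spec)
```

Bundling the recipe with the model means the same preprocessing is applied automatically at both fitting and prediction time, which avoids leaking test-set information into the scaling step.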
| Variable | Description |
|---|---|
| Date | Day, month, year (DD-MM-YYYY) |
| Temp | Noon temperature in degrees Celsius: 22 to 42 |
| RH | Relative humidity in percent: 21 to 90 |
| Ws | Wind speed in km/h: 6 to 29 |
| Rain | Daily total rain in mm: 0 to 16.8 |
| FFMC | Fine Fuel Moisture Code index: 28.6 to 92.5 |
| DMC | Duff Moisture Code index: 1.1 to 65.9 |
| DC | Drought Code index: 7 to 220.4 |
| ISI | Initial Spread Index: 0 to 18.5 |
| BUI | Buildup Index: 1.1 to 68 |
| FWI | Fire Weather Index: 0 to 31.1 |
| Classes | Two classes, namely .bold[fire] and .bold[not fire] |
# A tibble: 61 × 3
   temperature   isi classes 
         <dbl> <dbl> <chr>   
 1          29   1   not fire
 2          26   0.3 not fire
 3          26   4.8 fire    
 4          28   0.4 not fire
 5          31   0.7 not fire
 6          31   2.5 not fire
 7          34   9.2 fire    
 8          32   7.6 fire    
 9          32   2.2 not fire
10          29   1.1 not fire
# ℹ 51 more rows
Fitted workflow
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_scale()
• step_center()
── Model ───────────────────────────────────────────────────────────────────────
Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(5,     data, 5), kernel = ~"rectangular")
Type of response variable: nominal
Minimal misclassification: 0.03296703
Best kernel: rectangular
Best k: 5

How to choose the number of neighbors in a principled way?

We normally don’t have a clear separation between classes, and we usually have more than two features.
Eyeballing a plot to discern the classes is therefore not helpful in practice.
Instead, we want to evaluate classifiers based on accuracy metrics:
Randomly split the data set into two pieces: a training set and a test set
Train (i.e., fit) KNN on the training set
Make predictions on the test set
See how good those predictions are
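The four steps above can be sketched end to end. The toy data and object names are illustrative, with `fire_results` holding a `classes` (truth) and a `predicted` column for the metric code used later:

```r
library(tidymodels)

# Toy stand-in for the dataset (hypothetical values)
set.seed(220)
fire <- tibble(
  temperature = runif(80, 22, 42),
  isi         = runif(80, 0, 18.5),
  classes     = factor(rep(c("fire", "not fire"), each = 40))
)

# 1. Randomly split into a training set and a test set
fire_split <- initial_split(fire, prop = 0.75, strata = classes)
fire_train <- training(fire_split)
fire_test  <- testing(fire_split)

# 2. Fit KNN (inside a workflow) on the training set
fire_knn_fit <- workflow() |>
  add_recipe(recipe(classes ~ ., data = fire_train) |>
               step_scale(all_numeric_predictors()) |>
               step_center(all_numeric_predictors())) |>
  add_model(nearest_neighbor(neighbors = 5, weight_func = "rectangular") |>
              set_engine("kknn") |>
              set_mode("classification")) |>
  fit(data = fire_train)

# 3. Make predictions on the test set
fire_results <- fire_test |>
  select(classes) |>
  bind_cols(predict(fire_knn_fit, new_data = fire_test)) |>
  rename(predicted = .pred_class)

# 4. See how good those predictions are
accuracy(fire_results, truth = classes, estimate = predicted)
```

Because the test rows played no part in fitting, the resulting accuracy is an honest estimate of how the classifier would do on new days.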


Confusion matrix: tabulation of true (i.e. expected) and predicted class labels
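In yardstick, this tabulation is produced by `conf_mat()`. The six toy truth/prediction pairs below are made up for illustration:

```r
library(yardstick)
library(tibble)

# Hypothetical truth and predicted labels for six cases
toy <- tibble(
  truth     = factor(c("fire", "fire", "not fire", "not fire", "fire", "not fire")),
  predicted = factor(c("fire", "not fire", "not fire", "not fire", "fire", "fire"))
)

cm <- conf_mat(toy, truth = truth, estimate = predicted)
cm  # rows are predicted labels, columns are true labels
```

Here "fire" is the positive class, so the table holds 2 true positives, 1 false negative, 1 false positive, and 2 true negatives; every metric below is a ratio of these four cells.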
Proportion of correctly classified cases \[{\rm Accuracy} = \frac{\text{true positives} + \text{true negatives}}{n}\]
Proportion of positive cases that are predicted to be positive \[{\rm Sensitivity} = \frac{\text{true positives}}{ \text{true positives}+ \text{false negatives}}\] Also called… true positive rate or recall
Proportion of negative cases that are predicted to be negative \[{\rm Specificity} = \frac{\text{true negatives}}{ \text{false positives}+ \text{true negatives}}\] Also called… true negative rate
Proportion of cases predicted to be positive that are truly positive \[{\rm PPV} = \frac{\text{true positives}}{ \text{true positives} + \text{false positives}}\] Also called… precision
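As a quick check of the four formulas, here is the arithmetic on a hypothetical confusion matrix with 60 true fire and 40 true not-fire cases:

```r
tp <- 55; fn <- 5   # the 60 true "fire" cases: 55 caught, 5 missed
tn <- 38; fp <- 2   # the 40 true "not fire" cases: 38 correct, 2 false alarms
n  <- tp + fn + tn + fp

accuracy    <- (tp + tn) / n   # (55 + 38) / 100 = 0.93
sensitivity <- tp / (tp + fn)  # 55 / 60 ≈ 0.917 (true positive rate, recall)
specificity <- tn / (fp + tn)  # 38 / 40 = 0.95  (true negative rate)
ppv         <- tp / (tp + fp)  # 55 / 57 ≈ 0.965 (precision)
```

Note that accuracy alone can hide an imbalance: a classifier that missed most fires could still score well if not-fire days dominate, which is why sensitivity and specificity are reported separately.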

custom_metrics <- metric_set(accuracy, sens, spec, ppv) # select custom metrics
metrics <- custom_metrics(fire_results, truth = classes, estimate = predicted) 
metrics
# A tibble: 4 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.934
2 sens     binary         0.910
3 spec     binary         0.964
4 ppv      binary         0.968

library(yardstick)
fire_prob <- predict(fire_knn_fit, test_features, type = "prob")
fire_results2 <- fire_test %>% select(classes) %>% bind_cols(fire_prob)
fire_results2 %>%
  roc_curve(truth = classes, .pred_fire) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_line(color = "#1f77b4", linewidth = 1.2) +
  geom_abline(linetype = "dashed", color = "gray") +
  annotate("text", x = 0.8, y = 0.1, label = paste("AUC =", round(roc_auc(fire_results2, truth = classes, .pred_fire)$.estimate, 3)), hjust = 1, color = "#ff7f0e", size = 5, fontface = "bold") +
  labs(title = "ROC Curve", subtitle = "Performance of Fire Prediction Model", x = "False Positive Rate (1 - specificity)", y = "True Positive Rate (sensitivity)") +
  theme_minimal()