Intro to Classification

Predicting what category a (future) observation falls into

  • Astronomy: Determine whether an exoplanet is habitable (or not)
  • Filtering: Identify spam emails
  • Medicine: Use lab results to determine who has a disease (or not)
  • Product preference: Make product recommendations based on past purchases

Fire can be deadly, destroying homes, wildlife habitat and timber, and polluting the air with harmful emissions.

Predicting the next forest fire...


  • The data set contains a collection of forest fire observations
  • Observations come from two regions of Algeria: the Bejaia region and the Sidi Bel-Abbes region
  • Observations run from June 2012 to September 2012

Data Description

Variable   Description
Date       Day, month, year (DD-MM-YYYY)
Temp       Noon temperature in degrees Celsius: 22 to 42
RH         Relative humidity in percent: 21 to 90
Ws         Wind speed in km/h: 6 to 29
Rain       Daily total rain in mm: 0 to 16.8
FFMC       Fine Fuel Moisture Code index: 28.6 to 92.5
DMC        Duff Moisture Code index: 1.1 to 65.9
DC         Drought Code index: 7 to 220.4
ISI        Initial Spread Index: 0 to 18.5
BUI        Buildup Index: 1.1 to 68
FWI        Fire Weather Index: 0 to 31.1
Classes    Two classes: fire and not fire

Glimpse of the data

head(fire, 10) %>% knitr::kable()
date        temperature  rh  ws  rain  ffmc   dmc    dc  isi   bui  fwi  classes
2012-06-01           29  57  18   0.0  65.7   3.4   7.6  1.3   3.4  0.5  not fire
2012-06-02           29  61  13   1.3  64.4   4.1   7.6  1.0   3.9  0.4  not fire
2012-06-03           26  82  22  13.1  47.1   2.5   7.1  0.3   2.7  0.1  not fire
2012-06-04           25  89  13   2.5  28.6   1.3   6.9  0.0   1.7  0.0  not fire
2012-06-05           27  77  16   0.0  64.8   3.0  14.2  1.2   3.9  0.5  not fire
2012-06-06           31  67  14   0.0  82.6   5.8  22.2  3.1   7.0  2.5  fire
2012-06-07           33  54  13   0.0  88.2   9.9  30.5  6.4  10.9  7.2  fire
2012-06-08           30  73  15   0.0  86.6  12.1  38.3  5.6  13.5  7.1  fire
2012-06-09           25  88  13   0.2  52.9   7.9  38.8  0.4  10.5  0.3  not fire
2012-06-10           28  79  12   0.0  73.2   9.5  46.3  1.3  12.6  0.9  not fire


Classifying a new observation?

Euclidean distance: the straight line distance between two points on the x-y plane with coordinates
\((x_a, y_a)\) and \((x_b,y_b)\)

\[{\rm Distance} = \sqrt{\left(x_a - x_b \right)^2 + \left( y_a - y_b \right)^2}\]

Manhattan distance: the “taxi-cab” distance between two points on the x-y plane

\[{\rm Distance} = \left|x_a - x_b \right| + \left| y_a - y_b \right|\]
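
As a quick check of the two formulas, here is a minimal sketch in base R. The helper names euclidean_distance and manhattan_distance are made up for this example (they do not come from a package), and the two points are the temperature and FFMC values of the 2012-06-01 and 2012-06-06 rows in the data glimpse.

```r
# Hypothetical helpers (not from a package) implementing the two formulas
euclidean_distance <- function(a, b) sqrt(sum((a - b)^2))
manhattan_distance <- function(a, b) sum(abs(a - b))

# Two observations written as (temperature, ffmc)
obs_a <- c(temperature = 29, ffmc = 65.7)  # 2012-06-01, not fire
obs_b <- c(temperature = 31, ffmc = 82.6)  # 2012-06-06, fire

d_euclid <- euclidean_distance(obs_a, obs_b)
d_manhattan <- manhattan_distance(obs_a, obs_b)

# Base R's dist() implements the same metric and serves as a cross-check
d_check <- dist(rbind(obs_a, obs_b), method = "euclidean")

print(c(euclidean = d_euclid, manhattan = d_manhattan))
```

The Manhattan distance (18.9) is at least as large as the Euclidean distance (about 17.0); that ordering holds for any pair of points.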

Looking at Euclidean distance

1-Nearest Neighbor (NN)



Wait, something is not quite right...

Need to standardize data

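library(tidyverse)
# Standardize: subtract the mean, then divide by the standard deviation
standardize <- function(x, na.rm = FALSE) {
  (x - mean(x, na.rm = na.rm)) /
    sd(x, na.rm = na.rm)
}

# Standard deviation of each predictor before standardizing
fire %>% select(ffmc, temperature) %>%
  map_df(.f = ~sd(.))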
# A tibble: 1 × 2
   ffmc temperature
  <dbl>       <dbl>
1  14.3        3.63

  • Predictors with larger variation have a larger influence on which cases count as “nearest” neighbors
  • Methods that rely on distance can be sensitive (i.e. not invariant) to the scale of the predictors
  • Standardizing only shifts and rescales a variable; it does not change the shape of its distribution
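
To see the effect of scale, the sketch below takes the temperature and FFMC values of the ten rows in the data glimpse and compares each predictor's share of the squared Euclidean distance between rows 1 and 6, before and after standardizing. It uses base R's scale(), which subtracts the column mean and divides by the column standard deviation. The names glimpse10 and share are made up for this example, and ten rows are only an illustration, not the full data.

```r
# First ten rows of the data glimpse (temperature and ffmc only)
glimpse10 <- data.frame(
  temperature = c(29, 29, 26, 25, 27, 31, 33, 30, 25, 28),
  ffmc = c(65.7, 64.4, 47.1, 28.6, 64.8, 82.6, 88.2, 86.6, 52.9, 73.2)
)

# scale() subtracts each column's mean and divides by its standard deviation
glimpse10_std <- as.data.frame(scale(glimpse10))

# After standardizing, every column has mean 0 and standard deviation 1
col_means <- sapply(glimpse10_std, mean)
col_sds <- sapply(glimpse10_std, sd)

# Share of the squared distance between rows 1 and 6 contributed by each predictor
share <- function(df) {
  sq_diff <- (unlist(df[1, ]) - unlist(df[6, ]))^2
  sq_diff / sum(sq_diff)
}
share_raw <- share(glimpse10)      # almost entirely ffmc
share_std <- share(glimpse10_std)  # temperature now carries more weight

print(round(rbind(raw = share_raw, standardized = share_std), 3))
```

On the raw scale the FFMC difference accounts for nearly all of the distance; after standardizing, the two predictors contribute on comparable terms.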

Standardized data

fire1 <- fire %>% mutate(across(where(is.numeric), standardize))
fire1 %>% head() %>% knitr::kable()
date        temperature          rh          ws        rain        ffmc         dmc          dc         isi         bui         fwi  classes
2012-06-01   -0.8688614  -0.3399715   0.8914370  -0.3808708  -0.8461805  -0.9102414  -0.8775901  -0.8286454  -0.9340836  -0.8783457  not fire
2012-06-02   -0.8688614  -0.0702145  -0.8870457   0.2680887  -0.9367751  -0.8537581  -0.8775901  -0.9008609  -0.8989427  -0.8917856  not fire
2012-06-03   -1.6957543   1.3460097   2.3142231   6.1586438  -2.1423802  -0.9828629  -0.8880798  -1.0693637  -0.9832809  -0.9321051  not fire
2012-06-04   -1.9713852   1.8180845  -0.8870457   0.8671282  -3.4316110  -1.0796914  -0.8922757  -1.1415792  -1.0535628  -0.9455449  not fire
2012-06-05   -1.4201233   1.0088135   0.1800439  -0.3808708  -0.9088999  -0.9425176  -0.7391255  -0.8527172  -0.8989427  -0.8783457  not fire
2012-06-06   -0.3175995   0.3344210  -0.5313491  -0.3808708   0.3315493  -0.7165844  -0.5712896  -0.3953525  -0.6810689  -0.6095490  fire

1-NN again

10-NN again

50-NN again
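
The effect of k can be sketched with a small from-scratch classifier in base R. The function knn_classify below is a hypothetical helper written for this illustration, not the code used to draw the plots: it computes the Euclidean distance from a new point to every training point, keeps the k closest, and takes an unweighted majority vote. The training data are the ten rows of the data glimpse, standardized with scale().

```r
# Ten glimpse rows: two standardized predictors and the class label
train_x <- scale(cbind(
  temperature = c(29, 29, 26, 25, 27, 31, 33, 30, 25, 28),
  ffmc = c(65.7, 64.4, 47.1, 28.6, 64.8, 82.6, 88.2, 86.6, 52.9, 73.2)
))
train_y <- c("not fire", "not fire", "not fire", "not fire", "not fire",
             "fire", "fire", "fire", "not fire", "not fire")

# Hypothetical helper: classify one new point by majority vote among its
# k nearest training points (Euclidean distance, unweighted votes)
knn_classify <- function(new_point, train_x, train_y, k) {
  distances <- sqrt(rowSums(sweep(train_x, 2, new_point)^2))
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]  # ties go to the first label alphabetically
}

# A new point on the standardized scale, classified with different k
new_point <- c(temperature = 1, ffmc = 1)
pred_k1 <- knn_classify(new_point, train_x, train_y, k = 1)
pred_k3 <- knn_classify(new_point, train_x, train_y, k = 3)
pred_k10 <- knn_classify(new_point, train_x, train_y, k = 10)

print(c(k1 = pred_k1, k3 = pred_k3, k10 = pred_k10))
```

With k = 10 every training point in this small sample votes, so the prediction is just the overall majority class, whatever the new point is. That is the extreme case of a large k smoothing away local structure.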

Visualizing the decision boundary

  • We can map out the region in feature space where the classifier would predict ‘fire’, and the region where it would predict ‘not fire’

  • There is some boundary between the two, where points on one side of the boundary will be classified ‘fire’ and points on the other side will be classified ‘not fire’

  • This boundary is called the decision boundary
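
One way to map out these regions is to classify every point on a fine grid over feature space and look at where the predicted label changes. The sketch below does this in base R with a 1-nearest-neighbor rule on the ten standardized rows of the data glimpse. The helper nearest_label, the grid limits (-3 to 3), and the plotting choices are assumptions made for illustration; they are not taken from the original figures.

```r
# Standardized training data from the ten glimpse rows
train_x <- scale(cbind(
  temperature = c(29, 29, 26, 25, 27, 31, 33, 30, 25, 28),
  ffmc = c(65.7, 64.4, 47.1, 28.6, 64.8, 82.6, 88.2, 86.6, 52.9, 73.2)
))
train_y <- c(rep("not fire", 5), rep("fire", 3), rep("not fire", 2))

# Hypothetical helper: label of the single nearest training point (1-NN)
nearest_label <- function(temperature, ffmc) {
  sq_dist <- (train_x[, "temperature"] - temperature)^2 +
    (train_x[, "ffmc"] - ffmc)^2
  train_y[which.min(sq_dist)]
}

# Classify every point on a grid covering the feature space
grid <- expand.grid(
  temperature = seq(-3, 3, length.out = 60),
  ffmc = seq(-3, 3, length.out = 60)
)
grid$prediction <- mapply(nearest_label, grid$temperature, grid$ffmc)

# Shade the two predicted regions; the decision boundary is where the color changes
plot(grid$temperature, grid$ffmc, pch = 15, cex = 0.6,
     col = ifelse(grid$prediction == "fire", "orange", "steelblue"),
     xlab = "temperature (standardized)", ylab = "ffmc (standardized)")
points(train_x, pch = ifelse(train_y == "fire", 17, 1))
```

Swapping nearest_label for a majority vote over more neighbors would give a smoother boundary, at the cost of ignoring small local pockets of one class.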

1-NN decision boundary

25-NN decision boundary

tidymodels: a collection of packages for modeling and machine learning using tidyverse principles

1. Load data and convert to correct data types

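library(tidyverse)
library(tidymodels)

fire_raw <- read_csv("") %>%
  janitor::clean_names() %>% tidyr::drop_na() %>%
  mutate(classes = factor(classes)) %>%          # the response must be a factor
  mutate(across(c(10, 13), as.numeric)) %>%      # convert columns 10 and 13 to numeric
  select(temperature, ffmc, classes)
head(fire_raw)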
# A tibble: 6 × 3
  temperature  ffmc classes 
        <dbl> <dbl> <fct>   
1          29  65.7 not fire
2          29  64.4 not fire
3          26  47.1 not fire
4          25  28.6 not fire
5          27  64.8 not fire
6          31  82.6 fire    

2. Create a recipe for data preprocessing

fire_recipe <- recipe(classes ~ ., data = fire_raw) %>%
 step_scale(all_predictors()) %>%
 step_center(all_predictors()) %>%
 prep()

3. Apply the recipe to the data set

fire_scaled <- bake(fire_recipe, fire_raw)
# A tibble: 243 × 3
   temperature   ffmc classes 
         <dbl>  <dbl> <fct>   
 1      -0.869 -0.846 not fire
 2      -0.869 -0.937 not fire
 3      -1.70  -2.14  not fire
 4      -1.97  -3.43  not fire
 5      -1.42  -0.909 not fire
 6      -0.318  0.332 fire    
 7       0.234  0.722 fire    
 8      -0.593  0.610 fire    
 9      -1.97  -1.74  not fire
10      -1.14  -0.324 not fire
# ℹ 233 more rows

4. Create a model specification

knn_spec <- nearest_neighbor(mode = "classification",
                             engine = "kknn",
                             weight_func = "rectangular",
                             neighbors = 5) 

5. Fit the model on the preprocessed data

knn_fit <- knn_spec %>%
 fit(classes ~ ., data = fire_scaled)

6. Classify

Suppose we get two new observations. Use predict to classify them. Because the model was fit on fire_scaled, the values below are on the standardized scale.

# Data frame/tibble of new observations
new_observations <- tibble(temperature = c(1, 2), ffmc = c(-1, 1))
# Making classifications (i.e. predictions)
predict(knn_fit, new_data = new_observations)
# A tibble: 2 × 1
  .pred_class
  <fct>      
1 not fire   
2 fire       
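
An alternative sketch, not the approach used in the steps above, bundles the recipe and the model in a tidymodels workflow() so that new observations given in original units are preprocessed automatically before prediction. The ten-row tibble fire_small is taken from the data glimpse and only stands in for the full data set; the new observations are made-up values, and neighbors = 3 is chosen to suit the small sample. It assumes the tidymodels and kknn packages are installed.

```r
library(tidymodels)

# Ten rows from the data glimpse, standing in for the full data set
fire_small <- tibble(
  temperature = c(29, 29, 26, 25, 27, 31, 33, 30, 25, 28),
  ffmc = c(65.7, 64.4, 47.1, 28.6, 64.8, 82.6, 88.2, 86.6, 52.9, 73.2),
  classes = factor(c(rep("not fire", 5), rep("fire", 3), rep("not fire", 2)))
)

# Unprepped recipe: the workflow estimates the scaling when it is fit
small_recipe <- recipe(classes ~ ., data = fire_small) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

knn_spec <- nearest_neighbor(mode = "classification",
                             engine = "kknn",
                             weight_func = "rectangular",
                             neighbors = 3)

# Fit preprocessing and model together
knn_workflow <- workflow() %>%
  add_recipe(small_recipe) %>%
  add_model(knn_spec) %>%
  fit(data = fire_small)

# Made-up new observations in original units (degrees Celsius, FFMC index)
new_raw <- tibble(temperature = c(26, 32), ffmc = c(50, 87))
predictions <- predict(knn_workflow, new_data = new_raw)
print(predictions)
```

The design choice here is convenience and safety: the same scaling learned from the training data is applied to every new observation, so there is no separate bake() step to forget.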

Further Practice: Pima Indians Diabetes

  • Owned by the National Institute of Diabetes and Digestive and Kidney Diseases
  • A data frame with 768 observations on 9 variables.
  • Each observation holds one patient's lab results, including whether they tested positive for diabetes
  • Response variable: diabetes (pos or neg)
  • Predictor variables: pregnant, glucose, pressure, triceps, insulin, mass, pedigree, age


Variable   Description
pregnant   Number of times pregnant
glucose    Plasma glucose concentration (glucose tolerance test)
pressure   Diastolic blood pressure (mm Hg)
triceps    Triceps skinfold thickness (mm)
insulin    2-hour serum insulin (mu U/ml)
mass       Body mass index (weight in kg / (height in m)²)
pedigree   Diabetes pedigree function
age        Age (years)
diabetes   Diabetes case (pos/neg)

