STAT 220
Predicting what category a (future) observation falls into
Fire can be deadly, destroying homes, wildlife habitat and timber, and polluting the air with harmful emissions.
Dataset
Variable | Description |
---|---|
Date | (DD-MM-YYYY) Day, month, year |
Temp | Noon temperature in Celsius degrees: 22 to 42 |
RH | Relative Humidity in percentage: 21 to 90 |
Ws | Wind speed in km/h: 6 to 29 |
Rain | Daily total rain in mm: 0 to 16.8 |
Fine Fuel Moisture Code (FFMC) index | 28.6 to 92.5 |
Duff Moisture Code (DMC) index | 1.1 to 65.9 |
Drought Code (DC) index | 7 to 220.4 |
Initial Spread Index (ISI) index | 0 to 18.5 |
Buildup Index (BUI) index | 1.1 to 68 |
Fire Weather Index (FWI) index | 0 to 31.1 |
Classes | Two classes, namely fire and not fire |
date | temperature | rh | ws | rain | ffmc | dmc | dc | isi | bui | fwi | classes |
---|---|---|---|---|---|---|---|---|---|---|---|
2012-06-01 | 29 | 57 | 18 | 0.0 | 65.7 | 3.4 | 7.6 | 1.3 | 3.4 | 0.5 | not fire |
2012-06-02 | 29 | 61 | 13 | 1.3 | 64.4 | 4.1 | 7.6 | 1.0 | 3.9 | 0.4 | not fire |
2012-06-03 | 26 | 82 | 22 | 13.1 | 47.1 | 2.5 | 7.1 | 0.3 | 2.7 | 0.1 | not fire |
2012-06-04 | 25 | 89 | 13 | 2.5 | 28.6 | 1.3 | 6.9 | 0.0 | 1.7 | 0.0 | not fire |
2012-06-05 | 27 | 77 | 16 | 0.0 | 64.8 | 3.0 | 14.2 | 1.2 | 3.9 | 0.5 | not fire |
2012-06-06 | 31 | 67 | 14 | 0.0 | 82.6 | 5.8 | 22.2 | 3.1 | 7.0 | 2.5 | fire |
2012-06-07 | 33 | 54 | 13 | 0.0 | 88.2 | 9.9 | 30.5 | 6.4 | 10.9 | 7.2 | fire |
2012-06-08 | 30 | 73 | 15 | 0.0 | 86.6 | 12.1 | 38.3 | 5.6 | 13.5 | 7.1 | fire |
2012-06-09 | 25 | 88 | 13 | 0.2 | 52.9 | 7.9 | 38.8 | 0.4 | 10.5 | 0.3 | not fire |
2012-06-10 | 28 | 79 | 12 | 0.0 | 73.2 | 9.5 | 46.3 | 1.3 | 12.6 | 0.9 | not fire |
Euclidean distance: the straight-line distance between two points on the x-y plane with coordinates \((x_a, y_a)\) and \((x_b, y_b)\)
\[{\rm Distance} = \sqrt{\left(x_a - x_b \right)^2 + \left( y_a - y_b \right)^2}\]
Manhattan distance: the “taxi-cab” distance between two points on the x-y plane
\[{\rm Distance} = \left|x_a - x_b \right| + \left| y_a - y_b \right|\]
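As a small illustration of the two formulas, the sketch below computes both distances in base R between two days from the table above, using temperature as x and ffmc as y. The helper names `euclidean` and `manhattan` are made up for this example and are not part of any package.

```r
# Two days from the table above, described by (temperature, ffmc).
day_a <- c(temperature = 29, ffmc = 65.7)  # 2012-06-01, not fire
day_b <- c(temperature = 31, ffmc = 82.6)  # 2012-06-06, fire

# Euclidean distance: square root of the summed squared differences.
euclidean <- function(a, b) sqrt(sum((a - b)^2))

# Manhattan distance: sum of the absolute differences.
manhattan <- function(a, b) sum(abs(a - b))

d_euclid <- euclidean(day_a, day_b)
d_manhat <- manhattan(day_a, day_b)
print(c(euclidean = d_euclid, manhattan = d_manhat))
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, since the straight line is the shortest path.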
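Wait, something is not quite right... The variables are measured on very different scales, so the ones with the widest ranges dominate the distance. Standardizing each numeric variable puts them on a common scale: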
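# Assumed helper (its definition is not shown here): z-score standardization
standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
fire1 <- fire %>% mutate(across(where(is.numeric), standardize))
fire1 %>% head() %>% knitr::kable()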
date | temperature | rh | ws | rain | ffmc | dmc | dc | isi | bui | fwi | classes |
---|---|---|---|---|---|---|---|---|---|---|---|
2012-06-01 | -0.8688614 | -0.3399715 | 0.8914370 | -0.3808708 | -0.8461805 | -0.9102414 | -0.8775901 | -0.8286454 | -0.9340836 | -0.8783457 | not fire |
2012-06-02 | -0.8688614 | -0.0702145 | -0.8870457 | 0.2680887 | -0.9367751 | -0.8537581 | -0.8775901 | -0.9008609 | -0.8989427 | -0.8917856 | not fire |
2012-06-03 | -1.6957543 | 1.3460097 | 2.3142231 | 6.1586438 | -2.1423802 | -0.9828629 | -0.8880798 | -1.0693637 | -0.9832809 | -0.9321051 | not fire |
2012-06-04 | -1.9713852 | 1.8180845 | -0.8870457 | 0.8671282 | -3.4316110 | -1.0796914 | -0.8922757 | -1.1415792 | -1.0535628 | -0.9455449 | not fire |
2012-06-05 | -1.4201233 | 1.0088135 | 0.1800439 | -0.3808708 | -0.9088999 | -0.9425176 | -0.7391255 | -0.8527172 | -0.8989427 | -0.8783457 | not fire |
2012-06-06 | -0.3175995 | 0.3344210 | -0.5313491 | -0.3808708 | 0.3315493 | -0.7165844 | -0.5712896 | -0.3953525 | -0.6810689 | -0.6095490 | fire |
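To see why standardizing matters for distances, here is a minimal base R sketch using only the first ten days shown earlier. It measures how much of the squared Euclidean distance between day 1 and day 6 comes from ffmc, before and after standardizing. The names `z_score` and `ffmc_share` are hypothetical helpers, and because only ten rows are used, the z-scores will not match the table above, which is based on the full dataset.

```r
# First ten days from the raw table (temperature and ffmc only).
fire10 <- data.frame(
  temperature = c(29, 29, 26, 25, 27, 31, 33, 30, 25, 28),
  ffmc = c(65.7, 64.4, 47.1, 28.6, 64.8, 82.6, 88.2, 86.6, 52.9, 73.2)
)

# z-score: subtract the column mean, then divide by the column standard deviation.
z_score <- function(x) (x - mean(x)) / sd(x)
fire10_std <- as.data.frame(lapply(fire10, z_score))

# Share of the squared Euclidean distance between days 1 and 6 that is due to ffmc.
ffmc_share <- function(df) {
  sq_diff <- (unlist(df[1, ]) - unlist(df[6, ]))^2
  unname(sq_diff["ffmc"] / sum(sq_diff))
}

share_raw <- ffmc_share(fire10)      # on the original scales
share_std <- ffmc_share(fire10_std)  # after standardizing
print(c(raw = share_raw, standardized = share_std))
```

On the raw scale almost all of the distance comes from ffmc, simply because its values span a much wider range than temperature.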
We can map out the region in feature space where the classifier would predict ‘fire’, and the region where it would predict ‘not fire’.
There is a boundary between the two: points on one side are classified ‘fire’ and points on the other side are classified ‘not fire’.
This boundary is called the decision boundary.
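One way to make the idea concrete is to classify every point on a grid and see where the predicted label changes. The sketch below is a hand-written k-nearest-neighbours classifier in base R, trained on only the ten days shown earlier, with k = 3 chosen arbitrarily. The helpers `standardize_with` and `knn_one` are made up for this illustration and are not the classifier used to produce any figure in these slides.

```r
# Training data: the first ten days from the raw table.
train <- data.frame(
  temperature = c(29, 29, 26, 25, 27, 31, 33, 30, 25, 28),
  ffmc = c(65.7, 64.4, 47.1, 28.6, 64.8, 82.6, 88.2, 86.6, 52.9, 73.2),
  classes = c("not fire", "not fire", "not fire", "not fire", "not fire",
              "fire", "fire", "fire", "not fire", "not fire")
)
features <- c("temperature", "ffmc")

# Standardize with the training means and standard deviations.
centers <- colMeans(train[, features])
scales <- apply(train[, features], 2, sd)
standardize_with <- function(m) {
  sweep(sweep(as.matrix(m), 2, centers, "-"), 2, scales, "/")
}
train_std <- standardize_with(train[, features])

# Classify one standardized point by majority vote among its k nearest neighbours.
knn_one <- function(point, k = 3) {
  dists <- sqrt(rowSums(sweep(train_std, 2, point, "-")^2))
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train$classes[nearest])
  names(votes)[which.max(votes)]
}

# Grid covering roughly the observed range of both features.
grid <- expand.grid(
  temperature = seq(25, 33, by = 1),
  ffmc = seq(28, 90, by = 2)
)
grid$predicted <- apply(standardize_with(grid), 1, knn_one)

# Number of grid points assigned to each class.
print(table(grid$predicted))
```

Colouring the grid points by `predicted` in a scatterplot would show the two regions, and the decision boundary is where the colour changes.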
tidymodels: a collection of packages for modeling and machine learning using tidyverse principles
fire_raw <- read_csv("https://raw.githubusercontent.com/deepbas/statdatasets/main/Algeriafires.csv") %>%
  janitor::clean_names() %>%
  tidyr::drop_na() %>%
  mutate(classes = factor(classes)) %>%
  mutate_at(c(10,13), as.numeric) %>%
  select(temperature, ffmc, classes)
head(fire_raw)
# A tibble: 6 × 3
temperature ffmc classes
<dbl> <dbl> <fct>
1 29 65.7 not fire
2 29 64.4 not fire
3 26 47.1 not fire
4 25 28.6 not fire
5 27 64.8 not fire
6 31 82.6 fire
# A tibble: 243 × 3
temperature ffmc classes
<dbl> <dbl> <fct>
1 -0.869 -0.846 not fire
2 -0.869 -0.937 not fire
3 -1.70 -2.14 not fire
4 -1.97 -3.43 not fire
5 -1.42 -0.909 not fire
6 -0.318 0.332 fire
7 0.234 0.722 fire
8 -0.593 0.610 fire
9 -1.97 -1.74 not fire
10 -1.14 -0.324 not fire
# ℹ 233 more rows
Suppose we get two new observations. Use predict() to classify them.
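A minimal sketch of that step with tidymodels might look like the following. It assumes the tidymodels and kknn packages are installed, trains on only the ten days shown earlier rather than the full dataset, and uses k = 3 as an arbitrary choice. The two new observations are invented values for illustration, not real measurements.

```r
library(tidymodels)  # assumes tidymodels and the kknn engine package are installed

# Training data: the first ten days from the raw table.
fire10 <- tibble(
  temperature = c(29, 29, 26, 25, 27, 31, 33, 30, 25, 28),
  ffmc = c(65.7, 64.4, 47.1, 28.6, 64.8, 82.6, 88.2, 86.6, 52.9, 73.2),
  classes = factor(c("not fire", "not fire", "not fire", "not fire", "not fire",
                     "fire", "fire", "fire", "not fire", "not fire"))
)

# Model specification: k-nearest neighbours with an unweighted majority vote.
knn_spec <- nearest_neighbor(neighbors = 3, weight_func = "rectangular") %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Recipe: standardize both predictors before computing distances.
fire_recipe <- recipe(classes ~ temperature + ffmc, data = fire10) %>%
  step_normalize(all_predictors())

# Bundle the recipe and model into a workflow, then fit it.
fire_fit <- workflow() %>%
  add_recipe(fire_recipe) %>%
  add_model(knn_spec) %>%
  fit(data = fire10)

# Two hypothetical new observations (values invented for illustration).
new_obs <- tibble(temperature = c(32, 26), ffmc = c(87, 50))

# predict() returns a tibble with one .pred_class value per new observation.
predictions <- predict(fire_fit, new_data = new_obs)
print(predictions)
```

Because the recipe is part of the workflow, the new observations are standardized with the training means and standard deviations before the neighbours are found.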
diabetes = pos, neg
Variable | Description |
---|---|
pregnant | Number of times pregnant |
glucose | Plasma glucose concentration (glucose tolerance test) |
pressure | Diastolic blood pressure (mm Hg) |
triceps | Triceps skinfold thickness (mm) |
insulin | 2-Hour serum insulin (mu U/ml) |
mass | Body mass index (weight in kg/(height in m)²) |
pedigree | Diabetes pedigree function |
age | Age (years) |
diabetes | Diabetes case (pos/neg) |
ca21-yourusername repository from GitHub