STAT 220
Create a series of training/validation splits, analogous to the training/testing split, using only the training set.
Idea: Split the training data into multiple training-validation pairs, evaluate the classifier on each split, and average the performance metrics.
Image courtesy of Dennis Sun
1. Split the data into \(k\) subsets.
2. Combine the first \(k-1\) subsets into a training set and train the classifier.
3. Evaluate the model predictions on the last (i.e. \(k\)th) held-out subset.
4. Repeat steps 2-3 \(k\) times (i.e. \(k\) “folds”), each time holding out a different one of the \(k\) subsets.
5. Calculate performance metrics from each validation set.
6. Average each metric over the \(k\) folds to come up with a single estimate of that metric.
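The procedure above can be sketched by hand in base R. This is only an illustration under stated assumptions: the data are simulated (not the course data), the classifier is a logistic regression standing in for "the classifier", and names such as `sim`, `fold_id`, and `fold_accuracy` are made up for this example.

```r
set.seed(220)                               # arbitrary seed, for reproducibility only
n <- 200
sim <- data.frame(x = rnorm(n))             # simulated predictor (hypothetical data)
sim$y <- rbinom(n, 1, plogis(2 * sim$x))    # simulated 0/1 response

k <- 5
# Randomly assign every row to one of k equally sized subsets
fold_id <- sample(rep(1:k, length.out = n))

# Hold out each subset in turn, train on the other k - 1, score the held-out rows
fold_accuracy <- sapply(1:k, function(j) {
  train <- sim[fold_id != j, ]
  valid <- sim[fold_id == j, ]
  fit   <- glm(y ~ x, data = train, family = binomial)
  pred  <- as.integer(predict(fit, newdata = valid, type = "response") > 0.5)
  mean(pred == valid$y)                     # accuracy on this validation set
})

# Average the metric over the k folds to get a single estimate
cv_accuracy <- mean(fold_accuracy)
cv_accuracy
```

Every row is used for validation exactly once, so `cv_accuracy` is based on all of the training data without ever scoring a model on rows it was trained on.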
Create your model specification and use tune() as a placeholder for the number of neighbors
Split the fire_train data set into v = 5 folds, stratified by classes
Create a grid of \(K\) values, the number of neighbors
Run 5-fold CV on the k_vals grid, storing four performance metrics
Collect the performance metrics
# A tibble: 6 × 7
  neighbors .metric     .estimator  mean     n std_err .config              
      <dbl> <chr>       <chr>      <dbl> <int>   <dbl> <chr>                
1         1 accuracy    binary     0.987    50 0.00343 Preprocessor1_Model01
2         1 ppv         binary     0.999    50 0.00133 Preprocessor1_Model01
3         1 sensitivity binary     0.979    50 0.00586 Preprocessor1_Model01
4         1 specificity binary     0.998    50 0.002   Preprocessor1_Model01
5         2 accuracy    binary     0.987    50 0.00343 Preprocessor1_Model02
6         2 ppv         binary     0.999    50 0.00133 Preprocessor1_Model02
Collect the performance metrics and find the best model
# A tibble: 20 × 7
# Groups:   .metric [4]
   neighbors .metric     .estimator  mean     n std_err .config              
       <dbl> <chr>       <chr>      <dbl> <int>   <dbl> <chr>                
 1         1 accuracy    binary     0.987    50 0.00343 Preprocessor1_Model01
 2         2 accuracy    binary     0.987    50 0.00343 Preprocessor1_Model02
 3        11 ppv         binary     1        50 0       Preprocessor1_Model11
 4        12 ppv         binary     1        50 0       Preprocessor1_Model12
 5        13 ppv         binary     1        50 0       Preprocessor1_Model13
 6        14 ppv         binary     1        50 0       Preprocessor1_Model14
 7        15 ppv         binary     1        50 0       Preprocessor1_Model15
 8        16 ppv         binary     1        50 0       Preprocessor1_Model16
 9        17 ppv         binary     1        50 0       Preprocessor1_Model17
10        18 ppv         binary     1        50 0       Preprocessor1_Model18
11         1 sensitivity binary     0.979    50 0.00586 Preprocessor1_Model01
12         2 sensitivity binary     0.979    50 0.00586 Preprocessor1_Model02
13        11 specificity binary     1        50 0       Preprocessor1_Model11
14        12 specificity binary     1        50 0       Preprocessor1_Model12
15        13 specificity binary     1        50 0       Preprocessor1_Model13
16        14 specificity binary     1        50 0       Preprocessor1_Model14
17        15 specificity binary     1        50 0       Preprocessor1_Model15
18        16 specificity binary     1        50 0       Preprocessor1_Model16
19        17 specificity binary     1        50 0       Preprocessor1_Model17
20        18 specificity binary     1        50 0       Preprocessor1_Model18
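The tuning steps listed earlier (model specification with `tune()`, folds, grid, cross-validation, collecting metrics) might look roughly like the sketch below. It is not the code behind the tables above: `sim_train` is simulated stand-in data for `fire_train`, the class labels, the recipe with `step_normalize()`, and the grid `1:20` are assumptions, and the `kknn` package must be installed for the engine. The numbers it produces will differ from the output shown.

```r
library(tidymodels)

set.seed(220)  # arbitrary seed, for reproducibility only

# Simulated stand-in for fire_train (hypothetical predictors and labels)
n <- 200
sim_train <- tibble(
  x1 = rnorm(n),
  x2 = rnorm(n),
  classes = factor(ifelse(x1 + x2 + rnorm(n, sd = 0.5) > 0, "fire", "not fire"))
)

# Model specification, with tune() as a placeholder for the number of neighbors
knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Preprocessing (assumed): put the predictors on a common scale
knn_recipe <- recipe(classes ~ ., data = sim_train) %>%
  step_normalize(all_numeric_predictors())

knn_wf <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec)

# Split the training data into v = 5 folds, stratified by the response
folds <- vfold_cv(sim_train, v = 5, strata = classes)

# Grid of candidate values for the number of neighbors
k_vals <- tibble(neighbors = 1:20)

# Run 5-fold CV over the grid, storing four performance metrics
cv_results <- knn_wf %>%
  tune_grid(
    resamples = folds,
    grid = k_vals,
    metrics = metric_set(accuracy, sensitivity, specificity, ppv)
  )

# Collect the metrics, then keep the best value(s) of each metric (ties are kept)
cv_metrics <- collect_metrics(cv_results)
best_by_metric <- cv_metrics %>%
  group_by(.metric) %>%
  slice_max(mean)
```

Because `slice_max()` keeps ties by default, several values of `neighbors` can appear for the same metric, as in the grouped table above.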
Image source: rafalab.github.io/dsbook/

ca23-yourusername repository from GitHub

Linear regression learns from input-output pairs: it fits a linear equation to observed data to describe the relationship between variables.
\[y=\beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \epsilon\]
Objective: Minimize the squared differences between the observed values and the values predicted by the linear equation.
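A small base R sketch can make the objective concrete. The data are simulated with made-up coefficients (intercept 1, slope 2), and `sse()` is a helper defined only for this example. `lm()` returns the least-squares estimates, so moving either coefficient away from them should increase the sum of squared differences.

```r
set.seed(220)                              # arbitrary seed, for reproducibility only
x <- runif(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.3)      # simulated data: beta_0 = 1, beta_1 = 2

fit <- lm(y ~ x)                           # least-squares fit
beta_hat <- coef(fit)                      # estimated intercept and slope

# Sum of squared differences between observed and predicted values
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

sse_fit     <- sse(beta_hat[1], beta_hat[2])         # at the fitted coefficients
sse_shifted <- sse(beta_hat[1] + 0.1, beta_hat[2])   # intercept nudged up
sse_tilted  <- sse(beta_hat[1], beta_hat[2] - 0.1)   # slope nudged down

c(fit = sse_fit, shifted = sse_shifted, tilted = sse_tilted)
```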
\[\text{MSE} =\frac{1}{n} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2\]
\[\text{RMSE} =\sqrt{\frac{1}{n} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2}\]
\[R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \overline{y})^2}\]
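The three formulas translate directly into base R. The vectors below are made-up numbers chosen so the arithmetic is easy to check by hand; they are not from the bike data.

```r
y     <- c(3, 5, 7, 9)   # made-up observed values
y_hat <- c(2, 5, 8, 9)   # made-up predicted values

mse  <- mean((y - y_hat)^2)                               # (1 + 0 + 1 + 0) / 4 = 0.5
rmse <- sqrt(mse)                                         # square root of the MSE
r2   <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)     # 1 - 2 / 20 = 0.9

c(MSE = mse, RMSE = rmse, R2 = r2)
```

RMSE is on the same scale as the response, which usually makes it easier to interpret than MSE.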
The data come from a real study of a bicycle-sharing scheme and were collected to predict rental numbers from seasonality and weather conditions.
| season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | rentals | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 
| 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 
| 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.160296 | 108 | 
| 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.186900 | 82 | 
| Variable | Description | 
|---|---|
| instant | An identifier for each unique row | 
| season | Encoded numerical value for the season (1 for spring, 2 for summer, 3 for fall, 4 for winter) | 
| yr | Observation year in the study, spanning two years (0 for 2011, 1 for 2012) | 
| mnth | Month of observation, numbered from 1 (January) to 12 (December) | 
| holiday | Indicates if the observation was on a public holiday (binary value) | 
| weekday | Day of the week of the observation (0 for Sunday to 6 for Saturday) | 
| workingday | Indicates if the day was a working day (binary value, excluding weekends and holidays) | 
| weathersit | Weather condition category (1 for clear, 2 for mist/cloud, 3 for light rain/snow, 4 for heavy rain/hail/snow/fog) | 
| temp | Normalized temperature in Celsius | 
| atemp | Normalized “feels-like” temperature in Celsius | 
| hum | Normalized humidity level | 
| windspeed | Normalized wind speed | 
| rentals | Count of bicycle rentals recorded | 
| variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | 
|---|---|---|---|---|---|---|
| atemp | 0.079070 | 0.337842 | 0.486733 | 0.474354 | 0.608602 | 0.840896 | 
| temp | 0.059130 | 0.337083 | 0.498333 | 0.495385 | 0.655417 | 0.861667 | 
| hum | 0.000000 | 0.520000 | 0.626667 | 0.627894 | 0.730209 | 0.972500 | 
| windspeed | 0.022392 | 0.134950 | 0.180975 | 0.190486 | 0.233214 | 0.507463 | 
| rentals | 2 | 315.5 | 713 | 848.176471 | 1096 | 3410 | 
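library(tidymodels)  # attaches dplyr (select, mutate, across) and rsample (initial_split)
set.seed(2056)  # make the random split reproducible
bike_select <- bike %>%  # `bike` is assumed to be loaded already
  select(c(season, mnth, holiday, weekday, workingday, weathersit,
           temp, atemp, hum, windspeed, rentals)) %>%
  mutate(across(1:6, factor))  # treat the six categorical codes as factors
bike_split <- bike_select %>%
  initial_split(prop = 0.75, strata = holiday)  # 75/25 split, stratified by holiday
bike_train <- training(bike_split)
bike_test <- testing(bike_split)

Model specification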
Fit the model
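A minimal sketch of specifying and fitting a linear regression with tidymodels follows. Since the bike data are not reproduced here, it uses the built-in `mtcars` data as a stand-in; the formula `mpg ~ wt + hp` and the object names are assumptions for illustration only. For the bike data, the analogous call would presumably be something like `fit(rentals ~ ., data = bike_train)`.

```r
library(tidymodels)

set.seed(2056)

# Stand-in data: split mtcars the same way as the bike data (75% training)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Model specification: ordinary linear regression fitted with lm()
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Fit the model on the training set only
lm_fit <- lm_spec %>%
  fit(mpg ~ wt + hp, data = car_train)

# Predict on the test set and attach the predictions to the observed values
car_results <- predict(lm_fit, new_data = car_test) %>%
  bind_cols(car_test)

# RMSE, R-squared, and MAE on the held-out test set
test_metrics <- car_results %>%
  metrics(truth = mpg, estimate = .pred)
test_metrics
```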
