Class Activity 24

# load the necessary libraries
library(tidyverse) 
library(ggthemes)
library(janitor)
library(broom)
library(mlbench)
library(tidymodels)
library(probably)

select <- dplyr::select
theme_set(theme_stata(base_size = 10))

data(PimaIndiansDiabetes2)
db <- PimaIndiansDiabetes2
db <- db %>% drop_na()  %>% 
  mutate(diabetes = fct_relevel(diabetes, c("neg", "pos"))) # Relevels 'diabetes' factor to ensure 'neg' comes before 'pos'

Group Activity 1

In this activity, we will calculate the probability of diabetes for a glucose level of 150 mg/dL using the logistic regression coefficients \(\beta_0 = -5.61\) and \(\beta_1 = 0.0392\).

a. Calculate Log Odds

First, calculate the log odds for a glucose level of 150 mg/dL.


log_odds <- -5.61 + (0.0392 * 150)
log_odds

[1] 0.27

b. Convert Log Odds to Odds


odds <- exp(log_odds)
odds

[1] 1.309964

c. Convert Odds to Probability


Finally, convert the odds to probability.

probability <- odds / (1 + odds)
probability

[1] 0.5670929

At a glucose level of 150 mg/dL, the estimated probability of having diabetes is about 0.567.
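
The three steps above collapse into a single call, because base R's `plogis()` is the inverse logit \(p = e^{\eta} / (1 + e^{\eta})\). The sketch below is only a cross-check of the hand calculation. The helper name `prob_diabetes` is made up for illustration, and the coefficients are the rounded values quoted in this activity.

```r
# Rounded coefficients quoted in the activity text
beta0 <- -5.61
beta1 <- 0.0392

# Hypothetical helper: log odds -> probability in one step via the inverse logit
prob_diabetes <- function(glucose) {
  plogis(beta0 + beta1 * glucose)
}

# Same quantity as odds / (1 + odds) computed step by step above
p_150 <- prob_diabetes(150)
p_150
```

Since the function is vectorized, something like `prob_diabetes(c(100, 150, 200))` would show how the estimated probability rises with glucose.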

Group Activity 2

Let’s fit the logistic regression model.

set.seed(12345)
db_single <- db %>% select(diabetes, glucose)
db_split <- initial_split(db_single, prop = 0.80)

# Create training data
db_train <- db_split %>% training()

# Create testing data
db_test <- db_split %>% testing()


fitted_logistic_model <- logistic_reg() %>% # Call the model function
        # Set the engine/family of the model
        set_engine("glm") %>%
        # Set the mode
        set_mode("classification") %>%
        # Fit the model
        fit(diabetes~., data = db_train)

tidy(fitted_logistic_model)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -5.61     0.678       -8.28 1.20e-16
2 glucose       0.0392   0.00514      7.62 2.55e-14

We are interested in predicting a patient's diabetes status from their glucose level. Verify that a glucose value of 143.11 mg/dL gives an estimated probability of diabetes of 1/2.

What value of glucose is needed to have a probability of diabetes of 0.5?
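
One way to approach this: a probability of 0.5 means odds of 1 and log odds of 0, so \(\beta_0 + \beta_1 \cdot \text{glucose} = 0\) and glucose \(= -\beta_0 / \beta_1\). The sketch below uses the rounded estimates from the `tidy()` output, so the result is approximate. `glucose_for_prob` is a hypothetical helper name, and `qlogis()` is base R's logit function.

```r
# Rounded estimates from the tidy() output above
beta0 <- -5.61
beta1 <- 0.0392

# Hypothetical helper: glucose level at which the model gives probability p
glucose_for_prob <- function(p) {
  (qlogis(p) - beta0) / beta1
}

# qlogis(0.5) is 0, so this reduces to -beta0 / beta1
glucose_half <- glucose_for_prob(0.5)
glucose_half

# Check in the other direction: probability at glucose = 143.11
p_check <- plogis(beta0 + beta1 * 143.11)
p_check
```

The same helper gives the glucose level for any other target probability, for example `glucose_for_prob(0.75)`.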

Make a classifier that classifies the diabetes status of new patients with a threshold of 0.75, i.e., a new patient is classified as negative if the estimated class probability is less than 0.75. Also, create a confusion matrix of the resulting predictions. Evaluate the model based on accuracy, sensitivity, specificity, and positive predictive value (PPV).
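
As a minimal sketch of the mechanics, the base R code below thresholds a vector of predicted probabilities and computes the four metrics by hand from the confusion matrix, treating "pos" as the event. The function names and the toy vectors are hypothetical and exist only to show the calls; they are not taken from the Pima data. The 0.75 cutoff follows the rule stated in the task (negative below 0.75).

```r
# Label a patient "pos" when the predicted probability reaches the cutoff
classify_by_threshold <- function(prob_pos, threshold) {
  factor(ifelse(prob_pos >= threshold, "pos", "neg"), levels = c("neg", "pos"))
}

# Confusion matrix plus the four requested metrics, with "pos" as the event
evaluate_classifier <- function(truth, predicted) {
  cm <- table(predicted = predicted, truth = truth)
  tp <- cm["pos", "pos"]  # predicted pos, truly pos
  tn <- cm["neg", "neg"]  # predicted neg, truly neg
  fp <- cm["pos", "neg"]  # predicted pos, truly neg
  fn <- cm["neg", "pos"]  # predicted neg, truly pos
  list(
    confusion   = cm,
    accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp)
  )
}

# Hypothetical toy values, only to demonstrate the functions
toy_truth <- factor(c("neg", "neg", "pos", "pos", "pos", "neg"),
                    levels = c("neg", "pos"))
toy_prob  <- c(0.10, 0.80, 0.90, 0.60, 0.76, 0.30)

toy_pred <- classify_by_threshold(toy_prob, threshold = 0.75)
toy_eval <- evaluate_classifier(toy_truth, toy_pred)
toy_eval$confusion
toy_eval[c("accuracy", "sensitivity", "specificity", "ppv")]
```

For the fitted model, the probabilities would come from `predict(fitted_logistic_model, new_data = db_test, type = "prob")$.pred_pos` and the truth from `db_test$diabetes`. The yardstick functions `conf_mat()`, `accuracy()`, `sensitivity()`, `specificity()`, and `ppv()` compute the same quantities; with "pos" as the second factor level they need `event_level = "second"`.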

Generate a ROC curve and determine the optimal threshold: Evaluate the performance of your diabetes prediction model by plotting a ROC curve. Use the curve to identify the point closest to the top-left corner (maximizing sensitivity and minimizing 1 - specificity), then back-calculate the corresponding threshold. Under this criterion, that threshold gives the best balance between sensitivity (true positive rate) and specificity (true negative rate).
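
The closest-to-corner idea can be sketched in base R: compute sensitivity and specificity at every candidate cutoff, measure each ROC point's Euclidean distance to (0, 1), and keep the cutoff with the smallest distance. The function names and toy vectors below are hypothetical and serve only to illustrate the calculation; they are not results from the Pima data.

```r
# Sensitivity and specificity at each candidate cutoff, with "pos" as the event.
# Inf is appended so the curve also includes the point where nobody is flagged.
roc_points <- function(truth, prob_pos) {
  thresholds <- c(sort(unique(prob_pos)), Inf)
  rows <- lapply(thresholds, function(t) {
    pred_pos <- prob_pos >= t
    data.frame(
      threshold   = t,
      sensitivity = sum(pred_pos & truth == "pos") / sum(truth == "pos"),
      specificity = sum(!pred_pos & truth == "neg") / sum(truth == "neg")
    )
  })
  do.call(rbind, rows)
}

# Row of the ROC table nearest the top-left corner (1 - specificity = 0, sensitivity = 1)
closest_to_corner <- function(roc) {
  dist <- sqrt((1 - roc$sensitivity)^2 + (1 - roc$specificity)^2)
  roc[which.min(dist), ]
}

# Hypothetical toy values, only to demonstrate the functions
toy_truth <- factor(c("neg", "neg", "pos", "pos", "pos", "neg"),
                    levels = c("neg", "pos"))
toy_prob  <- c(0.10, 0.80, 0.90, 0.60, 0.76, 0.30)

toy_roc <- roc_points(toy_truth, toy_prob)
best <- closest_to_corner(toy_roc)
best

# ROC curve: false positive rate on the x-axis, true positive rate on the y-axis
plot(1 - toy_roc$specificity, toy_roc$sensitivity, type = "b",
     xlab = "1 - specificity", ylab = "sensitivity")
abline(0, 1, lty = 2)  # reference line for a classifier no better than chance
```

In a tidymodels workflow, yardstick's `roc_curve()` returns the same `.threshold`, `specificity`, and `sensitivity` columns (again with `event_level = "second"` for this factor ordering), and `autoplot()` draws the curve, so the distance calculation can be applied to that output directly.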