Tidy Data and Dates

STAT 220

Bastola

What are tidy data?

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each value has its own cell

Untidy data: example 1

untidy_data <- tibble(
  name = c("Ana","Bob","Cara"),
  wt_07_01_2021 = c(100, 150, 140),
  wt_08_01_2021 = c(104, 155, 138),
  wt_09_01_2021 = c(NA, 160, 142)
)
# A tibble: 3 × 4
  name  wt_07_01_2021 wt_08_01_2021 wt_09_01_2021
  <chr>         <dbl>         <dbl>         <dbl>
1 Ana             100           104            NA
2 Bob             150           155           160
3 Cara            140           138           142

Tidy data: example 1

library(tidyr)  # for tidying messy data
library(lubridate)  # for date manipulations
library(stringr) # for string manipulations
untidy_data %>% 
  pivot_longer(names_to = "date",  
               values_to = "weight", 
               cols = -name) %>%
  mutate(date = stringr::str_remove(date,"wt_"), 
         date = lubridate::dmy(date))    
# A tibble: 9 × 3
  name  date       weight
  <chr> <date>      <dbl>
1 Ana   2021-01-07    100
2 Ana   2021-01-08    104
3 Ana   2021-01-09     NA
4 Bob   2021-01-07    150
5 Bob   2021-01-08    155
6 Bob   2021-01-09    160
7 Cara  2021-01-07    140
8 Cara  2021-01-08    138
9 Cara  2021-01-09    142

Wide data

Wide data has one row per subject, with multiple columns for their repeated measurements

Wide
id SBP_visit1 SBP_visit2 SBP_visit3
a 132 100 111
b 120 106 112
c 130 126 128
d 118 116 108

Long data


Long data has multiple rows per subject, with one column for the measurement variable and another indicating from when/where the repeated measures are from

Long
id visit SBP
a 1 132
b 1 120
c 1 130
d 1 118
a 2 100
b 2 106
c 2 126
d 2 116
a 3 111
b 3 112
c 3 128
d 3 108

Wide to long: pivot_longer()

BP_wide
# A tibble: 4 × 5
  id    sex   SBP_v1 SBP_v2 SBP_v3
  <chr> <chr>  <dbl>  <dbl>  <dbl>
1 a     F        130    110    112
2 b     M        120    116    122
3 c     M        130    136    138
4 d     F        119    106    118
BP_long <- BP_wide %>%
  pivot_longer(names_to = "visit",  
               values_to = "SBP",  
               cols = SBP_v1:SBP_v3)   
BP_long
# A tibble: 12 × 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 a     F     SBP_v2   110
 3 a     F     SBP_v3   112
 4 b     M     SBP_v1   120
 5 b     M     SBP_v2   116
 6 b     M     SBP_v3   122
 7 c     M     SBP_v1   130
 8 c     M     SBP_v2   136
 9 c     M     SBP_v3   138
10 d     F     SBP_v1   119
11 d     F     SBP_v2   106
12 d     F     SBP_v3   118

Wide to long: pivot_longer()

pivot_longer lengthens data, increasing the number of rows and decreasing the number of columns.

Need to specify:

  • new column names

    • names_to: stores row names of wide data’s columns
    • .bvalues_to: stores data values
  • which columns to pivot

Long to wide: pivot_wider()

BP_long
# A tibble: 12 × 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 a     F     SBP_v2   110
 3 a     F     SBP_v3   112
 4 b     M     SBP_v1   120
 5 b     M     SBP_v2   116
 6 b     M     SBP_v3   122
 7 c     M     SBP_v1   130
 8 c     M     SBP_v2   136
 9 c     M     SBP_v3   138
10 d     F     SBP_v1   119
11 d     F     SBP_v2   106
12 d     F     SBP_v3   118
BP_wide1 <- BP_long %>% 
  pivot_wider(names_from = "visit", 
              values_from = "SBP") 
BP_wide1
# A tibble: 4 × 5
  id    sex   SBP_v1 SBP_v2 SBP_v3
  <chr> <chr>  <dbl>  <dbl>  <dbl>
1 a     F        130    110    112
2 b     M        120    116    122
3 c     M        130    136    138
4 d     F        119    106    118

Long to wide: pivot_wider()

pivot_wider increases number of columns and decreases the number of rows.

Need to specify:

  • new column names

    • names_from: get the name of the column
    • values_from: get the cell values from

Separate Info

BP_long
# A tibble: 12 × 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 a     F     SBP_v2   110
 3 a     F     SBP_v3   112
 4 b     M     SBP_v1   120
 5 b     M     SBP_v2   116
 6 b     M     SBP_v3   122
 7 c     M     SBP_v1   130
 8 c     M     SBP_v2   136
 9 c     M     SBP_v3   138
10 d     F     SBP_v1   119
11 d     F     SBP_v2   106
12 d     F     SBP_v3   118
BP_long1 <- BP_long %>%
  separate(visit, c("type", 
                    "visit")) 
BP_long1
# A tibble: 12 × 5
   id    sex   type  visit   SBP
   <chr> <chr> <chr> <chr> <dbl>
 1 a     F     SBP   v1      130
 2 a     F     SBP   v2      110
 3 a     F     SBP   v3      112
 4 b     M     SBP   v1      120
 5 b     M     SBP   v2      116
 6 b     M     SBP   v3      122
 7 c     M     SBP   v1      130
 8 c     M     SBP   v2      136
 9 c     M     SBP   v3      138
10 d     F     SBP   v1      119
11 d     F     SBP   v2      106
12 d     F     SBP   v3      118

readr::parse_number()

Goal: Extract first number from the visit column and remove the characters

BP_long
# A tibble: 12 × 4
   id    sex   visit    SBP
   <chr> <chr> <chr>  <dbl>
 1 a     F     SBP_v1   130
 2 a     F     SBP_v2   110
 3 a     F     SBP_v3   112
 4 b     M     SBP_v1   120
 5 b     M     SBP_v2   116
 6 b     M     SBP_v3   122
 7 c     M     SBP_v1   130
 8 c     M     SBP_v2   136
 9 c     M     SBP_v3   138
10 d     F     SBP_v1   119
11 d     F     SBP_v2   106
12 d     F     SBP_v3   118
BP_long2 <- BP_long %>% 
  mutate(visit = parse_number(visit)) 
BP_long2
# A tibble: 12 × 4
   id    sex   visit   SBP
   <chr> <chr> <dbl> <dbl>
 1 a     F         1   130
 2 a     F         2   110
 3 a     F         3   112
 4 b     M         1   120
 5 b     M         2   116
 6 b     M         3   122
 7 c     M         1   130
 8 c     M         2   136
 9 c     M         3   138
10 d     F         1   119
11 d     F         2   106
12 d     F         3   118

Make cleaned-up long data wide

head(BP_long2, 4)
# A tibble: 4 × 4
  id    sex   visit   SBP
  <chr> <chr> <dbl> <dbl>
1 a     F         1   130
2 a     F         2   110
3 a     F         3   112
4 b     M         1   120
BP_wide2 <- BP_long2 %>% 
  pivot_wider(names_from = "visit", 
              values_from = "SBP") 
BP_wide2
# A tibble: 4 × 5
  id    sex     `1`   `2`   `3`
  <chr> <chr> <dbl> <dbl> <dbl>
1 a     F       130   110   112
2 b     M       120   116   122
3 c     M       130   136   138
4 d     F       119   106   118
  • Problem: have numbers as column names
  • Solution: have row names start with the key column’s name separated by a character
BP_wide3 <- BP_long2 %>% 
  pivot_wider(names_from = "visit",     
              values_from = "SBP",      
              names_prefix = "value_")  
BP_wide3
# A tibble: 4 × 5
  id    sex   value_1 value_2 value_3
  <chr> <chr>   <dbl>   <dbl>   <dbl>
1 a     F         130     110     112
2 b     M         120     116     122
3 c     M         130     136     138
4 d     F         119     106     118

 Group Activity 1


  • Please clone the ca8-yourusername repository from Github
  • Please do the problem 1 in the class activity for today

10:00

Dates with lubridate

  • Convert characters to special “Date” type

  • Easy date magic examples:

    • add and subtract dates
    • convert to minutes/years/etc
    • change timezones
    • add 1 month to a date…

What kind of date do you have?

Parsing Complex Dates

library(lubridate)
complex_dates <- tibble(
  name = c("Yi", "Mo", "Dee"),
  dob = c("31-October-1952 14:30:15",
          "12-Jan-1984 22:15:00",
          "02-Feb-2002 10:45:30")
)

complex_dates %>%
  mutate(dob_date = dmy_hms(dob)) %>% 
  knitr::kable()  # to make nice tables
name dob dob_date
Yi 31-October-1952 14:30:15 1952-10-31 14:30:15
Mo 12-Jan-1984 22:15:00 1984-01-12 22:15:00
Dee 02-Feb-2002 10:45:30 2002-02-02 10:45:30

Advanced Date Arithmetic

complex_dates %>%
  mutate( dob_date = dmy_hms(dob),
          dob_year = year(dob_date),
          time_since_birth = interval(dob_date, now()),
          age_exact = time_length(time_since_birth, unit = "years")) %>%
  knitr::kable()
name dob dob_date dob_year time_since_birth age_exact
Yi 31-October-1952 14:30:15 1952-10-31 14:30:15 1952 1952-10-31 14:30:15 UTC–2024-05-15 03:37:38 UTC 71.53701
Mo 12-Jan-1984 22:15:00 1984-01-12 22:15:00 1984 1984-01-12 22:15:00 UTC–2024-05-15 03:37:38 UTC 40.33668
Dee 02-Feb-2002 10:45:30 2002-02-02 10:45:30 2002 2002-02-02 10:45:30 UTC–2024-05-15 03:37:38 UTC 22.28061

Advanced Date Arithmetic: alternate

complex_dates %>%
  mutate(dob_date = dmy_hms(dob),
        age_exact = as.numeric(interval(dob_date, now()) / years(1))) %>% 
  knitr::kable()
name dob dob_date age_exact
Yi 31-October-1952 14:30:15 1952-10-31 14:30:15 71.53701
Mo 12-Jan-1984 22:15:00 1984-01-12 22:15:00 40.33668
Dee 02-Feb-2002 10:45:30 2002-02-02 10:45:30 22.28061

Creating Date-Time Objects with make_datetime()

# Components of date and time
year_component <- 2024
month_component <- 1
day_component <- 19
hour_component <- 9
minute_component <- 03
second_component <- 30
# Create a date-time object
date_time_object <- make_datetime(
  year = year_component,
  month = month_component,
  day = day_component,
  hour = hour_component,
  min = minute_component,
  sec = second_component
)
date_time_object 
[1] "2024-01-19 09:03:30 UTC"

Advanced Date-Time Arithmetic

date_time <- tibble(date = ymd("2024-01-13", "2024-03-06", "2025-06-30"))

date_time %>%
  mutate(add_month = date %m+% months(1), 
         add_year = date %m+% years(1),
         subtract_week = date %m-% weeks(1)) %>% 
  knitr::kable()
date add_month add_year subtract_week
2024-01-13 2024-02-13 2025-01-13 2024-01-06
2024-03-06 2024-04-06 2025-03-06 2024-02-28
2025-06-30 2025-07-30 2026-06-30 2025-06-23

Duration and Time Differences

time_diff_example <- tibble(
  start_date = ymd("2024-01-03"),
  end_date = ymd("2024-03-17")
)

time_diff_example %>%
  mutate(time_diff = end_date - start_date,
         duration_days = time_length(time_diff, unit = "days"),
         duration_weeks = as.duration(time_diff) / dweeks(1),
         duration_months = as.duration(time_diff) / dmonths(1),
         duration_years = as.duration(time_diff) / dyears(1))  %>% 
  knitr::kable()
start_date end_date time_diff duration_days duration_weeks duration_months duration_years
2024-01-03 2024-03-17 74 days 74 10.57143 2.431211 0.202601

 Group Activity 2


  • Please do the remaining problems in the class activity.
  • Submit to Gradescope on moodle when done!

10:00