Data Imports

STAT 220

Bastola

Let’s start with a messy data!

bar <- tibble(`First Name` = c("Yi", "Do", NA),
              `Last initials` = c("C", "R", ""),
              `% in` = c(0.1, 0.5, NA),
              `ñ$$$` = c(1, 2, NA),
              ` ` = c(3, 2, 1),
              ` hi` = c("a", "b", NA),
              `null` = c(NA, NA, "NULL"))

bar %>% knitr::kable()
First Name Last initials % in ñ$$$ hi null
Yi C 0.1 1 3 a NA
Do R 0.5 2 2 b NA
NA NA NA 1 NA NULL

Clean messy column names

library(janitor)
cleaned_bar <- bar %>%
  janitor::clean_names() 
cleaned_bar %>% knitr::kable()
first_name last_initials percent_in n x hi null
Yi C 0.1 1 3 a NA
Do R 0.5 2 2 b NA
NA NA NA 1 NA NULL

tidyr::drop_na()

Remove all rows with any missing data. May not be useful!

cleaned_bar %>% tidyr::drop_na() %>% knitr::kable()
first_name last_initials percent_in n x hi null

tidyr::drop_na(col)

Remove rows with NA in selected columns

cleaned_bar %>% drop_na(hi) %>% knitr::kable()
first_name last_initials percent_in n x hi null
Yi C 0.1 1 3 a NA
Do R 0.5 2 2 b NA

Convert any empty or null strings to NA with na_if()

cleaned_bar <- cleaned_bar %>%
  mutate(last_initials = na_if(last_initials, ""),
         null = na_if(null, "NULL"))
cleaned_bar %>% knitr::kable()
first_name last_initials percent_in n x hi null
Yi C 0.1 1 3 a NA
Do R 0.5 2 2 b NA
NA NA NA NA 1 NA NA

Handling Missing Data with replace_na()and remove_empty()

bar_imputed <- cleaned_bar %>%
  mutate(first_name = replace_na(first_name, "Unknown"),
         last_initials = replace_na(last_initials, "Unknown"),
         percent_in = replace_na(percent_in, mean(percent_in, na.rm = TRUE)),
         n = replace_na(n, median(n, na.rm = TRUE)),
         hi = replace_na(hi, "c")) %>% 
  janitor::remove_empty("cols")
bar_imputed %>% knitr::kable()
first_name last_initials percent_in n x hi
Yi C 0.1 1.0 3 a
Do R 0.5 2.0 2 b
Unknown Unknown 0.3 1.5 1 c

Working Directories

The working directory is where R looks for files and saves files by default.

getwd() # see working directory
setwd() # change your working directory

To set working directory to your STAT 220 course folder

setwd("path/to/stat220-folder/")  # set
getwd()     # check

Useful Terminal Commands:

$ cd   # change directory
$ ls   # unix command to list files
$ pwd  # present working directory
$ grep  # search for patterns in files
$ mkdir # create a new directory
$ mv # move or rename files or directories

Web imports

To your working environment:

url <- "https://raw.githubusercontent.com/deepbas/statdatasets/main/murders.csv"
dat <- read.csv(url) # base R 

To download file to working folder:

download.file(url, "murders.csv")

readr package

  • readr is a part of tidyverse library
  • Includes functions for reading data stored in text file spreadsheets into R.
  • Functions in the package include read_csv(), read_tsv(), read_delim() and more.
  • These differ by the delimiter they use to split columns.

Reading and Writing Files:

All readr functions share a common syntax

library(readr)
df <- read_csv(file = "path/to/file.csv", ...)
  • Generate tibbles rather than data frames

readr functions

function reads
read_csv() Comma separated values
read_csv2() Semi-colon separated values
read_delim() General delimited files
read_fwf() Fixed width files
read_log() Apache log files
read_table() Space separated
read_tsv() Tab delimited values

Data frames and tibbles Conversion

  • as_tibble() - convert a data frame to a tibble
  • as.data.frame() - convert a tibble to a data frame

 Group Activity 1


  • Please clone the ca9-yourusername repository from Github
  • Please do the problem 1 in the class activity for today

10:00

Did it work as expected?

Rows: 549
Columns: 16
$ series                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ episode               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, …
$ baker                 <chr> "Annetha", "David", "Edd", "Jasminder", "Jonatha…
$ technical             <chr> "2nd", "3rd", "1st", "N/A", "9th", "N/A", "8th",…
$ result                <chr> "IN", "IN", "IN", "IN", "IN", "IN", "IN", "IN", …
$ uk_airdate            <chr> "17 August 2010", "17 August 2010", "17 August 2…
$ us_season             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ us_airdate            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ showstopper_chocolate <chr> "chocolate", "chocolate", "no chocolate", "no ch…
$ showstopper_dessert   <chr> "other", "other", "other", "other", "other", "ca…
$ showstopper_fruit     <chr> "no fruit", "no fruit", "no fruit", "no fruit", …
$ showstopper_nut       <chr> "no nut", "no nut", "no nut", "no nut", "almond"…
$ signature_chocolate   <chr> "no chocolate", "chocolate", "no chocolate", "no…
$ signature_dessert     <chr> "cake", "cake", "cake", "cake", "cake", "cake", …
$ signature_fruit       <chr> "no fruit", "fruit", "fruit", "fruit", "fruit", …
$ signature_nut         <chr> "no nut", "no nut", "no nut", "no nut", "no nut"…

We want technical to be numerical and uk_airdate to be date

The col_types argument

By default, looks at first 1000 rows to guess variable data types (guess_max)

desserts <- read_csv(
  "https://raw.githubusercontent.com/deepbas/statdatasets/main/desserts.csv",
  col_types = list( 
    technical = col_number(), 
    uk_airdate = col_date()   
  ) 
)

Looking for problems

List of potential problems parsing the file

problems(desserts)
# A tibble: 556 × 5
     row   col expected        actual         file 
   <int> <int> <chr>           <chr>          <chr>
 1     2     6 date in ISO8601 17 August 2010 ""   
 2     3     6 date in ISO8601 17 August 2010 ""   
 3     4     6 date in ISO8601 17 August 2010 ""   
 4     5     4 a number        N/A            ""   
 5     5     6 date in ISO8601 17 August 2010 ""   
 6     6     6 date in ISO8601 17 August 2010 ""   
 7     7     4 a number        N/A            ""   
 8     7     6 date in ISO8601 17 August 2010 ""   
 9     8     6 date in ISO8601 17 August 2010 ""   
10     9     4 a number        N/A            ""   
# ℹ 546 more rows

Date formatting

# A tibble: 556 × 5
    row   col expected        actual         file 
  <int> <int> <chr>           <chr>          <chr>
1     2     6 date in ISO8601 17 August 2010 ""   
2     3     6 date in ISO8601 17 August 2010 ""   
3     4     6 date in ISO8601 17 August 2010 ""   
4     5     4 a number        N/A            ""   
5     5     6 date in ISO8601 17 August 2010 ""   
# ℹ 551 more rows

ISO8601 format: 2021-10-04

What we have: 17 August 2010

Adding format instructions

desserts <- read_csv(
  "https://raw.githubusercontent.com/deepbas/statdatasets/main/desserts.csv",
    col_types = list(
    technical = col_number(), 
    uk_airdate = col_date(format = "%d %B %Y")  
  ) 
)
  • Year: "%Y" (4 digits). "%y" (2 digits)
  • Month: "%m" (2 digits), "%b" (abbreviated name in current locale), "%B" (full name in current locale).
  • Day: "%d" (2 digits), "%e" (optional leading space)

Looking for more problems

List of potential problems parsing the file

problems(desserts)
# A tibble: 7 × 5
    row   col expected actual file 
  <int> <int> <chr>    <chr>  <chr>
1     5     4 a number N/A    ""   
2     7     4 a number N/A    ""   
3     9     4 a number N/A    ""   
4    11     4 a number N/A    ""   
5    35     4 a number N/A    ""   
6    36     4 a number N/A    ""   
7    37     4 a number N/A    ""   

Addressing missing values

By default na = c("", "NA") are the recognized missing values

desserts <- read_csv(
  "https://raw.githubusercontent.com/deepbas/statdatasets/main/desserts.csv",
  col_types = list(
    technical = col_number(), 
    uk_airdate = col_date(format = "%d %B %Y")
  ),
  na = c("", "NA", "N/A") 
)

No more problems

problems(desserts)
# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>

The Dataset

# A tibble: 549 × 16
   series episode baker     technical result uk_airdate us_season us_airdate
    <dbl>   <dbl> <chr>         <dbl> <chr>  <date>         <dbl> <date>    
 1      1       1 Annetha           2 IN     2010-08-17        NA NA        
 2      1       1 David             3 IN     2010-08-17        NA NA        
 3      1       1 Edd               1 IN     2010-08-17        NA NA        
 4      1       1 Jasminder        NA IN     2010-08-17        NA NA        
 5      1       1 Jonathan          9 IN     2010-08-17        NA NA        
 6      1       1 Louise           NA IN     2010-08-17        NA NA        
 7      1       1 Miranda           8 IN     2010-08-17        NA NA        
 8      1       1 Ruth             NA IN     2010-08-17        NA NA        
 9      1       1 Lea              10 OUT    2010-08-17        NA NA        
10      1       1 Mark             NA OUT    2010-08-17        NA NA        
# ℹ 539 more rows
# ℹ 8 more variables: showstopper_chocolate <chr>, showstopper_dessert <chr>,
#   showstopper_fruit <chr>, showstopper_nut <chr>, signature_chocolate <chr>,
#   signature_dessert <chr>, signature_fruit <chr>, signature_nut <chr>

Column casting functions

Type dplyr::glimpse() readr::col_*()
logical <lgl> col_logical
numeric <int> or <dbl> col_number
character <chr> col_character
factor <fct> col_factor
date <date> col_date

?read_csv

read_csv(file, 
         col_names = TRUE,
         col_types = NULL,
         locale = default_locale(),
         na = c("", "NA"), 
         quoted_na = TRUE,
         quote = "\"", 
         comment = "",
         trim_ws = TRUE,
         skip = 0,
         n_max = Inf,
         guess_max = min(1000, n_max),
         progress = show_progress())

 Group Activity 2


  • Please do the remaining problems in the class activity.
  • Submit to Gradescope on moodle when done!

10:00