Data Imports

STAT 220

Bastola

Let’s start with a messy data!

bar <- tibble(`First Name` = c("Yi", "Do", NA),
              `Last initials` = c("C", "R", ""),
              `% in` = c(0.1, 0.5, NA),
              `ñ$$$` = c(1, 2, NA),
              ` ` = c(3, 2, 1),
              ` hi` = c("a", "b", NA),
              `null` = c(NA, NA, "NULL"))

bar %>% knitr::kable()

First Name	Last initials	% in	ñ$$$		hi	null
Yi	C	0.1	1	3	a	NA
Do	R	0.5	2	2	b	NA
NA		NA	NA	1	NA	NULL

Clean messy column names

library(janitor)
cleaned_bar <- bar %>%
  janitor::clean_names() 
cleaned_bar %>% knitr::kable()

first_name	last_initials	percent_in	n	x	hi	null
Yi	C	0.1	1	3	a	NA
Do	R	0.5	2	2	b	NA
NA		NA	NA	1	NA	NULL

`tidyr::drop_na()`

Remove all rows with any missing data. May not be useful!

cleaned_bar %>% tidyr::drop_na() %>% knitr::kable()

first_name	last_initials	percent_in	n	x	hi	null

`tidyr::drop_na(col)`

Remove rows with NA in selected columns

cleaned_bar %>% drop_na(hi) %>% knitr::kable()

first_name	last_initials	percent_in	n	x	hi	null
Yi	C	0.1	1	3	a	NA
Do	R	0.5	2	2	b	NA

Convert any empty or null strings to NA with `na_if()`

cleaned_bar <- cleaned_bar %>%
  mutate(last_initials = na_if(last_initials, ""),
         null = na_if(null, "NULL"))
cleaned_bar %>% knitr::kable()

first_name	last_initials	percent_in	n	x	hi	null
Yi	C	0.1	1	3	a	NA
Do	R	0.5	2	2	b	NA
NA	NA	NA	NA	1	NA	NA

Handling Missing Data with `replace_na()`and `remove_empty()`

bar_imputed <- cleaned_bar %>%
  mutate(first_name = replace_na(first_name, "Unknown"),
         last_initials = replace_na(last_initials, "Unknown"),
         percent_in = replace_na(percent_in, mean(percent_in, na.rm = TRUE)),
         n = replace_na(n, median(n, na.rm = TRUE)),
         hi = replace_na(hi, "c")) %>% 
  janitor::remove_empty("cols")
bar_imputed %>% knitr::kable()

first_name	last_initials	percent_in	n	x	hi
Yi	C	0.1	1.0	3	a
Do	R	0.5	2.0	2	b
Unknown	Unknown	0.3	1.5	1	c

Working Directories

The working directory is where R looks for files and saves files by default.

getwd() # see working directory
setwd() # change your working directory

To set working directory to your STAT 220 course folder

setwd("path/to/stat220-folder/")  # set
getwd()     # check

Useful Terminal Commands:

$ cd   # change directory
$ ls   # unix command to list files
$ pwd  # present working directory
$ grep  # search for patterns in files
$ mkdir # create a new directory
$ mv # move or rename files or directories

Web imports

To your working environment:

url <- "https://raw.githubusercontent.com/deepbas/statdatasets/main/murders.csv"
dat <- read.csv(url) # base R

To download file to working folder:

download.file(url, "murders.csv")

`readr` package

readr is a part of tidyverse library
Includes functions for reading data stored in text file spreadsheets into R.
Functions in the package include read_csv(), read_tsv(), read_delim() and more.
These differ by the delimiter they use to split columns.

Reading and Writing Files:

All readr functions share a common syntax

library(readr)
df <- read_csv(file = "path/to/file.csv", ...)

Generate tibbles rather than data frames

`readr` functions

function	reads
read_csv()	Comma separated values
read_csv2()	Semi-colon separated values
read_delim()	General delimited files
read_fwf()	Fixed width files
read_log()	Apache log files
read_table()	Space separated
read_tsv()	Tab delimited values

Data frames and tibbles Conversion

as_tibble() - convert a data frame to a tibble
as.data.frame() - convert a tibble to a data frame

Please clone the ca9-yourusername repository from Github
Please do the problem 1 in the class activity for today

10:00

Did it work as expected?

Rows: 549
Columns: 16
$ series                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ episode               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, …
$ baker                 <chr> "Annetha", "David", "Edd", "Jasminder", "Jonatha…
$ technical             <chr> "2nd", "3rd", "1st", "N/A", "9th", "N/A", "8th",…
$ result                <chr> "IN", "IN", "IN", "IN", "IN", "IN", "IN", "IN", …
$ uk_airdate            <chr> "17 August 2010", "17 August 2010", "17 August 2…
$ us_season             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ us_airdate            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ showstopper_chocolate <chr> "chocolate", "chocolate", "no chocolate", "no ch…
$ showstopper_dessert   <chr> "other", "other", "other", "other", "other", "ca…
$ showstopper_fruit     <chr> "no fruit", "no fruit", "no fruit", "no fruit", …
$ showstopper_nut       <chr> "no nut", "no nut", "no nut", "no nut", "almond"…
$ signature_chocolate   <chr> "no chocolate", "chocolate", "no chocolate", "no…
$ signature_dessert     <chr> "cake", "cake", "cake", "cake", "cake", "cake", …
$ signature_fruit       <chr> "no fruit", "fruit", "fruit", "fruit", "fruit", …
$ signature_nut         <chr> "no nut", "no nut", "no nut", "no nut", "no nut"…

We want technical to be numerical and uk_airdate to be date

The `col_types` argument

By default, looks at first 1000 rows to guess variable data types (guess_max)

desserts <- read_csv(
  "https://raw.githubusercontent.com/deepbas/statdatasets/main/desserts.csv",
  col_types = list( 
    technical = col_number(), 
    uk_airdate = col_date()   
  ) 
)

Looking for `problems`

List of potential problems parsing the file

problems(desserts)

# A tibble: 556 × 5
     row   col expected        actual         file 
   <int> <int> <chr>           <chr>          <chr>
 1     2     6 date in ISO8601 17 August 2010 ""   
 2     3     6 date in ISO8601 17 August 2010 ""   
 3     4     6 date in ISO8601 17 August 2010 ""   
 4     5     4 a number        N/A            ""   
 5     5     6 date in ISO8601 17 August 2010 ""   
 6     6     6 date in ISO8601 17 August 2010 ""   
 7     7     4 a number        N/A            ""   
 8     7     6 date in ISO8601 17 August 2010 ""   
 9     8     6 date in ISO8601 17 August 2010 ""   
10     9     4 a number        N/A            ""   
# ℹ 546 more rows

Date formatting

# A tibble: 556 × 5
    row   col expected        actual         file 
  <int> <int> <chr>           <chr>          <chr>
1     2     6 date in ISO8601 17 August 2010 ""   
2     3     6 date in ISO8601 17 August 2010 ""   
3     4     6 date in ISO8601 17 August 2010 ""   
4     5     4 a number        N/A            ""   
5     5     6 date in ISO8601 17 August 2010 ""   
# ℹ 551 more rows

ISO8601 format: 2021-10-04

What we have: 17 August 2010

Adding `format` instructions

desserts <- read_csv(
  "https://raw.githubusercontent.com/deepbas/statdatasets/main/desserts.csv",
    col_types = list(
    technical = col_number(), 
    uk_airdate = col_date(format = "%d %B %Y")  
  ) 
)

Year: "%Y" (4 digits). "%y" (2 digits)
Month: "%m" (2 digits), "%b" (abbreviated name in current locale), "%B" (full name in current locale).
Day: "%d" (2 digits), "%e" (optional leading space)

Looking for more `problems`

List of potential problems parsing the file

problems(desserts)

# A tibble: 7 × 5
    row   col expected actual file 
  <int> <int> <chr>    <chr>  <chr>
1     5     4 a number N/A    ""   
2     7     4 a number N/A    ""   
3     9     4 a number N/A    ""   
4    11     4 a number N/A    ""   
5    35     4 a number N/A    ""   
6    36     4 a number N/A    ""   
7    37     4 a number N/A    ""

Addressing missing values

By default na = c("", "NA") are the recognized missing values

desserts <- read_csv(
  "https://raw.githubusercontent.com/deepbas/statdatasets/main/desserts.csv",
  col_types = list(
    technical = col_number(), 
    uk_airdate = col_date(format = "%d %B %Y")
  ),
  na = c("", "NA", "N/A") 
)

No more `problems`

problems(desserts)

# A tibble: 0 × 5
# ℹ 5 variables: row <int>, col <int>, expected <chr>, actual <chr>, file <chr>

The Dataset

# A tibble: 549 × 16
   series episode baker     technical result uk_airdate us_season us_airdate
    <dbl>   <dbl> <chr>         <dbl> <chr>  <date>         <dbl> <date>    
 1      1       1 Annetha           2 IN     2010-08-17        NA NA        
 2      1       1 David             3 IN     2010-08-17        NA NA        
 3      1       1 Edd               1 IN     2010-08-17        NA NA        
 4      1       1 Jasminder        NA IN     2010-08-17        NA NA        
 5      1       1 Jonathan          9 IN     2010-08-17        NA NA        
 6      1       1 Louise           NA IN     2010-08-17        NA NA        
 7      1       1 Miranda           8 IN     2010-08-17        NA NA        
 8      1       1 Ruth             NA IN     2010-08-17        NA NA        
 9      1       1 Lea              10 OUT    2010-08-17        NA NA        
10      1       1 Mark             NA OUT    2010-08-17        NA NA        
# ℹ 539 more rows
# ℹ 8 more variables: showstopper_chocolate <chr>, showstopper_dessert <chr>,
#   showstopper_fruit <chr>, showstopper_nut <chr>, signature_chocolate <chr>,
#   signature_dessert <chr>, signature_fruit <chr>, signature_nut <chr>

Column casting functions

Type	dplyr::glimpse()	readr::col_*()
logical	`<lgl>`	`col_logical`
numeric	`<int>` or `<dbl>`	`col_number`
character	`<chr>`	`col_character`
factor	`<fct>`	`col_factor`
date	`<date>`	`col_date`

`?read_csv`

read_csv(file, 
         col_names = TRUE,
         col_types = NULL,
         locale = default_locale(),
         na = c("", "NA"), 
         quoted_na = TRUE,
         quote = "\"", 
         comment = "",
         trim_ws = TRUE,
         skip = 0,
         n_max = Inf,
         guess_max = min(1000, n_max),
         progress = show_progress())

Please do the remaining problems in the class activity.
Submit to Gradescope on moodle when done!

10:00

Data Imports

Let’s start with a messy data!

Clean messy column names

tidyr::drop_na()

tidyr::drop_na(col)

Convert any empty or null strings to NA with na_if()

Handling Missing Data with replace_na()and remove_empty()

Working Directories

Useful Terminal Commands:

Web imports

readr package

Reading and Writing Files:

readr functions

Data frames and tibbles Conversion

Group Activity 1