Data wrangling with dplyr

STAT 220

Bastola

Boolean operators

For help, ?base::Logic

Source: R for Data Science, by Grolemund & Wickham

rename()

polluted_cities <- tibble(
  city = c("New York", "London", "Doha",
           "Hongkong", "Beijing"),
  size = c("large", "large", "small",
           "small", "large"),
  amount = c(23.4, 22.2, 12.4, 
             15.6, 121.0)
)

polluted_cities %>% 
  rename(pollution = amount)
# A tibble: 5 × 3
  city     size  pollution
  <chr>    <chr>     <dbl>
1 New York large      23.4
2 London   large      22.2
3 Doha     small      12.4
4 Hongkong small      15.6
5 Beijing  large     121  

Mutating multiple columns at once: mutate_*


  • variants of mutate() that are useful for mutating multiple columns at once
    • mutate_at(), mutate_if(), mutate_all(), etc.
  • which columns get mutated depends on a predicate, can be:
    • a function that returns TRUE/FALSE like is.numeric()
    • variable names through vars()

Mutating multiple columns at once: mutate_*

polluted_cities %>% 
  mutate_at(vars(city:amount), toupper)
# A tibble: 5 × 3
  city     size  amount
  <chr>    <chr> <chr> 
1 NEW YORK LARGE 23.4  
2 LONDON   LARGE 22.2  
3 DOHA     SMALL 12.4  
4 HONGKONG SMALL 15.6  
5 BEIJING  LARGE 121   

Mutating multiple columns at once: mutate_*

polluted_cities %>% 
  mutate_if(is.double, round, digits = 0)
# A tibble: 5 × 3
  city     size  amount
  <chr>    <chr>  <dbl>
1 New York large     23
2 London   large     22
3 Doha     small     12
4 Hongkong small     16
5 Beijing  large    121

Selecting & renaming multiple columns

  • select_*() & rename_*() are variants of select() and rename()
  • use like mutate_*() options on previous slide

 Group Activity 1

  • Please clone the ca7-yourusername repository from Github
  • Please do the problem 1 in the class activity for today
  • Submit to Gradescope on moodle when done!

10:00

Join types



Two-table verbs


  • inner_join() - Merge two datasets. Exclude all unmatched rows.

  • full_join() - Merge two datasets. Keep all observations.

  • left_join() - Merge two datasets. Keep all observations from the origin table.

  • right_join() - Merge two datasets. Keep all observations from the destination table.

  • anti_join() - Drops all observations in origin that have a match in destination table.

Mutating Joins


Differ in their behavior when a match is not found

  • left_join()

  • right_join()

  • inner_join()

  • full_join()

left_join()

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = c(10,100), b = c("a", "r"))
df1
x y
1 2
2 1
df2
x a b
3 10 a
1 100 r
df1 %>% left_join(df2)
x y a b
1 2 100 r
2 1 NA NA

right_join()

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = c(10,100), b = c("a", "r"))
df1
x y
1 2
2 1
df2
x a b
3 10 a
1 100 r
df1 %>% right_join(df2)
x y a b
1 2 100 r
3 NA 10 a

inner_join()

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = c(10,100), b = c("a", "r"))
df1
x y
1 2
2 1
df2
x a b
3 10 a
1 100 r
df1 %>% inner_join(df2)
x y a b
1 2 100 r

Flights data

library(nycflights13)
flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)


head(flights2)
# A tibble: 6 × 8
   year month   day  hour origin dest  tailnum carrier
  <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
1  2013     1     1     5 EWR    IAH   N14228  UA     
2  2013     1     1     5 LGA    IAH   N24211  UA     
3  2013     1     1     5 JFK    MIA   N619AA  AA     
4  2013     1     1     5 JFK    BQN   N804JB  B6     
5  2013     1     1     6 LGA    ATL   N668DN  DL     
6  2013     1     1     5 EWR    ORD   N39463  UA     

Airline information

head(airlines)
# A tibble: 6 × 2
  carrier name                    
  <chr>   <chr>                   
1 9E      Endeavor Air Inc.       
2 AA      American Airlines Inc.  
3 AS      Alaska Airlines Inc.    
4 B6      JetBlue Airways         
5 DL      Delta Air Lines Inc.    
6 EV      ExpressJet Airlines Inc.

left_join()

flights2 %>% 
  left_join(airlines)
# A tibble: 336,776 × 9
    year month   day  hour origin dest  tailnum carrier name                    
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
 1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
 2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
 3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
 4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
 6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
 7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
 8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
 9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
# ℹ 336,766 more rows

Planes information

head(planes)
# A tibble: 6 × 9
  tailnum  year type               manufacturer model engines seats speed engine
  <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
2 N102UW   1998 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
3 N103US   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
4 N104UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
5 N10575   2002 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
6 N105UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…

Keys: controlling how the tables are matched

flights2 %>% left_join(planes, by = "tailnum")
# A tibble: 336,776 × 16
   year.x month   day  hour origin dest  tailnum carrier year.y type            
    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult…
 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult…
 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult…
 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult…
 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult…
 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult…
 7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult…
 8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult…
 9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult…
10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
# ℹ 336,766 more rows
# ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
#   seats <int>, speed <int>, engine <chr>

Airport information

head(airports)
# A tibble: 6 × 8
  faa   name                             lat   lon   alt    tz dst   tzone      
  <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>      
1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America/Ne…
2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America/Ch…
3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America/Ch…
4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America/Ne…
5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America/Ne…
6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America/Ne…

Matching keys

flights2 %>% left_join(airports, by = c("origin" = "faa"))
# A tibble: 336,776 × 15
    year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA      Newar…  40.7 -74.2    18
 2  2013     1     1     5 LGA    IAH   N24211  UA      La Gu…  40.8 -73.9    22
 3  2013     1     1     5 JFK    MIA   N619AA  AA      John …  40.6 -73.8    13
 4  2013     1     1     5 JFK    BQN   N804JB  B6      John …  40.6 -73.8    13
 5  2013     1     1     6 LGA    ATL   N668DN  DL      La Gu…  40.8 -73.9    22
 6  2013     1     1     5 EWR    ORD   N39463  UA      Newar…  40.7 -74.2    18
 7  2013     1     1     6 EWR    FLL   N516JB  B6      Newar…  40.7 -74.2    18
 8  2013     1     1     6 LGA    IAD   N829AS  EV      La Gu…  40.8 -73.9    22
 9  2013     1     1     6 JFK    MCO   N593JB  B6      John …  40.6 -73.8    13
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      La Gu…  40.8 -73.9    22
# ℹ 336,766 more rows
# ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

Filtering joins

Most useful for diagnosing join mismatches

Filtering joins return a copy of the dataset that has been filtered, not augmented (as with mutating joins)

  • semi_join(x,y) : keeps all observations in x that have a match in y.
  • anti_join(x,y) : drops all observations in x that have a match in y.

Another example: semi_join()

df1
x y
1 1
1 2
3 3
4 4
df2
x z
1 a
1 b
2 a
df1 %>% semi_join(df2, by = "x")
x y
1 1
1 2

Another example: anti_join()

df1
x y
1 1
1 2
3 3
4 4
df2
x z
1 a
1 b
2 a
df1 %>% anti_join(df2, by = "x")
x y
3 3
4 4

Set Operations

These expect the x and y inputs to have the same variables, and treat the observations like sets:

  • intersect(x,y)
    • will return only the rows that appear in both datasets
  • union(x,y)
    • return every row that appears in one or more of the datasets
    • If a row appears multiple times union will only return it once
  • setdiff(x,y)
    • will return the rows that appear in the first dataset but not the second

Set operations

df1
x y
1 1
2 1
df2
x y
1 1
2 2
intersect(df1, df2)
x y
1 1
setdiff(df1, df2)
x y
2 1
union(df1, df2)
x y
1 1
2 1
2 2

 Group Activity 2


  • Please do the remaining problems in the class activity.
  • Submit to Gradescope on moodle when done!

10:00