Basic String Manipulation

STAT 220

Bastola

Let’s Define Strings

A string is any sequence of characters
Define a string by surrounding text with either single quotes or double quotes.

s <- "Hello!"    # double quotes define a string
s <- 'Hello!'    # single quotes define a string

The cat() or writeLines() function displays the actual contents of a string, without the surrounding quotes and escape characters that print() shows.

cat(s)

Hello!

writeLines(s)

Hello!

s <- `Hello`    # backquotes do not define a string
s <- "10""    # error - unclosed quotes
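A small sketch of the difference between printing a string and writing it out, using only base R. The variable names `quote_str` and `n_chars` are made up for illustration.

```r
# A string that contains double quotes, defined with outer single quotes
quote_str <- 'She said "hi"'

print(quote_str)       # quoted form, inner quotes shown with a backslash
writeLines(quote_str)  # raw contents: She said "hi"

# nchar() counts the stored characters, not the displayed backslashes
n_chars <- nchar(quote_str)
n_chars
```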

String Parsing

Definition: pulling apart some text or string to do something with it

The most common tasks in string processing include:

extracting numbers from strings, e.g. “12%”
removing unwanted characters from text, e.g. “New Jersey_*”
finding and replacing characters, e.g. “2,150”
extracting specific parts of strings, e.g. “Learning #datascience is fun!”
splitting strings into multiple values, e.g. “123 Main St, Springfield, MA, 01101”
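As a preview, each of the tasks above can be sketched with a stringr function covered later in these slides. The patterns and variable names below are illustrative choices, not the only way to do each task; `str_remove_all()` is a stringr shortcut for replacing matches with an empty string.

```r
library(stringr)

# Extract the number from a percentage
pct <- str_extract("12%", "\\d+")

# Remove unwanted characters (underscore and asterisk)
state <- str_remove_all("New Jersey_*", "[_*]")

# Find and replace: drop the thousands separator
count <- str_replace_all("2,150", ",", "")

# Extract a specific part of the string: the hashtag
tag <- str_extract("Learning #datascience is fun!", "#\\w+")

# Split one string into multiple values
parts <- str_split("123 Main St, Springfield, MA, 01101", ", ")[[1]]

pct; state; count; tag; parts
```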

Regular expressions: Regex

Regular expressions are a language for expressing patterns in strings

A regex can include special characters that stand for a pattern rather than for themselves, unlike a regular string
Base R also supports regex (e.g. grepl(), sub(), gsub()), but we will use the stringr package

`stringr` package

Functions for detecting, locating, extracting, and replacing elements of strings.
Function names begin with str_ and take the string as the first argument

stringr cheatsheet

Special characters

The “escape” backslash \ is used to escape the special use of certain characters

writeLines("\"")
# "
writeLines("\\")
# \
writeLines("Math\\Stats")
# Math\Stats

To include both single and double quotes in a string, escape the quote that matches the outer quotes with \

s <- '5\'10"'    # outer single quote
writeLines(s)
# 5'10"

s <- "5'10\""    # outer double quote
writeLines(s)
# 5'10"

Combining strings

str_c("iron", "wine")

[1] "ironwine"

str_flatten(c("iron", "wine"), 
            collapse = " and ")

[1] "iron and wine"

a <- c("a", "b", "c")
b <- c("A", "B", "C")
str_c(a, b)

[1] "aA" "bB" "cC"
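A short sketch of how `str_c()` handles recycling, the `collapse` argument, and missing values. The inputs here are invented for illustration.

```r
library(stringr)

# A length-one input is recycled: "x" is paired with each number
ids <- str_c("x", 1:3, sep = "_")

# collapse joins the resulting vector into a single string
rhs <- str_c(c("a", "b", "c"), collapse = " + ")

# A missing value makes the combined result missing
with_na <- str_c("a", NA)

ids; rhs; with_na
```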

Combining strings

building <- "CMC"
room <- "102"
begin_time <- "11:10 a.m."
end_time <- "12:40 p.m."
days <- "MWF"
class <- "STAT 220"

str_c(class, "meets from", 
      begin_time, "to", end_time, 
      days, "in", building, room, sep = " ")

[1] "STAT 220 meets from 11:10 a.m. to 12:40 p.m. MWF in CMC 102"

`str_length()`

tells you how many characters are in each entry of a character vector

gss_cat %>% names()

[1] "year"    "marital" "age"     "race"    "rincome" "partyid" "relig"  
[8] "denom"   "tvhours"

# length of each column name
gss_cat %>% names() %>% str_length()

[1] 4 7 3 4 7 7 5 5 7
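A minimal sketch of how `str_length()` treats empty and missing entries, and how the lengths can be used to filter a vector. The vector `words` is a made-up example.

```r
library(stringr)

words <- c("apple", "", NA, "fig")

# An empty string has length 0; a missing value stays missing
lens <- str_length(words)

# Keep only the entries with more than 3 characters
long_words <- words[!is.na(lens) & lens > 3]

lens; long_words
```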

`str_count()`

counts the number of non-overlapping matches of a pattern in each entry of a character vector

gss_cat %>% names()

[1] "year"    "marital" "age"     "race"    "rincome" "partyid" "relig"  
[8] "denom"   "tvhours"

# count number of vowels in each column name
vowels_pattern <- "[aeiouAEIOU]"
gss_cat %>% names() %>% str_count(vowels_pattern)

[1] 2 3 2 2 3 2 2 2 2
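To see what "non-overlapping" means, here is a small sketch on invented strings. Once a match is found, the search resumes after the end of that match.

```r
library(stringr)

n_a  <- str_count("banana", "a")   # each "a" is counted
n_an <- str_count("banana", "an")  # b-an-an-a
n_aa <- str_count("aaaa", "aa")    # aa-aa: overlapping matches are not counted

n_a; n_an; n_aa
```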

`str_glue()`

allows one to interpolate strings and values that have been assigned to names in R

y <- lubridate::now() # current date and time
str_glue("today is {y}.")

today is 2024-05-14 22:38:04.584507.

name <- c("Alex", "Mia")
dob <- c(lubridate::ymd("1992/12/24"), lubridate::ymd("1994/02/14"))
str_glue("Hi, my name is {name} and I was born in {dob}.")

Hi, my name is Alex and I was born in 1992-12-24.
Hi, my name is Mia and I was born in 1994-02-14.
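The braces in `str_glue()` can hold any R expression, not just a name. A minimal sketch with made-up values `n_rows` and `n_cols`:

```r
library(stringr)

n_rows <- 51
n_cols <- 4

# The expression inside {} is evaluated before being inserted
msg <- str_glue("The table has {n_rows} rows and {n_rows * n_cols} cells.")
msg
```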

`str_sub()`

Extract and replace substrings from a character vector

phrase <- "cellar door"
str_sub(phrase, start = 1, end = 6)

[1] "cellar"

str_sub(phrase, start = c(1,8), end = c(6,11))

[1] "cellar" "door"

str_sub(phrase, start = c(1,8), end = c(6,11)) %>% 
  str_flatten(collapse = " ")

[1] "cellar door"
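Two further uses of `str_sub()`, sketched on the same phrase: negative positions count from the end of the string, and the assignment form replaces a substring in place.

```r
library(stringr)

phrase <- "cellar door"

# Negative positions count backwards from the last character
last_word <- str_sub(phrase, start = -4, end = -1)

# Assignment form: replace characters 1 to 6
str_sub(phrase, start = 1, end = 6) <- "garage"

last_word; phrase
```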

Group Activity 1

Please clone the ca11-yourusername repository from GitHub.
Please complete problem 1 in today’s class activity.

More Special Characters

The | symbol inside a regex means "or"
[abe] means one of a, b, or e
Use \\n to match a newline character
Use \\s to match white space characters (spaces, tabs, and newlines)
Use \\w to match word characters (letters, digits, and the underscore)
- [:alnum:] is similar but does not match the underscore
Use \\d to match digits
- can also use [:digit:]
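A short sketch of these special characters in use. The example strings are invented, and the character class is written as `[[:alnum:]]` here, the bracketed form that also works outside stringr.

```r
library(stringr)

pets <- c("cat", "dog", "bird")

# | means "or"
cat_or_dog <- str_detect(pets, "cat|dog")

# [abe] matches any one of a, b, or e
has_abe <- str_detect(pets, "[abe]")

# \\d matches digits, \\s matches white space
room <- str_extract("Room 102", "\\d+")
underscored <- str_replace_all("a b\tc", "\\s", "_")

# \\w counts the underscore as a word character; [[:alnum:]] does not
n_word  <- str_count("a_1!", "\\w")
n_alnum <- str_count("a_1!", "[[:alnum:]]")

cat_or_dog; has_abe; room; underscored; n_word; n_alnum
```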


More Special Characters

^ = start of a string
$ = end of a string
. = any character
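A minimal sketch of the anchors and the dot, using a made-up vector `foods`:

```r
library(stringr)

foods <- c("apple", "pineapple", "apple pie")

# ^ anchors the match to the start of the string
starts_apple <- str_detect(foods, "^apple")

# $ anchors the match to the end of the string
ends_apple <- str_detect(foods, "apple$")

# . stands for any single character
dot_match <- str_detect(foods, "^.pple")

starts_apple; ends_apple; dot_match
```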

Quantifiers

* = matches the preceding character zero or more times
+ = matches the preceding character one or more times
? = matches the preceding character at most once (i.e. optionally)
{n} = matches the preceding character exactly n times
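The four quantifiers can be compared side by side on a made-up vector in which the letter "a" appears 0, 1, 2, or 3 times. The anchors ^ and $ force the pattern to match the whole string.

```r
library(stringr)

x <- c("ct", "cat", "caat", "caaat")

star <- str_detect(x, "^ca*t$")    # zero or more a's
plus <- str_detect(x, "^ca+t$")    # one or more a's
opt  <- str_detect(x, "^ca?t$")    # zero or one a
two  <- str_detect(x, "^ca{2}t$")  # exactly two a's

rbind(star, plus, opt, two)
```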


Finding strings

days <- c("Monday", "Tuesday", "Wednesday", 
              "Thursday", "Friday", "Saturday", "Sunday")

str_detect(days, "^[Ss]un.*") # returns a logical vector

[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

days %>% 
  str_which("^T")  # indices of matching entries

[1] 2 4

days %>%
  str_locate("day") # start and end positions of the first match, returned as a matrix

     start end
[1,]     4   6
[2,]     5   7
[3,]     7   9
[4,]     6   8
[5,]     4   6
[6,]     6   8
[7,]     4   6
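A brief sketch of how these functions relate. `str_subset()` is another stringr function, not shown on the slide above, that returns the matching entries themselves rather than their positions.

```r
library(stringr)

days <- c("Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday", "Sunday")

# The matching entries themselves
t_days <- str_subset(days, "^T")

# str_which() gives the same indices as which() applied to str_detect()
same_idx <- identical(str_which(days, "^T"),
                      which(str_detect(days, "^T")))

t_days; same_idx
```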

`str_extract()`

Extract just the part of the string matching the specified regex instead of the entire entry

name_phone <- c("Moly: 250-999-8878", 
       "Ali: 416-908-2044", 
       "Eli: 204-192-9829", 
       "May: 250-209-7047")
str_extract(name_phone,  "\\w+")

[1] "Moly" "Ali"  "Eli"  "May"


Extracts the first word from each string in the given vector
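The same vector can be used to pull out the phone numbers instead of the names. This is one possible pattern, assuming every number has the form ddd-ddd-dddd.

```r
library(stringr)

name_phone <- c("Moly: 250-999-8878",
                "Ali: 416-908-2044",
                "Eli: 204-192-9829",
                "May: 250-209-7047")

# Three digits, a dash, three digits, a dash, four digits
phones <- str_extract(name_phone, "\\d{3}-\\d{3}-\\d{4}")

# Only the first match is returned, so this gives the area code
area_codes <- str_extract(name_phone, "\\d{3}")

phones; area_codes
```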

`str_split()`

splits a string into a list or matrix of pieces based on a supplied pattern

str_split(c("a_3", "d_4"), pattern = "_") # returns a list

[[1]]
[1] "a" "3"

[[2]]
[1] "d" "4"

str_split(c("a_3", "d_4"), pattern = "_", 
          simplify = TRUE) # returns a matrix

     [,1] [,2]
[1,] "a"  "3" 
[2,] "d"  "4"
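The split pattern is itself a regex, and the `n` argument caps the number of pieces. A small sketch on invented inputs:

```r
library(stringr)

# Split on a comma plus any surrounding white space
messy <- "a, b,c ,  d"
pieces <- str_split(messy, "\\s*,\\s*")[[1]]

# n = 2 splits only at the first underscore
first_split <- str_split("a_b_c", "_", n = 2)[[1]]

pieces; first_split
```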

`str_replace()`

Replaces the first instance of the detected pattern with a specified string.

gss_cat %>% 
  names() %>% 
  str_replace(pattern = "^.{3}", # the first 3 characters of the string
              replacement = "X_")

[1] "X_r"    "X_ital" "X_"     "X_e"    "X_come" "X_tyid" "X_ig"   "X_om"  
[9] "X_ours"
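To see that only the first match is replaced, compare `str_replace()` with `str_replace_all()` on a made-up string:

```r
library(stringr)

# Only the first "a" is replaced
first_only <- str_replace("banana", "a", "o")

# Every "a" is replaced
every_match <- str_replace_all("banana", "a", "o")

first_only; every_match
```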

`str_replace_all()`

Replaces every instance of the detected pattern with a specified string.

murders

# A tibble: 51 × 4
   state                population total murder_rate
   <chr>                <chr>      <chr>       <dbl>
 1 Alabama              4,853,875  348           7.2
 2 Alaska               737,709    59            8  
 3 Arizona              6,817,565  309           4.5
 4 Arkansas             2,977,853  181           6.1
 5 California           38,993,940 1,861         4.8
 6 Colorado             5,448,819  176           3.2
 7 Connecticut          3,584,730  117           3.3
 8 Delaware             944,076    63            6.7
 9 District of Columbia 670,377    162          24.2
10 Florida              20,244,914 1,041         5.1
# ℹ 41 more rows

`str_replace_all()`

murders %>% 
  mutate(population = str_replace_all(population, ",", ""),
         total = str_replace_all(total, ",", ""))

# A tibble: 51 × 4
   state                population total murder_rate
   <chr>                <chr>      <chr>       <dbl>
 1 Alabama              4853875    348           7.2
 2 Alaska               737709     59            8  
 3 Arizona              6817565    309           4.5
 4 Arkansas             2977853    181           6.1
 5 California           38993940   1861          4.8
 6 Colorado             5448819    176           3.2
 7 Connecticut          3584730    117           3.3
 8 Delaware             944076     63            6.7
 9 District of Columbia 670377     162          24.2
10 Florida              20244914   1041          5.1
# ℹ 41 more rows

`str_replace_all()`

murders %>% 
  mutate(population = str_replace_all(population, ",", ""),
         total = str_replace_all(total, ",", "")) %>% 
  mutate(across(c(population, total), as.double)) # across() supersedes mutate_at()

# A tibble: 51 × 4
   state                population total murder_rate
   <chr>                     <dbl> <dbl>       <dbl>
 1 Alabama                 4853875   348         7.2
 2 Alaska                   737709    59         8  
 3 Arizona                 6817565   309         4.5
 4 Arkansas                2977853   181         6.1
 5 California             38993940  1861         4.8
 6 Colorado                5448819   176         3.2
 7 Connecticut             3584730   117         3.3
 8 Delaware                 944076    63         6.7
 9 District of Columbia     670377   162        24.2
10 Florida                20244914  1041         5.1
# ℹ 41 more rows
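Why the commas have to go first can be sketched on a plain character vector. The three values are copied from the population column above; the vector name `population` is just for illustration.

```r
library(stringr)

population <- c("4,853,875", "737,709", "38,993,940")

# Direct conversion fails: strings with commas become NA (with a warning)
direct <- suppressWarnings(as.numeric(population))

# Remove every comma, then convert
pop_num <- as.numeric(str_replace_all(population, ",", ""))

direct; pop_num
```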

Group Activity 2

Please do the remaining problems in the class activity.
Submit to Gradescope on Moodle when done!
