Basic String Manipulation

STAT 220

Bastola

Let’s Define Strings

  • A string is any sequence of characters
  • Define a string by surrounding text with either single quotes or double quotes.
s <- "Hello!"    # double quotes define a string
s <- 'Hello!'    # single quotes define a string  

The cat() and writeLines() functions display the actual contents of a string, without the surrounding quotes and escape backslashes that print() shows.

cat(s)
Hello!
writeLines(s)
Hello!
s <- `Hello`    # backquotes do not define a string
s <- "10""    # error - unclosed quotes
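A minimal sketch of the difference between printing a string and writing its contents. It uses only base R, and capture.output() is there just to record what each call would show on the console.

```r
s <- "Hello!"

# print() shows R's representation: an index prefix and surrounding quotes
printed <- capture.output(print(s))

# writeLines() shows only the characters stored in the string
written <- capture.output(writeLines(s))

printed   # [1] "[1] \"Hello!\""
written   # [1] "Hello!"
```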

String Parsing

Definition: pulling apart some text or string to do something with it

The most common tasks in string processing include:

  • extracting numbers from strings, e.g. “12%”
  • removing unwanted characters from text, e.g. “New Jersey_*”
  • finding and replacing characters, e.g. “2,150”
  • extracting specific parts of strings, e.g. “Learning #datascience is fun!”
  • splitting strings into multiple values, e.g. “123 Main St, Springfield, MA, 01101”
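As a rough sketch, each task above maps to one stringr call. The patterns are illustrative choices for these example strings, not general-purpose solutions.

```r
library(stringr)

# Extract the number from a percentage
pct <- str_extract("12%", "\\d+")                              # "12"

# Remove unwanted characters (underscore and asterisk)
state <- str_remove_all("New Jersey_*", "[_*]")                # "New Jersey"

# Find and replace: drop the thousands separator
count <- str_replace_all("2,150", ",", "")                     # "2150"

# Extract a specific part: the hashtag
tag <- str_extract("Learning #datascience is fun!", "#\\w+")   # "#datascience"

# Split into multiple values at each comma followed by a space
parts <- str_split("123 Main St, Springfield, MA, 01101", ", ")[[1]]
parts
```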

Regular expressions: Regex

Regular expressions are a language for expressing patterns in strings

  • In a regex, some characters (e.g. . * ^ $) have a special meaning instead of matching themselves
  • Base R supports regex (e.g. grepl(), gsub()), but we will use the stringr package for a more consistent interface

stringr package


  • provides functions for detecting, locating, extracting, and replacing elements of strings
  • function names begin with str_ and take the string as the first argument
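A small sketch of the string-first convention, with the base R counterpart shown for comparison. The fruits vector is made up for illustration.

```r
library(stringr)

fruits <- c("banana", "cherry", "mango")

# stringr: string first, pattern second
has_an_stringr <- str_detect(fruits, "an")

# base R: pattern first, string second
has_an_base <- grepl("an", fruits)

has_an_stringr   # TRUE FALSE TRUE
```

Putting the string first is what lets stringr functions sit naturally at the end of a pipe.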


stringr cheatsheet

Special characters

The “escape” backslash \ is used to escape the special use of certain characters

writeLines("\"")
# "
writeLines("\\")
# \
writeLines("Math\\Stats")
# Math\Stats

To include both single and double quotes in a string, escape the quote that matches the outer delimiter with \


s <- '5\'10"'    # outer single quote
writeLines(s)
# 5'10"
s <- "5'10\""    # outer double quote
writeLines(s)
# 5'10"
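The same escaping rule applies when a pattern has to match a special character literally. The sketch below assumes stringr is loaded and uses a made-up vector x: inside an R string the backslash itself must be escaped, so a literal dot is written "\\.".

```r
library(stringr)

x <- c("a.b", "axb")

# An unescaped dot is a wildcard, so both entries match
any_char <- str_detect(x, "a.b")

# An escaped dot matches only a literal "."
literal_dot <- str_detect(x, "a\\.b")

# What the regex engine actually receives
writeLines("a\\.b")
# a\.b
```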

Combining strings

library(tidyverse)    # attaches stringr, dplyr, and forcats (for gss_cat)
str_c("iron", "wine")
[1] "ironwine"
str_flatten(c("iron", "wine"), 
            collapse = " and ")
[1] "iron and wine"
a <- c("a", "b", "c")
b <- c("A", "B", "C")
str_c(a, b) 
[1] "aA" "bB" "cC"

Combining strings

building <- "CMC"
room <- "102"
begin_time <- "11:10 a.m."
end_time <- "12:40 p.m."
days <- "MWF"
class <- "STAT 220"
str_c(class, "meets from", 
      begin_time, "to", end_time, 
      days, "in", building, room, sep = " ")
[1] "STAT 220 meets from 11:10 a.m. to 12:40 p.m. MWF in CMC 102"
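A short sketch of three str_c() behaviours that are easy to overlook: length-one inputs are recycled, collapse joins a vector into one string, and a missing value makes the combined entry missing. The vectors are illustrative only.

```r
library(stringr)

# A length-one input is recycled against the longer vector
ids <- str_c(c("x", "y", "z"), "1", sep = "_")      # "x_1" "y_1" "z_1"

# collapse joins all entries into a single string
joined <- str_c(c("x", "y", "z"), collapse = "+")   # "x+y+z"

# NA is contagious: the second entry becomes NA
rooms <- str_c(c("CMC", NA), "102", sep = " ")      # "CMC 102" NA
```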

str_length()

tells you how many characters are in each entry of a character vector

gss_cat %>% names()
[1] "year"    "marital" "age"     "race"    "rincome" "partyid" "relig"  
[8] "denom"   "tvhours"
# length of each column name
gss_cat %>% names() %>% str_length()
[1] 4 7 3 4 7 7 5 5 7
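For contrast, a sketch of how str_length() differs from length() on a made-up vector: the first counts characters per entry, the second counts entries.

```r
library(stringr)

x <- c("apple", "", NA, "STAT 220")

n_chars <- str_length(x)   # 5 0 NA 8 (spaces count; NA stays NA)
n_entries <- length(x)     # 4
```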

str_count()

counts the number of non-overlapping matches of a pattern in each entry of a character vector

gss_cat %>% names()
[1] "year"    "marital" "age"     "race"    "rincome" "partyid" "relig"  
[8] "denom"   "tvhours"
# count number of vowels in each column name
vowels_pattern <- "[aeiouAEIOU]"
gss_cat %>% names() %>% str_count(vowels_pattern)
[1] 2 3 2 2 3 2 2 2 2
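A brief sketch of what "non-overlapping" means in practice: once part of the string is used in a match, it cannot be reused in the next one.

```r
library(stringr)

# "aaaa" contains "aa" at positions 1-2 and 3-4; the overlap at 2-3 is skipped
n_aa <- str_count("aaaa", "aa")       # 2

# "banana" contains "ana" at 2-4; the overlapping "ana" at 4-6 is not counted
n_ana <- str_count("banana", "ana")   # 1
```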

str_glue()

allows one to interpolate strings and values that have been assigned to names in R

y <- lubridate::now() # current date and time
str_glue("today is {y}.")
today is 2024-05-14 22:38:04.584507.
name <- c("Alex", "Mia")
dob <- c(lubridate::ymd("1992/12/24"), lubridate::ymd("1994/02/14"))
str_glue("Hi, my name is {name} and I was born in {dob}.")
Hi, my name is Alex and I was born in 1992-12-24.
Hi, my name is Mia and I was born in 1994-02-14.
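The braces can hold any R expression, not just a name. A minimal sketch, with x and room as made-up values; room is passed as a named argument rather than defined in the workspace.

```r
library(stringr)

x <- 3

# The expression inside {} is evaluated before interpolation
msg <- str_glue("{x} squared is {x^2}.")

# Values can also be supplied as named arguments
where <- str_glue("Class meets in room {room}.", room = "102")

msg     # 3 squared is 9.
where   # Class meets in room 102.
```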

str_sub()

Extract and replace substrings from a character vector

phrase <- "cellar door"
str_sub(phrase, start = 1, end = 6)
[1] "cellar"
str_sub(phrase, start = c(1,8), end = c(6,11))
[1] "cellar" "door"  
str_sub(phrase, start = c(1,8), end = c(6,11)) %>% 
  str_flatten(collapse = " ")
[1] "cellar door"
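The "replace" part works through assignment, and negative positions count from the end of the string. A small sketch using the same phrase:

```r
library(stringr)

phrase <- "cellar door"

# Negative indices count backwards from the last character
last_word <- str_sub(phrase, start = -4, end = -1)   # "door"

# Assigning to str_sub() replaces the selected substring in place
str_sub(phrase, start = 1, end = 6) <- "garage"
phrase                                               # "garage door"
```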

Group Activity 1


  • Please clone the ca11-yourusername repository from GitHub
  • Please complete problem 1 in today’s class activity.


More Special Characters

  • The | symbol inside a regex means "or"
  • [abe] matches one of a, b, or e
  • Use \\n to match a newline character
  • Use \\s to match whitespace characters (spaces, tabs, and newlines)
  • Use \\w to match word characters (letters, digits, and the underscore)
    • [[:alnum:]] is similar but matches letters and digits only
  • Use \\d to match digits
    • can also use [[:digit:]]
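A sketch that exercises each of the patterns above on small made-up strings. Note the difference between \\w and [[:alnum:]] on the underscore.

```r
library(stringr)

pets <- c("cat", "dog", "bird")

is_cat_or_dog <- str_detect(pets, "cat|dog")            # TRUE TRUE FALSE
has_abe <- str_detect(pets, "[abe]")                    # TRUE FALSE TRUE

# \\s covers spaces, tabs, and newlines
no_space <- str_replace_all("a b\tc\nd", "\\s", "")     # "abcd"

# \\w counts the underscore; [[:alnum:]] does not
n_word <- str_count("a_b 1", "\\w")                     # 4
n_alnum <- str_count("a_b 1", "[[:alnum:]]")            # 3

# \\d+ grabs a run of digits
digits <- str_extract("room 102", "\\d+")               # "102"
```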


More Special Characters

  • ^ = start of a string
  • $ = end of a string
  • . = any character except a newline
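A quick sketch of the anchors and the wildcard on a made-up vector. Combining ^ and $ forces the pattern to match the whole string.

```r
library(stringr)

x <- c("apple", "pineapple", "apple pie")

starts <- str_detect(x, "^apple")    # TRUE FALSE TRUE
ends <- str_detect(x, "apple$")      # TRUE TRUE FALSE
whole <- str_detect(x, "^apple$")    # TRUE FALSE FALSE

# . stands in for any single character; only the first match is extracted
wild <- str_extract("cat cot cut", "c.t")   # "cat"
```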

Quantifiers

  • * = matches the preceding character zero or more times
  • + = matches the preceding character one or more times
  • ? = matches the preceding character at most once (i.e. optionally)
  • {n} = matches the preceding character exactly n times
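To compare the quantifiers side by side, the sketch below applies each one to the letter a in a small made-up vector. The anchors ^ and $ make sure the whole string has to fit the pattern.

```r
library(stringr)

x <- c("ct", "cat", "caat", "caaat")

star <- str_detect(x, "^ca*t$")      # zero or more a's: TRUE TRUE TRUE TRUE
plus <- str_detect(x, "^ca+t$")      # one or more a's:  FALSE TRUE TRUE TRUE
opt <- str_detect(x, "^ca?t$")       # at most one a:    TRUE TRUE FALSE FALSE
two <- str_detect(x, "^ca{2}t$")     # exactly two a's:  FALSE FALSE TRUE FALSE
```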


Finding strings

days <- c("Monday", "Tuesday", "Wednesday", 
              "Thursday", "Friday", "Saturday", "Sunday")
str_detect(days, "^[Ss]un.*") # returns a logical vector
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
days %>% 
  str_which("^T")  # indices of matching entries  
[1] 2 4
days %>%
  str_locate("day") # start and end positions of the first match, returned as a matrix
     start end
[1,]     4   6
[2,]     5   7
[3,]     7   9
[4,]     6   8
[5,]     4   6
[6,]     6   8
[7,]     4   6
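Two related helpers, sketched on the same days vector: str_subset() returns the matching entries themselves, and summing the logical vector from str_detect() counts how many entries match.

```r
library(stringr)

days <- c("Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday", "Sunday")

# The matching entries, rather than their indices
t_days <- str_subset(days, "^T")        # "Tuesday" "Thursday"

# TRUE counts as 1, so sum() gives the number of matches
n_s <- sum(str_detect(days, "^S"))      # 2
```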

str_extract()

Extract just the part of the string matching the specified regex instead of the entire entry

name_phone <- c("Moly: 250-999-8878", 
       "Ali: 416-908-2044", 
       "Eli: 204-192-9829", 
       "May: 250-209-7047")
str_extract(name_phone,  "\\w+") 
[1] "Moly" "Ali"  "Eli"  "May" 
Extracts the first word from each string in the given vector.
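The same idea extends to the phone numbers. This is a sketch that assumes every number follows the ddd-ddd-dddd layout seen in name_phone; str_extract_all() is used when more than the first match is wanted.

```r
library(stringr)

name_phone <- c("Moly: 250-999-8878",
                "Ali: 416-908-2044")

# The full phone number: three digits, three digits, four digits
phones <- str_extract(name_phone, "\\d{3}-\\d{3}-\\d{4}")

# str_extract() stops at the first match, here the area code
area <- str_extract(name_phone, "\\d{3}")                   # "250" "416"

# str_extract_all() returns every match, as a list with one element per string
groups <- str_extract_all(name_phone[1], "\\d+")[[1]]       # "250" "999" "8878"
```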

str_split()

splits a string into a list or matrix of pieces based on a supplied pattern

str_split(c("a_3", "d_4"), pattern = "_") # returns a list
[[1]]
[1] "a" "3"

[[2]]
[1] "d" "4"
str_split(c("a_3", "d_4"), pattern = "_", 
          simplify = TRUE) # returns a matrix
     [,1] [,2]
[1,] "a"  "3" 
[2,] "d"  "4" 
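Two follow-ups, sketched on made-up inputs: the n argument caps the number of pieces, and the list output can be reduced to one piece per string with vapply().

```r
library(stringr)

# n = 2 splits at the first "_" only; the rest stays together
two <- str_split("a_3_x", pattern = "_", n = 2)[[1]]        # "a" "3_x"

# Keep only the first piece of each split
pieces <- str_split(c("a_3", "d_4"), pattern = "_")
first <- vapply(pieces, function(p) p[1], character(1))     # "a" "d"
```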

str_replace()

Replaces the first instance of the detected pattern with a specified string.

gss_cat %>% 
  names() %>% 
  str_replace(pattern = "^.{3}", # the first 3 characters
              replacement = "X_")
[1] "X_r"    "X_ital" "X_"     "X_e"    "X_come" "X_tyid" "X_ig"   "X_om"  
[9] "X_ours"
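A short sketch contrasting first-match and all-match replacement, plus a replacement that reuses captured groups. The strings are illustrative; \\1 and \\2 refer to the first and second parenthesised groups of the pattern.

```r
library(stringr)

first_only <- str_replace("banana", "a", "A")        # "bAnana"
every <- str_replace_all("banana", "a", "A")         # "bAnAnA"

# Swap the two captured groups around
swapped <- str_replace("Moly: 250", "(\\w+): (\\d+)", "\\2 \\1")   # "250 Moly"
```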

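str_replace_all()

Replaces every instance of the detected pattern with a specified string. In murders, population and total are stored as text because of their comma separators:

murders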
# A tibble: 51 × 4
   state                population total murder_rate
   <chr>                <chr>      <chr>       <dbl>
 1 Alabama              4,853,875  348           7.2
 2 Alaska               737,709    59            8  
 3 Arizona              6,817,565  309           4.5
 4 Arkansas             2,977,853  181           6.1
 5 California           38,993,940 1,861         4.8
 6 Colorado             5,448,819  176           3.2
 7 Connecticut          3,584,730  117           3.3
 8 Delaware             944,076    63            6.7
 9 District of Columbia 670,377    162          24.2
10 Florida              20,244,914 1,041         5.1
# ℹ 41 more rows

str_replace_all()

murders %>% 
  mutate(population = str_replace_all(population, ",", ""),
         total = str_replace_all(total, ",", "")) 
# A tibble: 51 × 4
   state                population total murder_rate
   <chr>                <chr>      <chr>       <dbl>
 1 Alabama              4853875    348           7.2
 2 Alaska               737709     59            8  
 3 Arizona              6817565    309           4.5
 4 Arkansas             2977853    181           6.1
 5 California           38993940   1861          4.8
 6 Colorado             5448819    176           3.2
 7 Connecticut          3584730    117           3.3
 8 Delaware             944076     63            6.7
 9 District of Columbia 670377     162          24.2
10 Florida              20244914   1041          5.1
# ℹ 41 more rows

str_replace_all()

murders %>% 
  mutate(population = str_replace_all(population, ",", ""),
         total = str_replace_all(total, ",", "")) %>% 
  mutate(across(2:3, as.double)) 
# A tibble: 51 × 4
   state                population total murder_rate
   <chr>                     <dbl> <dbl>       <dbl>
 1 Alabama                 4853875   348         7.2
 2 Alaska                   737709    59         8  
 3 Arizona                 6817565   309         4.5
 4 Arkansas                2977853   181         6.1
 5 California             38993940  1861         4.8
 6 Colorado                5448819   176         3.2
 7 Connecticut             3584730   117         3.3
 8 Delaware                 944076    63         6.7
 9 District of Columbia     670377   162        24.2
10 Florida                20244914  1041         5.1
# ℹ 41 more rows
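The clean-then-convert step can be sketched on a plain character vector, here three values copied from the population column. As an alternative, and assuming the readr package is installed, readr::parse_number() drops the grouping commas and converts in one call.

```r
library(stringr)

population <- c("4,853,875", "737,709", "38,993,940")

# Two steps: strip the commas, then convert to numeric
cleaned <- as.double(str_replace_all(population, ",", ""))

# One step (assumes readr is available)
parsed <- readr::parse_number(population)

cleaned   # 4853875 737709 38993940
```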

Group Activity 2


  • Please do the remaining problems in the class activity.
  • Submit to Gradescope on Moodle when done!
