Advanced String Manipulation

STAT 220

Bastola

Last time: Quantifiers and Special Characters

Preceding characters are matched …

  • * = 0 or more
  • ? = 0 or 1
  • + = 1 or more
  • {n} = exactly n times

Matching character types

  • \\d = digit
  • \\s = white space
  • \\w = word character (letter, digit, or underscore)
  • \\t = tab
  • \\n = newline
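As a quick refresher, the character types above can be combined with the quantifiers from last time. This is a minimal sketch; the vector `x` and the result names are made up for illustration and are not from the class data.

```r
library(stringr)

# Hypothetical example strings; the third one contains a real tab character
x <- c("a1", "ab12", "tab\there", "no digits")

# \\d+ : first run of one or more digits (NA when there is none)
digits <- str_extract(x, "\\d+")

# \\w+ : first run of one or more word characters
first_word <- str_extract(x, "\\w+")

# \\s : does the string contain any white space (space, tab, newline)?
has_space <- str_detect(x, "\\s")

digits
first_word
has_space
```

Note that `\\s` picks up the tab in the third string as well as the ordinary space in the fourth.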

stringr cheatsheet

More quantifiers

useful when you want to match a pattern a specific number of times

  • {n,} = n or more times

  • {0,m} = at most m times (write the 0 explicitly; stringr's ICU engine rejects {,m})

  • {n,m} = between n and m times
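The range quantifiers can be checked on runs of a single letter. This is a small sketch with a made-up vector `runs`; the anchors `^` and `$` are not covered on this slide and are assumed here only to force the pattern to match the whole string. "At most m" is written as `{0,2}`.

```r
library(stringr)

# Hypothetical strings with 1 to 4 repetitions of "a"
runs <- c("a", "aa", "aaa", "aaaa")

# {2,} : two or more a's
two_or_more <- str_detect(runs, "^a{2,}$")

# {0,2} : at most two a's
at_most_two <- str_detect(runs, "^a{0,2}$")

# {2,3} : between two and three a's
two_to_three <- str_detect(runs, "^a{2,3}$")

two_or_more
at_most_two
two_to_three
```

Without the anchors, `a{2,3}` would also be detected inside "aaaa", because a shorter run of a's inside the string is enough for a match.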

Alternatives

useful for matching patterns more flexibly

  • [abc] = one of a, b, or c

  • [e-z] = a letter from e to z

  • [^abc] = anything other than a, b, or c
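A short sketch of the three kinds of character class listed above. The vector `words` is invented for illustration, and the `^` outside the brackets is an anchor (start of string), which is an assumption beyond this slide; inside the brackets `^` means negation.

```r
library(stringr)

# Hypothetical example words
words <- c("bat", "cat", "eat", "zap")

# [abc] : does the word start with a, b, or c?
starts_abc <- str_detect(words, "^[abc]")

# [e-z] : does the word start with a letter from e to z?
starts_e_to_z <- str_detect(words, "^[e-z]")

# [^abc] : first character that is NOT a, b, or c
first_other <- str_extract(words, "[^abc]")

starts_abc
starts_e_to_z
first_other
```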

Duplicating Groups

Use backreferences (\\1, \\2, etc.) to match again the exact text that a group captured; groups are numbered left to right by position

Which numbers contain a three-digit run whose 1st and 3rd digits are the same?

phone_numbers <- c("515 111 2244", 
                   "310 549 6892", 
                   "474 234 7548")
str_view(phone_numbers, "(\\d)\\d\\1")
[1] │ <515> <111> 2244
[3] │ <474> 234 7548
Explanation

  • (\\d) matches a single digit (0 to 9) and captures it as group 1
  • \\d matches any single digit
  • \\1 matches the same digit that group 1 captured

str_view_all()

name_phone <- c("Moly Robins: 250-999-8878",  
                "Ali Duluth: 416-908-2044",  
                "Eli Mitchell: 204.192.9829", 
                "May Flowers: 250.209.7047")
str_view_all(name_phone,
             pattern = "([2-9][0-9]{2})[.-]([0-9]{3})[.-]([0-9]{4})")
[1] │ Moly Robins: <250-999-8878>
[2] │ Ali Duluth: <416-908-2044>
[3] │ Eli Mitchell: <204.192.9829>
[4] │ May Flowers: <250.209.7047>
Explanation

  • ([2-9][0-9]{2}) captures the 3-digit area code, whose first digit must be 2 to 9
  • [.-] matches the separator, either a period or a hyphen
  • ([0-9]{3}) captures the next 3 digits
  • ([0-9]{4}) captures the last 4 digits

str_replace_all()

str_replace_all(name_phone,
                pattern = "([2-9][0-9]{2})[.-]([0-9]{3})[.-]([0-9]{4})",
                replacement = "XXX-XXX-XXXX")
[1] "Moly Robins: XXX-XXX-XXXX"  "Ali Duluth: XXX-XXX-XXXX"  
[3] "Eli Mitchell: XXX-XXX-XXXX" "May Flowers: XXX-XXX-XXXX" 
str_replace_all(name_phone,
                pattern = "([2-9][0-9]{2})[.-]([0-9]{3})[.-]([0-9]{4})",
                replacement = "\\1-\\2-XXXX")
[1] "Moly Robins: 250-999-XXXX"  "Ali Duluth: 416-908-XXXX"  
[3] "Eli Mitchell: 204-192-XXXX" "May Flowers: 250-209-XXXX" 

str_extract_all()

name_phone <- c("Moly Robins: 250-999-8878", 
                "Ali Duluth: 416-908-2044", 
                "Eli Mitchell: 204-192-9829", 
                "May Flowers: 250-209-7047")
str_extract_all(name_phone, "[:alpha:]{2,}", simplify = TRUE) 
     [,1]   [,2]      
[1,] "Moly" "Robins"  
[2,] "Ali"  "Duluth"  
[3,] "Eli"  "Mitchell"
[4,] "May"  "Flowers" 

Repetition

aboutMe <- c("my SSN is 536-76-9423 and my age is 55")

Repetition using ?

str_view_all(aboutMe, "\\s\\d?") # space followed by 0 or 1 digit
## [1] │ my< >SSN< >is< 5>36-76-9423< >and< >my< >age< >is< 5>5

Repetition using +

str_view_all(aboutMe, "\\s\\d+")  # space followed by 1 or more digits
## [1] │ my SSN is< 536>-76-9423 and my age is< 55>

Repetition using *

str_view_all(aboutMe, "\\s\\d*")  # space followed by 0 or more digits
## [1] │ my< >SSN< >is< 536>-76-9423< >and< >my< >age< >is< 55>

Case conversion

str_to_lower("BEAUTY is in the EYE of the BEHOLDER")
[1] "beauty is in the eye of the beholder"
str_to_upper("one small step for man, one giant leap for mankind")
[1] "ONE SMALL STEP FOR MAN, ONE GIANT LEAP FOR MANKIND"
str_to_title("Aspire to inspire before we expire")
[1] "Aspire To Inspire Before We Expire"
str_to_sentence("everything you can imagine is real")
[1] "Everything you can imagine is real"

Alternates: OR

aboutMe <- c("My phone number is 236-748-4508.")
str_view(aboutMe,"8|6-")  
[1] │ My phone number is 23<6->74<8>-450<8>.
str_view_all(aboutMe,"(8|6)-")  
[1] │ My phone number is 23<6->74<8->4508.

More Duplicating Groups

foo <- c("addidas", "racecar")
# the same character twice in a row
str_view(foo, "(.)\\1") 
[1] │ a<dd>idas
# seven-character palindromes like `xyzwzyx`
str_view(foo, "(.)(.)(.).\\3\\2\\1")
[2] │ <racecar>
# three-character palindromes like `xyx`
str_view(foo, "(.)(.)\\1")
[1] │ ad<did>as
[2] │ ra<cec>ar

Finding patterns

# find the last word in a sentence, together with its period
str_view_all("it's a goat.", 
             "[a-z]+\\.")
[1] │ it's a <goat.>
# find a word containing an apostrophe, like `it's`
str_view_all("it's a goat.", 
             "[a-z]+\\'\\w")
[1] │ <it's> a goat.
# find a single-letter word preceded by white space
str_view_all("it's a goat.", "\\s\\w\\b")
[1] │ it's< a> goat.

 Group Activity 1


  • Please clone the ca12-yourusername repository from GitHub
  • Please do problem 1 in today's class activity


Lookaround operators




Positive lookahead example

Positive lookahead x(?=[y]) matches x only when it is immediately followed by y; y is not part of the match

Negative version is x(?![y]) (x when it is not immediately followed by y)

# 1+ letters before a period
str_view_all("it's a goat.","[a-z]+(?=[\\.])") 
[1] │ it's a <goat>.

Negative lookahead example

Negative lookahead x(?![y]) matches x only when it is not immediately followed by y, which includes x at the end of the string

Positive version is x(?=[y]) (x when it is immediately followed by y)

# t NOT followed by a period
str_view_all("it's a goat.", "t(?![\\.])")
[1] │ i<t>'s a goat.

Positive lookbehind example

Positive lookbehind (?<=[x])y matches y only when it is immediately preceded by x; x is not part of the match

Negative version is (?<![x])y (y when it is not immediately preceded by x)

# one or more t's, when preceded by a lowercase letter
str_view_all("that is a top cat.","(?<=[a-z])t+") 
[1] │ tha<t> is a top ca<t>.

Negative lookbehind example

Negative lookbehind (?<![x])y matches y only when it is not immediately preceded by x, which includes y at the start of the string

Positive version is (?<=[x])y (y when it is immediately preceded by x)

# t followed by one or more letters, when the t is not preceded by a letter
str_view_all("that is a top cat.","(?<![a-z])t[a-z]+") 
[1] │ <that> is a <top> cat.
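Lookarounds are handy with the extraction functions, because the surrounding context is checked but left out of the result. The following is a sketch under made-up inputs (`prices` and `weights` are hypothetical, not from the class activity).

```r
library(stringr)

# Hypothetical example strings
prices  <- c("apples cost $3", "pears cost $12 each", "free")
weights <- "10kg 20lb 30kg"

# Lookbehind: digits that come right after a dollar sign ($ itself is dropped)
dollars <- str_extract(prices, "(?<=\\$)\\d+")

# Lookahead: digits that come right before "kg" (the unit itself is dropped)
kilos <- str_extract_all(weights, "\\d+(?=kg)")[[1]]

dollars
kilos
```

The same results could be obtained with capture groups, but lookarounds keep the pattern and the extracted text identical, so no second step is needed to strip the `$` or `kg`.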

 Group Activity 2


  • Please do the remaining problems in the class activity.
  • Submit to Gradescope on Moodle when done!
