Advanced String Manipulation

STAT 220

Bastola

Last time: Quantifiers and Special Characters

Preceding characters are matched …

* = 0 or more
? = 0 or 1
+ = 1 or more
{n} = exactly n times

Matching character types

\\d = digit
\\s = white space
\\w = word character (letter, digit, or underscore)
\\t = tab
\\n = newline

stringr cheatsheet

More quantifiers

Useful when you want to match a pattern a specific number of times

{n,} = n or more times
{0,m} = at most m times
{n,m} = between n & m times
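A minimal sketch of the three range quantifiers, assuming the stringr package is loaded. The vector `x` and the result names are made up for illustration, and "at most m" is written with an explicit lower bound of 0.

```r
library(stringr)

# Hypothetical test strings (not from the slides)
x <- c("a", "aa", "aaa", "aaaa")

# {2,}: two or more a's (quantifiers are greedy, so the longest run is taken)
two_or_more <- str_extract(x, "a{2,}")
two_or_more
# [1] NA     "aa"   "aaa"  "aaaa"

# {0,2}: at most two a's
at_most_two <- str_extract(x, "a{0,2}")
at_most_two
# [1] "a"  "aa" "aa" "aa"

# {2,3}: between two and three a's
two_to_three <- str_extract(x, "a{2,3}")
two_to_three
# [1] NA    "aa"  "aaa" "aaa"
```

Because each quantifier takes as many characters as it is allowed, "aaaa" yields "aaa" under {2,3} rather than "aa".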

Alternatives

Useful for matching any one of several characters at a single position

[abc] = one of a, b, or c
[e-z] = a letter from e to z
[^abc] = anything other than a, b, or c
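The three bracket forms above can be tried on a small made-up vector. This is only a sketch: `words` and the result names are hypothetical.

```r
library(stringr)

# Hypothetical words (not from the slides)
words <- c("cat", "bat", "rat", "eat")

# [bc]at: "at" preceded by b or c
b_or_c <- str_subset(words, "[bc]at")
b_or_c
# [1] "cat" "bat"

# [^bc]at: "at" preceded by any character other than b or c
not_b_or_c <- str_subset(words, "[^bc]at")
not_b_or_c
# [1] "rat" "eat"

# [e-z]: the first letter in each word that falls between e and z
first_e_to_z <- str_extract(words, "[e-z]")
first_e_to_z
# [1] "t" "t" "r" "e"
```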

Duplicating Groups

Use escaped numbers (\\1, \\2, etc.) to refer back to a captured group by its position

Which numbers contain a three-digit run whose 1st and 3rd digits are the same?

phone_numbers <- c("515 111 2244", 
                   "310 549 6892", 
                   "474 234 7548")
str_view(phone_numbers, "(\\d)\\d\\1")

[1] │ <515> <111> 2244
[3] │ <474> 234 7548

Explanation

(\\d) matches a single digit (0 to 9) and captures it as group 1. \\d matches any single digit. \\1 matches the same digit that group 1 captured.
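Since the pattern can match anywhere in the string, it may help to compare it with an anchored version that only checks the first three digits. The sketch below uses a made-up vector `numbers` and the `^` anchor (start of string), which is an addition for illustration.

```r
library(stringr)

# Hypothetical phone numbers (not the ones on the slide)
numbers <- c("515 111 2244", "310 549 6892", "312 454 7548")

# Unanchored: TRUE if any three-digit run has equal 1st and 3rd digits
anywhere <- str_detect(numbers, "(\\d)\\d\\1")
anywhere
# [1]  TRUE FALSE  TRUE

# Anchored with ^: only the first three digits are tested
at_start <- str_detect(numbers, "^(\\d)\\d\\1")
at_start
# [1]  TRUE FALSE FALSE
```

The third number matches only through its middle block, 454, so anchoring changes its result.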

`str_view_all()`

name_phone <- c("Moly Robins: 250-999-8878",  
                "Ali Duluth: 416-908-2044",  
                "Eli Mitchell: 204.192.9829", 
                "May Flowers: 250.209.7047")

str_view_all(name_phone,
             pattern = "([2-9][0-9]{2})[.-]([0-9]{3})[.-]([0-9]{4})")

[1] │ Moly Robins: <250-999-8878>
[2] │ Ali Duluth: <416-908-2044>
[3] │ Eli Mitchell: <204.192.9829>
[4] │ May Flowers: <250.209.7047>

Explanation

([2-9][0-9]{2}) captures the 3-digit area code (first digit 2 to 9), [.-] matches a period or hyphen separator, ([0-9]{3}) captures the next 3 digits, and ([0-9]{4}) captures the last 4 digits
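To see what each group captures, one option is `str_match()`, which returns a character matrix with the full match in column 1 and one further column per group. This is a sketch on two of the strings above; the names `pattern` and `parts` are only illustrative.

```r
library(stringr)

name_phone <- c("Moly Robins: 250-999-8878",
                "Eli Mitchell: 204.192.9829")

pattern <- "([2-9][0-9]{2})[.-]([0-9]{3})[.-]([0-9]{4})"

# Column 1 = whole match, columns 2 to 4 = the three capture groups
parts <- str_match(name_phone, pattern)

area_code <- parts[, 2]
area_code
# [1] "250" "204"

last_four <- parts[, 4]
last_four
# [1] "8878" "9829"
```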

`str_replace_all()`

str_replace_all(name_phone,
                pattern = "([2-9][0-9]{2})[.-]([0-9]{3})[.-]([0-9]{4})",
                replacement = "XXX-XXX-XXXX")

[1] "Moly Robins: XXX-XXX-XXXX"  "Ali Duluth: XXX-XXX-XXXX"  
[3] "Eli Mitchell: XXX-XXX-XXXX" "May Flowers: XXX-XXX-XXXX"

str_replace_all(name_phone,
                pattern = "([2-9][0-9]{2})[.-]([0-9]{3})[.-]([0-9]{4})",
                replacement = "\\1-\\2-XXXX")

[1] "Moly Robins: 250-999-XXXX"  "Ali Duluth: 416-908-XXXX"  
[3] "Eli Mitchell: 204-192-XXXX" "May Flowers: 250-209-XXXX"
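Group references in the replacement can also reorder text, not just keep it. As a sketch, assuming each string starts with a first name and a last name separated by one space, the two names can be swapped:

```r
library(stringr)

name_phone <- c("Moly Robins: 250-999-8878",
                "Ali Duluth: 416-908-2044")

# Group 1 = first name, group 2 = last name; the replacement puts them back in reverse order
swapped <- str_replace(name_phone, "(\\w+) (\\w+)", "\\2, \\1")
swapped
# [1] "Robins, Moly: 250-999-8878" "Duluth, Ali: 416-908-2044"
```

`str_replace()` changes only the first match in each string, which is why the phone number is left alone.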

`str_extract_all()`

name_phone <- c("Moly Robins: 250-999-8878", 
                "Ali Duluth: 416-908-2044", 
                "Eli Mitchell: 204-192-9829", 
                "May Flowers: 250-209-7047")

str_extract_all(name_phone, "[:alpha:]{2,}", simplify = TRUE)

     [,1]   [,2]      
[1,] "Moly" "Robins"  
[2,] "Ali"  "Duluth"  
[3,] "Eli"  "Mitchell"
[4,] "May"  "Flowers"

Repetition

aboutMe <- c("my SSN is 536-76-9423 and my age is 55")

Repetition using ?

str_view_all(aboutMe, "\\s\\d?") # space followed by 0 or 1 digit
## [1] │ my< >SSN< >is< 5>36-76-9423< >and< >my< >age< >is< 5>5

Repetition using +

str_view_all(aboutMe, "\\s\\d+")  # space followed by 1 or more digits
## [1] │ my SSN is< 536>-76-9423 and my age is< 55>

Repetition using *

str_view_all(aboutMe, "\\s\\d*")  # space followed by 0 or more digits
## [1] │ my< >SSN< >is< 536>-76-9423< >and< >my< >age< >is< 55>
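The same quantifiers can be used to pull values out instead of viewing them. A short sketch on the same sentence, with result names chosen only for illustration:

```r
library(stringr)

aboutMe <- "my SSN is 536-76-9423 and my age is 55"

# {n} pins each block of digits to an exact length
ssn <- str_extract(aboutMe, "\\d{3}-\\d{2}-\\d{4}")
ssn
# [1] "536-76-9423"

# + grabs every maximal run of digits
digit_runs <- str_extract_all(aboutMe, "\\d+")[[1]]
digit_runs
# [1] "536"  "76"   "9423" "55"
```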

Case conversion

str_to_lower("BEAUTY is in the EYE of the BEHOLDER")

[1] "beauty is in the eye of the beholder"

str_to_upper("one small step for man, one giant leap for mankind")

[1] "ONE SMALL STEP FOR MAN, ONE GIANT LEAP FOR MANKIND"

str_to_title("Aspire to inspire before we expire")

[1] "Aspire To Inspire Before We Expire"

str_to_sentence("everything you can imagine is real")

[1] "Everything you can imagine is real"

Alternates: OR

aboutMe <- c("My phone number is 236-748-4508.")

str_view(aboutMe,"8|6-")

[1] │ My phone number is 23<6->74<8>-450<8>.

str_view_all(aboutMe,"(8|6)-")

[1] │ My phone number is 23<6->74<8->4508.
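The difference between the two patterns comes from how far `|` reaches: without parentheses it splits the whole pattern in two. A minimal sketch with a made-up pair of words:

```r
library(stringr)

# Hypothetical spellings (not from the slides)
colors <- c("gray", "grey")

# Parentheses limit the alternation to the single letter a or e
grouped <- str_extract(colors, "gr(a|e)y")
grouped
# [1] "gray" "grey"

# Without parentheses the alternatives are the whole halves: "gra" or "ey"
ungrouped <- str_extract(colors, "gra|ey")
ungrouped
# [1] "gra" "ey"
```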

More Duplicating Groups

foo <- c("addidas", "racecar")

# the same character twice in a row
str_view(foo, "(.)\\1")

[1] │ a<dd>idas

# seven-character palindromes like `xyzwzyx` (any middle character)
str_view(foo, "(.)(.)(.).\\3\\2\\1")

[2] │ <racecar>

# three characters where the 1st and 3rd are the same
str_view(foo, "(.)(.)\\1")

[1] │ ad<did>as
[2] │ ra<cec>ar

Finding patterns

# find the last word in a sentence
str_view_all("it's a goat.", 
             "[a-z]+\\.")

[1] │ it's a <goat.>

# find a word with `'s`
str_view_all("it's a goat.", 
             "[a-z]+\\'\\w")

[1] │ <it's> a goat.

# find a single-letter word preceded by a space
str_view_all("it's a goat.", "\\s\\w\\b")

[1] │ it's< a> goat.

Please clone the ca12-yourusername repository from GitHub
Please do problem 1 in today's class activity

`Lookaround` operators

Positive Look ahead example

Positive look ahead operator x(?=[y]) will find x when it comes before y

Negative version is x(?![y]) (x when it is not followed by y)

# 1+ letters before a period
str_view_all("it's a goat.","[a-z]+(?=[\\.])")

[1] │ it's a <goat>.
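A look ahead is handy when the following text should decide the match but not be part of it. As a sketch with made-up measurements, the number is extracted only when " kg" comes next, and the unit itself is left out of the result:

```r
library(stringr)

# Hypothetical measurements (not from the slides)
amounts <- c("12 kg", "45 USD", "7 kg")

# \\d+ is matched only if " kg" follows; " kg" is checked but not consumed
kilograms <- str_extract(amounts, "\\d+(?= kg)")
kilograms
# [1] "12" NA   "7"
```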

Negative Look ahead example

Negative look ahead operator x(?![y]) will find x when it is not followed by y

# t NOT followed by a period
str_view_all("it's a goat.", "t(?![\\.])")

[1] │ i<t>'s a goat.
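Counting matches shows how the positive and negative forms split the occurrences between them. The sentence below is made up for this sketch:

```r
library(stringr)

# Hypothetical sentence (not from the slides)
sentence <- "that thin cat sat"

# t followed by h: the first t of "that" and the t of "thin"
t_before_h <- str_count(sentence, "t(?=h)")
t_before_h
# [1] 2

# t NOT followed by h: the last t of "that", plus "cat" and "sat"
t_not_before_h <- str_count(sentence, "t(?!h)")
t_not_before_h
# [1] 3
```

The final t of "sat" counts for the negative form even though nothing follows it, because the check only requires that an h does not come next.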

Positive Look behind example

Positive look behind operator (?<=[x])y will find y when it follows x

Negative version is (?<![x])y (y when it does not follow x)

# one or more t, if preceded by a letter
str_view_all("that is a top cat.","(?<=[a-z])t+")

[1] │ tha<t> is a top ca<t>.
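A look behind works the same way on the left-hand side. In this sketch, with made-up strings, digits are extracted only when a dollar sign comes immediately before them, and the dollar sign stays out of the result:

```r
library(stringr)

# Hypothetical strings (not from the slides)
items <- c("cost: $25", "qty: 25", "fee: $7")

# (?<=\\$) requires a literal $ just before the digits without including it
dollars <- str_extract(items, "(?<=\\$)\\d+")
dollars
# [1] "25" NA   "7"
```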

Negative Look behind example