# load the necessary libraries
library(stringr)
library(dplyr)
library(readr)
Class Activity 12
In this tutorial, we will learn about string manipulation using regular expressions and the stringr package in R. We will work through examples and use cases that illustrate the main concepts and functions for manipulating strings.
Group Activity 1
x <- "My SSN is 593-29-9502 and my age is 55"
y <- "My phone number is 612-643-1539"
z <- "My old SSN number is 39532 9423."
out <- str_flatten(c(x, y, z), collapse = ". ")
a. What characters in x will str_view_all(x, "-..-") find?
Answer: The pattern searches for a dash, followed by any two characters, followed by another dash. In x, it finds “-29-”, which is part of the SSN.
str_view_all(x, "-..-")
[1] │ My SSN is 593<-29->9502 and my age is 55
b. What pattern will str_view_all(x, "-\\d{2}-") find?
Answer: The pattern searches for a dash, followed by two digits, followed by another dash. In x, it finds the same “-29-” as in the previous example, which is part of the SSN.
str_view_all(x, "-\\d{2}-") # "-" then 2 digits then "-"
[1] │ My SSN is 593<-29->9502 and my age is 55
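Parts a and b give the same result on x only because the two characters between the dashes happen to be digits. The sketch below shows the difference on a made-up vector (ids is a hypothetical example, not part of the activity data) in which one string has letters between the dashes.

```r
library(stringr)

# Hypothetical test strings, not part of the activity data
ids <- c("593-29-9502", "593-ab-9502")

# "." matches any character, so both strings contain a match
dot_hits <- str_detect(ids, "-..-")
dot_hits
# [1] TRUE TRUE

# "\\d{2}" requires two digits, so only the first string matches
digit_hits <- str_detect(ids, "-\\d{2}-")
digit_hits
# [1]  TRUE FALSE
```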
c. What pattern will str_view_all(out, "\\s\\d{2}\\.") find?
Answer: The pattern searches for a whitespace character, followed by two digits, followed by a literal period. In out, it finds “ 55.”, the age at the end of the first sentence. The other numbers that end in a period, 1539 and 9423, have no whitespace immediately before their last two digits.
str_view_all(out, "\\s\\d{2}\\.") # whitespace, then 2 digits, then "."
[1] │ My SSN is 593-29-9502 and my age is< 55.> My phone number is 612-643-1539. My old SSN number is 39532 9423.
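The escaped period \\. matches exactly one literal period, whereas \\.* matches zero or more of them. A small sketch of the difference, using a hypothetical string s that is not part of the activity:

```r
library(stringr)

# Hypothetical string for illustration
s <- "I am 55. Call 1539."

# Two digits followed by zero or more periods: every pair of digits matches
loose <- str_extract_all(s, "\\d{2}\\.*")[[1]]
loose
# [1] "55." "15"  "39."

# Whitespace, two digits, then exactly one literal period
strict <- str_extract_all(s, "\\s\\d{2}\\.")[[1]]
strict
# [1] " 55."
```

The leading \\s and the required period are what restrict the second pattern to a two-digit number that ends a sentence.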
d. Use str_view_all to determine the correct regex pattern to identify all SSNs in out.
We can get the SSN with the usual format (###-##-####) with a regex that has 3, 2, and 4 digits separated by dashes. The leading [0-8] rules out numbers whose first digit is 9, which are not issued as SSNs.
str_view_all(out, "([0-8]\\d{2})-(\\d{2})-(\\d{4})")
[1] │ My SSN is <593-29-9502> and my age is 55. My phone number is 612-643-1539. My old SSN number is 39532 9423.
This misses the oddly formatted SSN in the third sentence. Rather than require a dash, we can specify the divider as [-\\s]?, which allows zero or one occurrence of either a dash or a whitespace character:
str_view_all(out, "([0-8]\\d{2})[-\\s]?(\\d{2})[-\\s]?(\\d{4})")
[1] │ My SSN is <593-29-9502> and my age is 55. My phone number is 612-643-1539. My old SSN number is <39532 9423>.
Answer: The first pattern finds SSNs in the standard format (###-##-####) by searching for 3 digits, a dash, 2 digits, another dash, and 4 digits. The second pattern makes each divider optional and accepts either a dash or a whitespace character, so it finds all SSNs in out, including the oddly formatted one in the third sentence.
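The parentheses in the pattern are capture groups, so the three digit blocks can be reused to put every SSN into one format. The sketch below is one possible approach, assuming the same out string as above; the names ssn_pattern, ssn_raw, and ssn_clean are introduced here for illustration only.

```r
library(stringr)

out <- "My SSN is 593-29-9502 and my age is 55. My phone number is 612-643-1539. My old SSN number is 39532 9423."

# Pattern from part d: three capture groups with optional dividers
ssn_pattern <- "([0-8]\\d{2})[-\\s]?(\\d{2})[-\\s]?(\\d{4})"

# Pull out the SSNs exactly as they are written in the text
ssn_raw <- str_extract_all(out, ssn_pattern)[[1]]
ssn_raw
# [1] "593-29-9502" "39532 9423"

# Rebuild each match from its capture groups as ###-##-####
ssn_clean <- str_replace(ssn_raw, ssn_pattern, "\\1-\\2-\\3")
ssn_clean
# [1] "593-29-9502" "395-32-9423"
```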
e. Write a regular expression to extract dates in the format YYYY-MM-DD from a given text.
date_pattern <- "\\d{4}-\\d{2}-\\d{2}"
text <- "The event will take place on 2023-07-20 and end on 2023-07-22."
str_extract_all(text, date_pattern)
[[1]]
[1] "2023-07-20" "2023-07-22"
Answer: The pattern searches for 4 digits, a dash, 2 digits, another dash, and 2 digits. In the given text, it finds the dates “2023-07-20” and “2023-07-22”.
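The pattern checks only the shape of a date, so it would also accept a string such as 2023-13-45. A somewhat stricter sketch is shown below; the sample text and the name strict_pattern are assumptions made for illustration, and the pattern still does not verify that a day exists in a given month (it would accept 2023-02-31).

```r
library(stringr)

# Hypothetical text containing one valid and one impossible date
dates_text <- "Valid: 2023-07-20. Invalid: 2023-13-45."

# Shape-only pattern from the activity
loose <- str_extract_all(dates_text, "\\d{4}-\\d{2}-\\d{2}")[[1]]
loose
# [1] "2023-07-20" "2023-13-45"

# Restrict the month to 01-12 and the day to 01-31
strict_pattern <- "\\b\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])\\b"
strict <- str_extract_all(dates_text, strict_pattern)[[1]]
strict
# [1] "2023-07-20"
```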
f. Write a regular expression to extract all words that start with a capital letter in a given text.
capital_pattern <- "\\b[A-Z][a-zA-Z]*\\b"
text <- "Alice and Bob went to the Market to buy some Groceries."
str_extract_all(text, capital_pattern)
[[1]]
[1] "Alice" "Bob" "Market" "Groceries"
Answer: The pattern searches for a word boundary, followed by an uppercase letter, then zero or more letters, and a closing word boundary. In the given text, it finds the words “Alice”, “Bob”, “Market”, and “Groceries”.
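The classes [A-Z] and [a-zA-Z] cover only unaccented ASCII letters, so capitalized words with accented letters are missed. One possible alternative, sketched below, uses the Unicode properties \\p{Lu} (uppercase letter) and \\p{L} (any letter); the sentence text_intl is a hypothetical example written with \u escapes.

```r
library(stringr)

# Hypothetical sentence; \u escapes keep the source ASCII-only
# ("Emile", "Zoe", "Arhus" with accented letters)
text_intl <- "\u00c9mile and Zo\u00eb visited \u00c5rhus."

# ASCII-only pattern from the activity: accented words are not matched
ascii_caps <- str_extract_all(text_intl, "\\b[A-Z][a-zA-Z]*\\b")[[1]]
length(ascii_caps)

# Unicode-aware variant: uppercase letter followed by any letters
unicode_caps <- str_extract_all(text_intl, "\\b\\p{Lu}\\p{L}*\\b")[[1]]
unicode_caps
```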
Group Activity 2
Consider the following string.
string1 <- "100 dollars 100 pesos"
a. Explain why the following matches the first 100 and not the second.
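Answer: The pattern looks for one or more digits followed by a space and “dollars”. The lookahead (?= dollars) checks for that text without including it in the match, and only the first 100 is followed by it.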
str_view(string1, "\\d+(?= dollars)")
[1] │ <100> dollars 100 pesos
b. Explain why the following matches the second 100 and not the first.
Answer: The pattern looks for one or more digits that are not followed by another digit or by a space and “dollars”. Only the second 100, which is followed by “ pesos”, passes the negative lookahead.
str_view(string1, "\\d+(?!\\d| dollars)")
[1] │ 100 dollars <100> pesos
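The \\d alternative inside the negative lookahead matters because of backtracking. Without it, the regex engine can give back the last digit of the first 100 and match “10”, which is followed by “0” rather than by “ dollars”. A short sketch comparing the two patterns (the names without_digit and with_digit are introduced here for illustration):

```r
library(stringr)

string1 <- "100 dollars 100 pesos"

# Without \\d in the lookahead, the engine backtracks to "10"
without_digit <- str_extract_all(string1, "\\d+(?! dollars)")[[1]]
without_digit
# [1] "10"  "100"

# With \\d in the lookahead, a match cannot stop in the middle of a number
with_digit <- str_extract_all(string1, "\\d+(?!\\d| dollars)")[[1]]
with_digit
# [1] "100"
```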
For parts c and d, please take a look at string2.
string2 <- "USD100 PESO100"
c. Explain why the following matches the first 100 and not the second.
Answer: The pattern looks for exactly 3 digits immediately preceded by USD. The lookbehind (?<=USD) checks for USD without including it in the match.
str_view(string2, "(?<=USD)\\d{3}")
[1] │ USD<100> PESO100
d. Explain why the following matches the second 100 and not the first.
Answer: The pattern looks for exactly 3 digits that are not immediately preceded by USD, which is true only of the 100 after PESO.
str_view(string2, "(?<!USD)\\d{3}")
[1] │ USD100 PESO<100>
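The fixed width \\d{3} is what keeps the negative lookbehind from matching inside the first number. With \\d+ instead, a match can start one digit into “100”, where the three preceding characters are no longer “USD”. The sketch below illustrates this on a hypothetical variant of string2 (prices) and adds a second lookbehind as one possible fix.

```r
library(stringr)

# Hypothetical variant of string2 with two different amounts
prices <- "USD100 PESO250"

# Positive lookbehind: digits that directly follow "USD"
usd <- str_extract_all(prices, "(?<=USD)\\d+")[[1]]
usd
# [1] "100"

# Negative lookbehind alone: also matches "00" inside the USD amount
naive <- str_extract_all(prices, "(?<!USD)\\d+")[[1]]
naive
# [1] "00"  "250"

# Also forbid a preceding digit so a match must start a whole number
not_usd <- str_extract_all(prices, "(?<!USD)(?<!\\d)\\d+")[[1]]
not_usd
# [1] "250"
```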
Group Activity 3
Now, we will use TrumpTweetData.csv, a dataset of tweets, to demonstrate common tasks such as detecting patterns, filtering text, and summarizing string properties.
tweets <- read_csv("https://raw.githubusercontent.com/deepbas/statdatasets/main/TrumpTweetData.csv")
a. What proportion of tweets (text) mention “America”?
tweets %>%
  summarize(prop = mean(str_detect(str_to_title(text), "America")))
# A tibble: 1 × 1
prop
<dbl>
1 0.0926
Answer: About 9% of tweets mention “America”.
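Converting the text with str_to_title() before matching makes the search insensitive to case, so “AMERICA” and “america” are counted too. An alternative that leaves the text unchanged is to pass the pattern through regex() with ignore_case = TRUE. The sketch below compares both on a small made-up vector (toy is hypothetical and stands in for the text column); note that either approach also counts words that contain the pattern, such as “American”.

```r
library(stringr)

# Hypothetical stand-in for tweets$text
toy <- c("Make AMERICA great", "america first",
         "Hello world", "Great American workers")

# Approach from the activity: normalize case, then match
prop_title <- mean(str_detect(str_to_title(toy), "America"))

# Alternative: case-insensitive regex on the original text
prop_regex <- mean(str_detect(toy, regex("america", ignore_case = TRUE)))

c(prop_title, prop_regex)
# [1] 0.75 0.75
```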
b. What proportion of these tweets include “great”?
tweets %>% filter(str_detect(str_to_title(text), "America")) %>%
  summarize(prop = mean(str_detect(str_to_lower(text), "great")))
# A tibble: 1 × 1
prop
<dbl>
1 0.4
Answer: About 40% of the tweets that mention “America” also include “great”.
c. What proportion of the tweets mention @?
tweets %>% mutate(ct = str_count(text, "@")) %>%
  select(text, ct) %>%
  summarize(prop = mean(ct > 0))
# A tibble: 1 × 1
prop
<dbl>
1 0.317
Answer: About 32% of tweets mention @.
d. Remove the @ mentions from the tweets.
Mentions <- c("@[^\\s]+")
tw_noMentions <- tweets %>% mutate(textNoMentions = str_replace_all(text, Mentions, ""))
tw_noMentions$text[38]
[1] "My daughter @IvankaTrump will be on @Greta tonight at 7pm. Enjoy! https://t.co/QySC5PLFMy"
tw_noMentions$textNoMentions[38]
[1] "My daughter will be on tonight at 7pm. Enjoy! https://t.co/QySC5PLFMy"
Answer: The pattern has two parts. @ matches the “@” symbol, which usually marks the beginning of a mention in a tweet. [^\s]+ matches one or more characters that are not whitespace. The ^ inside the square brackets [ ] negates the character class, so the class matches any character that is not listed in it. The double backslash \\ escapes the backslash in the R string, so \\s in R code represents the whitespace character class \s. Finally, the + indicates one or more occurrences of the non-whitespace characters. Together, the pattern @[^\\s]+ matches any mention in a tweet, which usually starts with “@” followed by one or more non-whitespace characters.
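Replacing a mention with an empty string leaves the spaces that surrounded it, so the cleaned text can contain doubled spaces. A minimal sketch on the single tweet shown above, using str_squish() to collapse the extra whitespace (the names tweet, mentions, and cleaned are introduced here for illustration):

```r
library(stringr)

tweet <- "My daughter @IvankaTrump will be on @Greta tonight at 7pm. Enjoy! https://t.co/QySC5PLFMy"

# List the mentions that the pattern finds
mentions <- str_extract_all(tweet, "@[^\\s]+")[[1]]
mentions
# [1] "@IvankaTrump" "@Greta"

# Remove the mentions, then collapse the leftover double spaces
cleaned <- str_squish(str_replace_all(tweet, "@[^\\s]+", ""))
cleaned
# [1] "My daughter will be on tonight at 7pm. Enjoy! https://t.co/QySC5PLFMy"
```

Because [^\\s]+ takes everything up to the next whitespace, punctuation attached to a mention would be removed along with it.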
e. What proportion of tweets originated from an iPhone?
tweets %>% group_by(source) %>% summarize(count = n()) %>%
  mutate(prop = count / sum(count)) %>% filter(source == "iPhone")
# A tibble: 1 × 3
source count prop
<chr> <int> <dbl>
1 iPhone 628 0.415
Answer: About 42% of the tweets originated from an iPhone.
f. (Optional) Let’s deal with a number string that is longer than 9 digits.
ssn <- "([0-8]\\d{2})[-\\s]?(\\d{2})[-\\s]?(\\d{4})"
test <- c("123-45-67890", "1123 45 6789")
str_view_all(test, ssn)
[1] │ <123-45-6789>0
[2] │ 1<123 45 6789>
This example captures a 9-digit string as an SSN, but these strings are longer than 9 digits and may not represent an SSN. One way to deal with this is to use the negative lookbehind ?<! and negative lookahead ?! operators to ensure that the identified 9-digit string is not immediately preceded or followed by another digit.
If we “look behind” from the start of the SSN, we should not see another digit:
str_view_all(test, "(?<!\\d)([0-8]\\d{2})[-\\.\\s]?(\\d{2})[-\\.\\s]?(\\d{4})")
[1] │ <123-45-6789>0
[2] │ 1123 45 6789
And if we “look ahead” from the end of the SSN, we should not see another digit:
str_view_all(test, "(?<!\\d)([0-8]\\d{2})[-\\.\\s]?(\\d{2})[-\\.\\s]?(\\d{4})(?!\\d)")
[1] │ 123-45-67890
[2] │ 1123 45 6789
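To confirm that the stricter pattern still accepts a well-formed SSN, the sketch below adds one plausible 9-digit entry to the two overlong strings. The third string and the names ssn_strict, candidates, and found are assumptions made for illustration.

```r
library(stringr)

# Final pattern from the activity, with both negative lookarounds
ssn_strict <- "(?<!\\d)([0-8]\\d{2})[-\\.\\s]?(\\d{2})[-\\.\\s]?(\\d{4})(?!\\d)"

# Two overlong strings from the activity plus one hypothetical valid entry
candidates <- c("123-45-67890", "1123 45 6789", "SSN: 123-45-6789.")

# str_extract() returns NA where nothing matches
found <- str_extract(candidates, ssn_strict)
found
# [1] NA            NA            "123-45-6789"
```

The trailing period in the third string does not block the match, because the lookahead only rules out a following digit.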