L7 - Text Manipulation

Lecture Goals

Learn how to:
- Analyse textual data
- Use regular expressions for string manipulation

Readings
- R4DS: ch. 14
- String Manipulation cheatsheet

Strings

Text is represented as a sequence of characters called a string
- character data type in R, written in double quotes (" ") or single quotes (' ')

typeof("my text")
## [1] "character"

A string is different from a vector of characters in R: a string is a single element, so rev() leaves it unchanged

a = "string"
b = c("v", "e", "c", "t", "o", "r")
rev(a)
## [1] "string"
rev(b)
## [1] "r" "o" "t" "c" "e" "v"
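The difference shows up in length() versus nchar(). The following is a minimal base-R sketch, reusing a and b from the slide, of one way to reverse the characters inside a string; the names chars and reversed are introduced here for illustration only.

```r
a = "string"
b = c("v", "e", "c", "t", "o", "r")

length(a)   # one element: the whole string
## [1] 1
nchar(a)    # number of characters inside that element
## [1] 6
length(b)   # six one-character elements
## [1] 6

# Reverse the characters of a string: split, reverse, paste back together
chars = strsplit(a, "")[[1]]
reversed = paste(rev(chars), collapse = "")
reversed
## [1] "gnirts"
```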

String Operations

Look at 4 basic string operations
- Detection
- Subsetting
- Mutation
- Joining/Splitting
All operations are vectorised
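As a small illustration of what "vectorised" means here: each stringr function is applied to every element of a character vector at once, and missing values stay missing. The vector fruit below is a made-up example, not part of the Dinesafe data.

```r
library(stringr)

# Hypothetical example vector, including a missing value
fruit = c("apple", "banana", NA, "kiwi")

len = str_length(fruit)            # one length per element
len
## [1]  5  6 NA  4

has_an = str_detect(fruit, "an")   # one TRUE/FALSE/NA per element
has_an
## [1] FALSE  TRUE    NA FALSE
```

NA propagation of this kind is one reason summaries over detected patterns often need na.rm = TRUE.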

Dinesafe Data

library(tidyverse)   # provides read_csv(), the dplyr verbs, and stringr
dinesafe = read_csv("../data/dinesafe.csv")
details = dinesafe %>% select(INFRACTION_DETAILS)
head(details)
## # A tibble: 6 x 1
##   INFRACTION_DETAILS                                                       
##   <chr>                                                                    
## 1 Operator fail to properly wash equipment                                 
## 2 <NA>                                                                     
## 3 Fail to Hold a Valid Food Handler's Certificate. Muncipal Code Chapter 5~
## 4 Operator fail to properly wash equipment                                 
## 5 Operator fail to properly wash surfaces in rooms                         
## 6 Operate food premise - equipment not arranged to permit cleaning - Sec. 9

Detect

Look for a matching character pattern in a string
- str_detect() returns logical vector (match/no match per string)
- str_which() returns integer indices of the matching strings
- str_count() returns # matches in each string

details %>% 
  mutate( no_wash = str_detect( INFRACTION_DETAILS, 
                                "fail to properly wash" ) ) %>% 
  summarise( mean(no_wash, na.rm = TRUE) )
## # A tibble: 1 x 1
##   `mean(no_wash, na.rm = TRUE)`
##                           <dbl>
## 1                         0.248
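To compare the three detection functions side by side, here is a short sketch on a made-up vector (notes is hypothetical, chosen so that one string contains the pattern twice).

```r
library(stringr)

# Hypothetical inspection notes
notes = c("fail to wash hands", "no issues", "wash, rinse, wash again")

detected = str_detect(notes, "wash")   # does each string contain a match?
detected
## [1]  TRUE FALSE  TRUE

positions = str_which(notes, "wash")   # which elements contain a match?
positions
## [1] 1 3

counts = str_count(notes, "wash")      # how many matches in each string?
counts
## [1] 1 0 2
```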

Regular Expressions

Simple patterns described by fixed strings
- E.g. sentence containing word "wash"
More complicated patterns described by regular expressions
- E.g. name starting with "S"
Regular expressions define string templates using a combination of regular and special characters (metacharacters)
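A brief sketch of the distinction, assuming stringr is loaded. By default stringr treats a pattern as a regular expression; wrapping it in fixed() or escaping the metacharacter makes it match literally. The vectors people and x are invented for illustration.

```r
library(stringr)

# Hypothetical names
people = c("Sofia", "LaShawn", "Chris")

contains_S = str_detect(people, "S")    # "S" anywhere in the name
contains_S
## [1]  TRUE  TRUE FALSE

starts_S = str_detect(people, "^S")     # "S" at the start of the name
starts_S
## [1]  TRUE FALSE FALSE

# "." is a metacharacter, so it matches any character unless made literal
x = c("a.b", "ab")
any_char = str_detect(x, ".")
any_char
## [1] TRUE TRUE

literal_dot = str_detect(x, fixed("."))   # same result as "\\."
literal_dot
## [1]  TRUE FALSE
```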

Metacharacters

Wildcard ( $\rightarrow$ matching )
- . $\rightarrow$ any character (except newline)
- \\s $\rightarrow$ any whitespace
- \\d $\rightarrow$ any digit (0-9)
- \\w $\rightarrow$ any word character (letter, digit, or underscore)
Anchors
- ^ $\rightarrow$ start of string
- $ $\rightarrow$ end of string
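The wildcards and anchors above can be tried on a small made-up vector (x is hypothetical). Note the doubled backslash: R's string parser consumes one backslash, so "\\d" reaches the regex engine as \d.

```r
library(stringr)

# Hypothetical strings
x = c("room 12", "room_A", "12 rooms", "room")

starts_room = str_detect(x, "^room")   # "room" at the start
starts_room
## [1]  TRUE  TRUE FALSE  TRUE

ends_digit = str_detect(x, "\\d$")     # a digit at the end
ends_digit
## [1]  TRUE FALSE FALSE FALSE

has_space = str_detect(x, "\\s")       # any whitespace
has_space
## [1]  TRUE FALSE  TRUE FALSE

room_word = str_detect(x, "room\\w")   # "room" followed by a word character
room_word
## [1] FALSE  TRUE  TRUE FALSE

room_any = str_extract(x, "room.")     # "room" followed by any character
room_any
## [1] "room " "room_" "rooms" NA
```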

Metacharacters

Quantifiers
- * $\rightarrow$ zero or more
- + $\rightarrow$ one or more
Alternates
- a|b $\rightarrow$ a or b
- [abc] $\rightarrow$ a, b, or c (same as [a-c])
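A quick sketch of quantifiers and alternates, using a hypothetical vector x. The anchors ^ and $ force the pattern to cover the whole string, which makes the effect of * versus + easier to see.

```r
library(stringr)

# Hypothetical strings
x = c("ct", "cat", "caat", "cot")

zero_or_more = str_detect(x, "^ca*t$")   # c, then zero or more a, then t
zero_or_more
## [1]  TRUE  TRUE  TRUE FALSE

one_or_more = str_detect(x, "^ca+t$")    # c, then one or more a, then t
one_or_more
## [1] FALSE  TRUE  TRUE FALSE

either = str_detect(x, "^c(a|o)t$")      # a or o in the middle
either
## [1] FALSE  TRUE FALSE  TRUE

char_class = str_detect(x, "^c[ao]t$")   # same idea with a character class
char_class
## [1] FALSE  TRUE FALSE  TRUE
```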

Metacharacters

Groups
- (ab)|a $\rightarrow$ ab or a, vs
- a(b|a) $\rightarrow$ ab or aa
Look Arounds
- a(?=b) $\rightarrow$ a followed by b (e.g. abba)
- (?<=a)b $\rightarrow$ b preceded by a (e.g. abba)
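The group and look-around patterns above can be checked with str_extract() and str_locate(). This is a small sketch on invented inputs; the price string in the last line is a hypothetical example of a common use of look-behind.

```r
library(stringr)

y = c("ab", "aa", "a")

group_or_a = str_extract(y, "(ab)|a")   # whole "ab", otherwise a single "a"
group_or_a
## [1] "ab" "a"  "a"

a_then_group = str_extract(y, "a(b|a)") # "a" followed by b or a
a_then_group
## [1] "ab" "aa" NA

# Look-arounds test the context but are not part of the match itself
ahead = str_locate("abba", "a(?=b)")    # first "a": the last "a" has no "b" after it
ahead[1, "start"]
## start 
##     1
behind = str_locate("abba", "(?<=a)b")  # first "b": the second "b" follows a "b"
behind[1, "start"]
## start 
##     2

# Hypothetical use: digits that come right after a dollar sign
amount = str_extract("price: $25", "(?<=\\$)\\d+")
amount
## [1] "25"
```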

Regular Expressions

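Test a regular expression on strings using str_view()/str_view_all(), which highlight the matches (in stringr 1.5.0 and later, str_view() shows all matches and str_view_all() is deprecated)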

name = c("Tajinder", "Mustafa", "Liu Wei")
str_view(name, "(^.a)|(.a$)")
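Since str_view() produces highlighted output that does not survive in plain text, a rough text-only equivalent is to pull out the matched substring with str_extract(). The sketch below reuses name and the pattern from the slide; matched is a name introduced here.

```r
library(stringr)

name = c("Tajinder", "Mustafa", "Liu Wei")

# Second character is "a", or last character is "a" (with the character before it)
matched = str_extract(name, "(^.a)|(.a$)")
matched
## [1] "Ta" "fa" NA
```

"Liu Wei" gives NA because neither its second nor its last character is "a".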

Subset

Extract strings or substrings
- str_subset() returns the strings that contain a match
- str_extract() returns the first matching substring from each string

dinesafe %>% 
  distinct(ESTABLISHMENT_ID, .keep_all = TRUE) %>% 
  pull(ESTABLISHMENT_NAME) %>% 
  str_subset( pattern = "PIZZA" ) %>% length()
## [1] 492
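To make the difference between the two functions concrete, here is a sketch on a made-up vector of establishment names (places is hypothetical and not taken from the Dinesafe data).

```r
library(stringr)

# Hypothetical establishment names
places = c("MAMMA PIZZA", "Joe's Pizza", "SUSHI BAR", "PIZZAVILLE")

kept = str_subset(places, "PIZZA")        # whole strings that contain a match
kept
## [1] "MAMMA PIZZA" "PIZZAVILLE"

found = str_extract(places, "PIZZA\\w*")  # only the matching part, NA if none
found
## [1] "PIZZA"      NA           NA           "PIZZAVILLE"
```

str_subset() shortens the vector, while str_extract() keeps one result per input, which is what mutate() needs.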

Mutate

Change strings or parts thereof
- str_replace() replaces the first match of a pattern with a string (str_replace_all() replaces every match)
- str_to_lower() / str_to_upper() convert to lower-/upper-case

dinesafe %>% 
  distinct(ESTABLISHMENT_ID, .keep_all = TRUE) %>% 
  mutate( ESTABLISHMENT_NAME = str_to_upper(ESTABLISHMENT_NAME) ) %>% 
  filter( str_detect(ESTABLISHMENT_NAME, "PIZZA") ) %>% dim_desc()
## [1] "[549 x 16]"
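The count rises from 492 to 549 once names are upper-cased, presumably because some names are stored in mixed case. As a sketch of an alternative, assuming stringr's regex() helper, the match itself can be made case-insensitive without changing the data; places is a hypothetical vector.

```r
library(stringr)

# Hypothetical establishment names in mixed case
places = c("MAMMA PIZZA", "Joe's Pizza", "SUSHI BAR")

exact = str_detect(places, "PIZZA")                  # case-sensitive
exact
## [1]  TRUE FALSE FALSE

upper = str_detect(str_to_upper(places), "PIZZA")    # normalise case first
upper
## [1]  TRUE  TRUE FALSE

nocase = str_detect(places, regex("pizza", ignore_case = TRUE))
nocase
## [1]  TRUE  TRUE FALSE

# str_replace() changes the first match only; str_replace_all() changes all
first_only = str_replace("wash, rinse, wash", "wash", "WASH")
first_only
## [1] "WASH, rinse, wash"
every = str_replace_all("wash, rinse, wash", "wash", "WASH")
every
## [1] "WASH, rinse, WASH"
```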

Join

Combine strings element-wise across vectors
- str_c(..., sep = " ") joins corresponding elements, with sep as the separator string
- collapse = "" additionally collapses the result vector into a single string

str_c(1:3, name, sep = " - ")
## [1] "1 - Tajinder" "2 - Mustafa"  "3 - Liu Wei"
str_c(1:3, name, sep = " - ", collapse = ", ")
## [1] "1 - Tajinder, 2 - Mustafa, 3 - Liu Wei"
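A few further cases worth knowing, sketched under the assumption that stringr is loaded: shorter arguments are recycled, collapse can be used on its own, and a missing value in any input makes the joined result missing.

```r
library(stringr)

ids = str_c("id", 1:3, sep = "_")             # "id" is recycled across 1:3
ids
## [1] "id_1" "id_2" "id_3"

word = str_c(c("a", "b", "c"), collapse = "") # collapse only: one string out
word
## [1] "abc"

with_na = str_c("x", NA)                      # NA propagates
with_na
## [1] NA
```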

Split

str_split() splits strings along a pattern and returns a list of character vectors
- str_split_fixed() returns a fixed # of pieces as a character matrix

str_split(name, " ")
## [[1]]
## [1] "Tajinder"
## 
## [[2]]
## [1] "Mustafa"
## 
## [[3]]
## [1] "Liu" "Wei"
str_split_fixed(name, "\\s", 2)
##      [,1]       [,2] 
## [1,] "Tajinder" ""   
## [2,] "Mustafa"  ""   
## [3,] "Liu"      "Wei"
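Because str_split() returns a list whose elements can differ in length, a follow-up step is usually needed to get back to a plain vector. The sketch below reuses name from the slides; pieces and first_piece are names introduced for illustration.

```r
library(stringr)

name = c("Tajinder", "Mustafa", "Liu Wei")

pieces = str_split(name, " ")
lengths(pieces)                  # number of pieces per input string
## [1] 1 1 2

# Take the first piece of each element to get a character vector again
first_piece = sapply(pieces, function(p) p[1])
first_piece
## [1] "Tajinder" "Mustafa"  "Liu"
```

With str_split_fixed() the same result is the first column of the returned matrix.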

More String Functions

str_trim() trims leading and trailing whitespace
str_pad() pads strings to a minimum width
str_trunc() truncates strings to a maximum width
str_wrap() wraps strings into fixed-width paragraphs

str_pad(name, width = 10, side = "right")
## [1] "Tajinder  " "Mustafa   " "Liu Wei   "
str_pad(name, width = 10, side = "right") %>% 
  str_trim()
## [1] "Tajinder" "Mustafa"  "Liu Wei"
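The remaining helpers can be sketched in the same way. The padded codes and the wrapped sentence are illustrative inputs; the exact line breaks chosen by str_wrap() are not shown because they depend on its wrapping rules.

```r
library(stringr)

name = c("Tajinder", "Mustafa", "Liu Wei")

short = str_trunc(name, width = 6)     # width includes the "..." ellipsis
short
## [1] "Taj..." "Mus..." "Liu..."

codes = str_pad(c("7", "42"), width = 3, side = "left", pad = "0")
codes
## [1] "007" "042"

# str_wrap() inserts newlines so each line stays near the target width
wrapped = str_wrap("Operator fail to properly wash equipment", width = 20)
cat(wrapped, "\n")
```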