Lecture Goals

  • Learn how to:
    • Analyse textual data
    • Use regular expressions for string manipulation
  • Readings

Strings

  • Text represented as sequence of characters called string
    • character data type in R, written in double (" ") or single (' ') quotes
typeof("my text")
## [1] "character"
  • String different from vector of characters in R
a = "string"
b = c("v", "e", "c", "t", "o", "r")
rev(a)
## [1] "string"
rev(b)
## [1] "r" "o" "t" "c" "e" "v"
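The difference shows up as soon as you ask for lengths: a string is one element no matter how many characters it holds. The sketch below uses base R only; the split/reverse/rejoin idiom is one common way to reverse a string and is not taken from the lecture.

```r
a = "string"
b = c("v", "e", "c", "t", "o", "r")

length(a)  # 1: a single string
nchar(a)   # 6: characters inside that string
length(b)  # 6: six separate one-character strings

# Reversing the characters of a string: split, reverse, rejoin
chars = strsplit(a, "")[[1]]
reversed = paste(rev(chars), collapse = "")
reversed   # "gnirts"
```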

String Operations

  • Look at 4 basic string operations
    • Detection
    • Subsetting
    • Mutation
    • Joining/Splitting
  • All operations are vectorised
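As a small illustration of vectorisation, each stringr function below is applied to every element of a character vector in one call, and missing values stay missing. The vector x is made up for this sketch.

```r
library(stringr)

x = c("apple", "banana", NA, "cherry")  # hypothetical example vector

has_an = str_detect(x, "an")   # FALSE TRUE NA FALSE
upper  = str_to_upper(x)       # "APPLE" "BANANA" NA "CHERRY"
n_char = str_length(x)         # 5 6 NA 6
```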

Dinesafe Data

library(tidyverse)  # provides read_csv(), the dplyr verbs and stringr
dinesafe = read_csv("../data/dinesafe.csv")
details = dinesafe %>% select(INFRACTION_DETAILS)
head(details)
## # A tibble: 6 x 1
##   INFRACTION_DETAILS                                                       
##   <chr>                                                                    
## 1 Operator fail to properly wash equipment                                 
## 2 <NA>                                                                     
## 3 Fail to Hold a Valid Food Handler's Certificate. Muncipal Code Chapter 5~
## 4 Operator fail to properly wash equipment                                 
## 5 Operator fail to properly wash surfaces in rooms                         
## 6 Operate food premise - equipment not arranged to permit cleaning - Sec. 9

Detect

  • Look for matching character pattern in string
    • str_detect() returns logical vector (match/no match for each string)
    • str_which() returns integer indices of the matching strings
    • str_count() returns # of matches within each string
details %>% 
  mutate( no_wash = str_detect( INFRACTION_DETAILS, 
                                "fail to properly wash" ) ) %>% 
  summarise( mean(no_wash, na.rm = TRUE) )
## # A tibble: 1 x 1
##   `mean(no_wash, na.rm = TRUE)`
##                           <dbl>
## 1                         0.248
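To compare the three detection functions side by side, here is a minimal sketch on a small vector. The vector infractions is illustrative: the first and third strings are copied from the output above, the last one is invented.

```r
library(stringr)

infractions = c("Operator fail to properly wash equipment",
                NA,
                "Operator fail to properly wash surfaces in rooms",
                "Food handler fail to wear headgear")  # invented example

detected = str_detect(infractions, "wash")  # TRUE NA TRUE FALSE
rows     = str_which(infractions, "wash")   # 1 3 (NA is skipped)
n_o      = str_count(infractions, "o")      # 3 NA 5 3 (lowercase "o" only)

# Share of non-missing entries mentioning "wash"
mean(detected, na.rm = TRUE)                # 2/3
```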

Regular Expressions

  • Simple patterns described by fixed strings
    • E.g. sentence containing word "wash"
  • More complicated patterns described by regular expressions
    • E.g. name starting with "S"
  • Regular expressions define string templates using a combination of regular and special characters (metacharacters)

Metacharacters

  • Wildcard ( \(\rightarrow\) matching )
    • . \(\rightarrow\) any character (except newline)
    • \\s \(\rightarrow\) any whitespace
    • \\d \(\rightarrow\) any digit (0-9)
    • \\w \(\rightarrow\) any word character (letter, digit, or underscore)
  • Anchors
    • ^ \(\rightarrow\) start of string
    • $ \(\rightarrow\) end of string

Metacharacters

  • Quantifiers
    • * \(\rightarrow\) zero or more
    • + \(\rightarrow\) one or more
  • Alternates
    • a|b \(\rightarrow\) a or b
    • [abc] \(\rightarrow\) a, b, or c (same as [a-c])

Metacharacters

  • Groups
    • (ab)|a \(\rightarrow\) ab or a, vs
    • a(b|a) \(\rightarrow\) ab or aa
  • Look Arounds
    • a(?=b) \(\rightarrow\) a followed by b (e.g. the first a in abba)
    • (?<=a)b \(\rightarrow\) b preceded by a (e.g. the first b in abba)
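Groups and look-arounds are easiest to understand by checking what is actually matched. The sketch below uses str_extract_all() and str_locate(); the "42 CAD" string is an invented example showing that the look-ahead text is required but not returned.

```r
library(stringr)

# Grouping changes what the alternation applies to
g1 = str_extract_all("abba", "(ab)|a")[[1]]   # "ab" "a"
g2 = str_extract_all("abaa", "a(b|a)")[[1]]   # "ab" "aa"

# Look-arounds test the neighbouring text without including it in the match
ahead  = str_locate("abba", "a(?=b)")    # start 1, end 1: the first a
behind = str_locate("abba", "(?<=a)b")   # start 2, end 2: the first b

amount = str_extract("Fine: 42 CAD", "\\d+(?= CAD)")  # "42"
```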

Regular Expressions

  • Test regular expression on string using str_view() (str_view_all() is deprecated since stringr 1.5.0; str_view() now shows all matches)
name = c("Tajinder", "Mustafa", "Liu Wei")
str_view(name, "(^.a)|(.a$)")

Subset

  • Extract substrings
    • str_subset() returns the strings that contain a match
    • str_extract() returns the first matched substring (NA if no match)
dinesafe %>% 
  distinct(ESTABLISHMENT_ID, .keep_all = TRUE) %>% 
  pull(ESTABLISHMENT_NAME) %>% 
  str_subset( pattern = "PIZZA" ) %>% length()
## [1] 492
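The difference between the two functions is clearer on a small vector: str_subset() drops non-matching strings, while str_extract() keeps one result per input. The establishment names below are made up for this sketch.

```r
library(stringr)

shops = c("PIZZA PIZZA", "Pizza Nova", "SUSHI ON BLOOR", "241 PIZZA")  # invented names

kept    = str_subset(shops, "PIZZA")    # "PIZZA PIZZA" "241 PIZZA"
matched = str_extract(shops, "PIZZA")   # "PIZZA" NA NA "PIZZA"
numbers = str_extract(shops, "\\d+")    # NA NA NA "241"
```

Matching is case-sensitive, so "Pizza Nova" is not picked up by the pattern "PIZZA".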

Mutate

  • Change strings or parts thereof
    • str_replace() replaces first match of pattern with string (str_replace_all() replaces every match)
    • str_to_lower/upper() convert to lower-/upper-case
dinesafe %>% 
  distinct(ESTABLISHMENT_ID, .keep_all = TRUE) %>% 
  mutate( ESTABLISHMENT_NAME = str_to_upper(ESTABLISHMENT_NAME) ) %>% 
  filter( str_detect(ESTABLISHMENT_NAME, "PIZZA") ) %>% dim_desc()
## [1] "[549 x 16]"
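A minimal sketch of the mutation functions on a made-up vector. It also shows an alternative to converting case before detection: wrapping the pattern in stringr's regex() with ignore_case = TRUE. This alternative is a suggestion, not what the lecture code does.

```r
library(stringr)

x = c("Pizza Pizza", "pizza nova")  # invented names

first_only = str_replace(x, "[Pp]izza", "PIZZA")      # "PIZZA Pizza" "PIZZA nova"
all_match  = str_replace_all(x, "[Pp]izza", "PIZZA")  # "PIZZA PIZZA" "PIZZA nova"
lower      = str_to_lower(x)                          # "pizza pizza" "pizza nova"

# Case-insensitive detection without modifying the strings
any_case = str_detect(x, regex("pizza", ignore_case = TRUE))  # TRUE TRUE
```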

Join

  • Combine strings element-wise across vectors
    • sep = " " sets the separator placed between the joined elements
    • collapse = ", " additionally joins the result vector into a single string
str_c(1:3, name, sep = " - ")
## [1] "1 - Tajinder" "2 - Mustafa"  "3 - Liu Wei"
str_c(1:3, name, sep = " - ", collapse = ", ")
## [1] "1 - Tajinder, 2 - Mustafa, 3 - Liu Wei"

Split

  • str_split() splits each string along pattern, returns a list
    • str_split_fixed() returns fixed # of pieces as a character matrix
str_split(name, " ")
## [[1]]
## [1] "Tajinder"
## 
## [[2]]
## [1] "Mustafa"
## 
## [[3]]
## [1] "Liu" "Wei"
str_split_fixed(name, "\\s", 2)
##      [,1]       [,2] 
## [1,] "Tajinder" ""   
## [2,] "Mustafa"  ""   
## [3,] "Liu"      "Wei"

More String Functions

  • str_trim() trim whitespace
  • str_pad() pad strings to constant width
  • str_trunc() truncate strings to maximum width
  • str_wrap() wrap string to fixed width paragraph
str_pad(name, width = 10, side = "right")
## [1] "Tajinder  " "Mustafa   " "Liu Wei   "
str_pad(name, width = 10, side = "right") %>% 
  str_trim()
## [1] "Tajinder" "Mustafa"  "Liu Wei"
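The remaining two functions, plus left-padding with a custom character, can be sketched as follows. The input string is taken from the Dinesafe output shown earlier; the widths are arbitrary choices for illustration.

```r
library(stringr)

txt = "Operator fail to properly wash equipment"

# Truncate to at most 20 characters; the trailing "..." counts towards the width
short = str_trunc(txt, width = 20)
str_length(short)   # 20

# Wrap into lines of roughly 20 characters; line breaks are inserted as "\n"
wrapped = str_wrap(txt, width = 20)
cat(wrapped, "\n")

# Pad on the left with zeros to a constant width
ids = str_pad(c("7", "42"), width = 3, side = "left", pad = "0")  # "007" "042"
```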