L8 - Hierarchical & Web Data

Lecture Goals

Understand hierarchical/tree data
- Read data in XML format
Scrape data from the web
- Read data from HTML documents
Readings
- XML Basics
- Data Scraping example in R

Hierarchical Data

Data with tree-like structure
- Collection of nested nodes containing information and/or other nodes
- E.g., a company's organizational structure
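
As a minimal sketch of this idea, a tree can be written in R as a nested list in which each node holds a name and a list of child nodes. The node names and the `count_nodes()` helper below are illustrative assumptions, not part of the lecture material.

```r
# Hypothetical org chart stored as a nested list:
# each node has a name and a (possibly empty) list of child nodes.
org <- list(
  name = "CEO",
  children = list(
    list(name = "COO", children = list(
      list(name = "Op_Manager", children = list())
    )),
    list(name = "CMO", children = list()),
    list(name = "CFO", children = list())
  )
)

# Recursively count the nodes: the node itself plus all of its descendants
count_nodes <- function(node) {
  1 + sum(vapply(node$children, count_nodes, numeric(1)))
}

n_nodes <- count_nodes(org)
print(n_nodes)
```

The recursion mirrors the definition: a node contains information and, optionally, other nodes.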

Markup Languages

Hierarchical info can be described with markup languages
- Most general is XML (Extensible Markup Language)
- Others include HTML (HyperText Markup Language), Markdown, and LaTeX

Sample XML

<CEO>
  <COO>
    <Op_Manager> ... </Op_Manager>
  </COO>
  <CMO> ... </CMO>
  <CFO> ... </CFO>
</CEO>

XML Basics

XML documents consist of nested elements, some with attributes

Elements demarcated by <element_name> ... </element_name>
Attributes given by key-value pairs <person gender="female">

Element content can be data or other elements
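
To make the element/attribute distinction concrete, here is a small sketch that parses a one-element document from a string with the xml2 package. The `<person>` document and its values are made up for illustration.

```r
library(xml2)

# Parse a tiny XML document from a string (made-up example data)
doc <- read_xml('<person gender="female"><name>Ada</name></person>')

root_name   <- xml_name(doc)                     # name of the root element
gender      <- xml_attr(doc, "gender")           # value of the gender attribute
person_name <- xml_text(xml_child(doc, "name"))  # text content of the <name> child

print(c(root_name, gender, person_name))
```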

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  ...
</bookstore>

XML Tree

(https://www.w3schools.com/xml/xml_syntax.asp)

XML in R

xml2 package for working with XML data

library(xml2)
bookstore = read_xml("./data/bookstore.xml")
class( bookstore )
## [1] "xml_document" "xml_node"
bookstore
## {xml_document}
## <bookstore>
## [1] <book category="cooking">\n  <title lang="en">Everyday Italian</titl ...
## [2] <book category="children">\n  <title lang="en">Harry Potter</title>\ ...
## [3] <book category="web">\n  <title lang="en">XQuery Kick Start</title>\ ...
## [4] <book category="web" cover="paperback">\n  <title lang="en">Learning ...
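
The same kind of object can be explored without the data file. The sketch below uses a cut-down, two-book stand-in for `bookstore.xml` (an assumption, not the full file) and inspects the root and its children.

```r
library(xml2)

# Cut-down stand-in for ./data/bookstore.xml (assumed contents, two books only)
doc <- read_xml('<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title></book>
  <book category="web"><title lang="en">XQuery Kick Start</title></book>
</bookstore>')

root_name   <- xml_name(doc)                # name of the root element
n_books     <- xml_length(doc)              # number of child elements under the root
child_names <- xml_name(xml_children(doc))  # element name of each child

print(root_name)
print(n_books)
print(child_names)
```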

XPath

XPath uses path-like expressions to access nodes (elements or attributes) in an XML document

expression	selects
`/bookstore`	root bookstore element
`//book`	all book elements, anywhere in document
`/bookstore/book[last()-1]`	next-to-last book element
`//@lang`	all lang attributes
`//*[@*]`	any element with any attribute
`//title[@lang='en']`	all title elements with attribute lang='en'

(https://www.w3schools.com/xml/xpath_syntax.asp)
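
The expressions in the table can be tried out on a small document. This is a sketch using a two-book stand-in for the bookstore file; the second book's price is assumed for illustration.

```r
library(xml2)

# Two-book stand-in for bookstore.xml (second price is illustrative)
doc <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="web">
    <title lang="en">XQuery Kick Start</title>
    <price>49.99</price>
  </book>
</bookstore>')

all_books    <- xml_find_all(doc, "//book")                        # every book element
next_to_last <- xml_find_first(doc, "/bookstore/book[last()-1]")   # next-to-last book
langs        <- xml_text(xml_find_all(doc, "//@lang"))             # values of all lang attributes
with_attr    <- xml_find_all(doc, "//*[@*]")                       # any element that has an attribute
en_titles    <- xml_text(xml_find_all(doc, "//title[@lang='en']")) # titles with lang='en'

print(length(all_books))
print(xml_attr(next_to_last, "category"))
print(langs)
print(xml_name(with_attr))
print(en_titles)
```

With only two books, the next-to-last book is the first one.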

`xml2` Functions

xml_structure() shows the structure of an XML document
xml_find_first()/xml_find_all() find the first/all nodes matching an XPath expression
xml_attr() retrieves a node attribute
xml_text() retrieves node content as text

Example

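library(magrittr)  # provides the %>% pipe (also attached by tidyverse or rvest)
# Authors of all books priced under 40
bookstore %>% 
  xml_find_all( "//book[price<40]/author") %>% 
  xml_text()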
## [1] "Giada De Laurentiis" "J K. Rowling"        "Erik T. Ray"

bookstore %>% 
  xml_find_all( "//book[@category='web']/title") %>% 
  xml_attr("lang")
## [1] "en" "en"
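
For analysis, the extracted pieces are often combined into a data frame. A possible sketch follows, assuming a two-book stand-in for the bookstore file (the second price is illustrative) and a hypothetical `book_df` layout.

```r
library(xml2)

# Two-book stand-in for bookstore.xml (second price is illustrative)
doc <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="web">
    <title lang="en">XQuery Kick Start</title>
    <price>49.99</price>
  </book>
</bookstore>')

books <- xml_find_all(doc, "//book")

# One row per book: attribute, text content, and a numeric conversion
book_df <- data.frame(
  category = xml_attr(books, "category"),
  title    = xml_text(xml_find_first(books, "./title")),
  price    = as.numeric(xml_text(xml_find_first(books, "./price"))),
  stringsAsFactors = FALSE
)

print(book_df)
```

Calling `xml_find_first()` on the node set returns one match per book, so the columns stay aligned.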

Web Data

Ways to extract information from the web
- Dedicated interface (API) provided by the site
- Scraping webpages

Web Scraping

Webpages are expressed as HTML documents
- Browsers interpret the HTML and render the content
- HTML is similar to XML and can be parsed in much the same way

<!DOCTYPE html>
<html>
<head>
  <title>STAA57 - Hierarchical & Web Data</title>
  <meta charset="utf-8">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
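
To illustrate that HTML can be handled like XML, the sketch below parses a shortened version of this header from a string with `read_html()` and queries it with XPath. The `<body>` content is an assumption added so the document is complete.

```r
library(xml2)

# Shortened HTML document as a string; the <body> content is made up
page <- read_html('<!DOCTYPE html>
<html>
<head>
  <title>STAA57 - Hierarchical &amp; Web Data</title>
  <meta charset="utf-8">
</head>
<body><p>Hello</p></body>
</html>')

page_title <- xml_text(xml_find_first(page, "//title"))   # content of <title>
charset    <- xml_attr(xml_find_first(page, "//meta"), "charset")

print(page_title)
print(charset)
```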

Example

Search workopolis.ca for data science jobs in Toronto: https://www.workopolis.com/jobsearch/find-jobs?ak=data+scientist&l=Toronto%2C+ON

Web Scraping in R

The rvest package offers functionality similar to xml2, for HTML
Common workflow
- Read document tree with read_html()
- Select nodes with html_nodes()
- Extract info with html_text()/html_attr()

Example

library(rvest)
URL = "https://www.workopolis.com/jobsearch/find-jobs?ak=data+scientist&l=Toronto%2C+ON"
read_html(URL) %>% 
  html_nodes(xpath = "//*[@class='JobCard-title']") %>% 
  html_attr("title") %>% head()

## [1] "Data Scientist - AI/ML"                              
## [2] "Data Scientist"                                      
## [3] "Data Team Lead"                                      
## [4] "Data Scientist, for Advanced Analytics with Big Data"
## [5] "Research Scientist, Google AI"                       
## [6] "Research Scientist, Google Brain"
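
Live pages change over time, so the same pipeline can be tried offline on an HTML string. In this sketch the `JobCard-title` class and `title` attribute follow the example above, while the `<a>` elements, `href` values, and the page itself are assumptions.

```r
library(rvest)

# Made-up page fragment imitating the job cards; only the class name and
# the title attribute are taken from the example above
page <- read_html('<html><body>
  <a class="JobCard-title" title="Data Scientist" href="/job/1">Data Scientist</a>
  <a class="JobCard-title" title="Data Team Lead" href="/job/2">Data Team Lead</a>
  <a class="Other-link" title="About" href="/about">About</a>
</body></html>')

# Read -> select nodes by XPath -> extract an attribute
titles <- page %>% 
  html_nodes(xpath = "//*[@class='JobCard-title']") %>% 
  html_attr("title")

print(titles)
```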

Navigating HTML

View source or inspect elements of webpage in your browser
- Available in Firefox/Chrome/Edge/Safari

CSS Selectors

HTML documents have standard elements & structure
CSS selectors are an easier way to navigate them

expression	selects
`body`	all <body> elements
`.title`	all elements with class = 'title'
`#fname`	all elements with id = 'fname'
`[title]`	all elements with title attribute

(https://www.w3schools.com/csSref/css_selectors.asp)
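
Each selector in the table can be tested on a small page. The HTML below is a made-up fragment chosen so that every selector matches something.

```r
library(rvest)

# Made-up page with a class, an id, and a title attribute to select on
page <- read_html('<html><body>
  <h1 class="title">Jobs</h1>
  <input id="fname" type="text">
  <a class="title" title="Data Scientist" href="#">Data Scientist</a>
</body></html>')

n_body      <- length(html_nodes(page, css = "body"))             # element selector
title_class <- html_text(html_nodes(page, css = ".title"))        # class selector
fname_type  <- html_attr(html_nodes(page, css = "#fname"), "type") # id selector
has_title   <- html_attr(html_nodes(page, css = "[title]"), "title") # attribute selector

print(n_body)
print(title_class)
print(fname_type)
print(has_title)
```

Note that `.title` matches the class, while `[title]` matches only elements carrying a `title` attribute.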

Example

Same as before, using CSS selector

read_html(URL) %>% 
  html_nodes(css = ".JobCard-title") %>% 
  html_attr("title") %>% head()
## [1] "Data Scientist - AI/ML"                              
## [2] "Data Scientist"                                      
## [3] "Data Team Lead"                                      
## [4] "Data Scientist, for Advanced Analytics with Big Data"
## [5] "Research Scientist, Google AI"                       
## [6] "Research Scientist, Google Brain"
