Lecture Goals

Hierarchical Data

  • Data with tree-like structure
    • Collection of nested nodes containing information and/or other nodes
    • E.g., a company's organizational structure
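
A nested list is one way to hold such a tree in R. The sketch below is only illustrative: the `org` object and its empty leaf nodes are assumptions, with role names borrowed from the organizational example.

```r
# Hypothetical org chart stored as a nested list:
# each node is a list whose elements are its child nodes.
org <- list(
  CEO = list(
    COO = list(Op_Manager = list()),
    CMO = list(),
    CFO = list()
  )
)

names(org$CEO)       # direct reports of the CEO
names(org$CEO$COO)   # one level further down the tree
str(org)             # prints the full nesting
```

Walking down the tree is just repeated `$` indexing, which mirrors how nested elements are reached in a markup document.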

Markup Languages

  • Hierarchical info can be described with markup languages
    • Most general is XML (Extensible Markup Language)
    • Others include HTML (HyperText Markup Language), Markdown and LaTeX
  • Sample XML

    <CEO>
      <COO>
        <Op_Manager> ... </Op_Manager>
      </COO>
      <CMO> ... </CMO>
      <CFO> ... </CFO>
    </CEO>

XML Basics

  • XML documents consist of nested elements, some with attributes
    • Elements demarcated by <element_name> ... </element_name>
    • Attributes given by key-value pairs in the opening tag, e.g. <person gender="female">
    • Element content can be data or other elements

      <?xml version="1.0" encoding="UTF-8"?>
      <bookstore>
        <book category="cooking">
          <title lang="en">Everyday Italian</title>
          <author>Giada De Laurentiis</author>
          <year>2005</year>
          <price>30.00</price>
        </book>
        ...
      </bookstore>

XML Tree

XML in R

  • xml2 package for working with XML data
library(xml2)
bookstore = read_xml("./data/bookstore.xml")
class( bookstore )
## [1] "xml_document" "xml_node"
bookstore
## {xml_document}
## <bookstore>
## [1] <book category="cooking">\n  <title lang="en">Everyday Italian</titl ...
## [2] <book category="children">\n  <title lang="en">Harry Potter</title>\ ...
## [3] <book category="web">\n  <title lang="en">XQuery Kick Start</title>\ ...
## [4] <book category="web" cover="paperback">\n  <title lang="en">Learning ...
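
When no file is at hand, `read_xml()` also accepts a literal XML string, which makes it easy to experiment. The following is a minimal sketch on a made-up, one-book document (the `doc` and `book` names are illustrative, not part of the lecture data).

```r
library(xml2)

# Small made-up document, kept inline so the example is self-contained
doc <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
</bookstore>')

book <- xml_child(doc, 1)      # first child element of the root
xml_name(book)                 # element name
xml_attrs(book)                # all attributes as a named character vector
xml_name(xml_children(book))   # names of the nested elements
```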

XPath

  • XPath uses path-like expressions to access nodes (elements or attributes) in an XML document
    expression                   selects
    /bookstore                   root bookstore element
    //book                       all book elements, anywhere in document
    /bookstore/book[last()-1]    next-to-last book element under bookstore
    //@lang                      all lang attributes
    //*[@*]                      any element with at least one attribute
    //title[@lang='en']          all title elements with attribute lang='en'

(https://www.w3schools.com/xml/xpath_syntax.asp)
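
The expressions can be tried out on a small inline document. This is a sketch under the assumption of a made-up three-book `doc`, not the lecture's bookstore.xml file.

```r
library(xml2)

# Made-up three-book document for experimenting with XPath
doc <- read_xml('<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title></book>
  <book category="web"><title lang="en">XQuery Kick Start</title></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title></book>
</bookstore>')

xml_name(xml_find_all(doc, "/bookstore"))                        # root element
length(xml_find_all(doc, "//book"))                              # number of book elements
xml_text(xml_find_all(doc, "/bookstore/book[last()-1]/title"))   # title of next-to-last book
xml_text(xml_find_all(doc, "//@lang"))                           # values of all lang attributes
xml_name(xml_find_all(doc, "//*[@*]"))                           # elements carrying any attribute
xml_text(xml_find_all(doc, "//title[@lang='en']"))               # English titles
```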

xml2 Functions

  • xml_structure() shows the structure of an XML document
  • xml_find_first() / xml_find_all() find the first / all nodes matching an XPath expression
  • xml_attr() retrieves a node attribute
  • xml_text() retrieves node content as text
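
A minimal sketch of these functions on a made-up two-book document (the `doc`, `first` and `every` names are illustrative). It also shows what happens when a requested attribute is absent.

```r
library(xml2)

# Made-up two-book document
doc <- read_xml('<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title></book>
  <book category="web"><title lang="en">Learning XML</title></book>
</bookstore>')

xml_structure(doc)                        # prints the nesting of elements

first <- xml_find_first(doc, "//book")    # a single node
every <- xml_find_all(doc, "//book")      # a node set with every match

xml_attr(first, "category")               # attribute of one node
xml_attr(every, "category")               # vectorised over the node set
xml_text(xml_find_all(doc, "//title"))    # text content of each title
xml_attr(first, "cover")                  # missing attribute gives NA
```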

Example

library(magrittr)  # provides the %>% pipe
bookstore %>% 
  xml_find_all( "//book[price<40]/author") %>% 
  xml_text()
## [1] "Giada De Laurentiis" "J K. Rowling"        "Erik T. Ray"

bookstore %>% 
  xml_find_all( "//book[@category='web']/title") %>% 
  xml_attr("lang")
## [1] "en" "en"

Web Data

  • Ways to extract information from the web
    • Dedicated interface
    • Scraping webpages

Web Scraping

  • Webpages are delivered as HTML documents
    • Browsers interpret the HTML and present the content
    • HTML is similar to XML and can be parsed in much the same way
<!DOCTYPE html>
<html>
<head>
  <title>STAA57 - Hierarchical & Web Data</title>
  <meta charset="utf-8">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...

Example

Web Scraping in R

  • rvest package offers similar functionality to xml2 for HTML

  • Common workflow
    • Read document tree with read_html()
    • Select nodes with html_nodes()
    • Extract info with html_text() / html_attr()
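
The workflow can be rehearsed offline on a literal HTML string before pointing it at a live page. The page below is made up, and the `job` class and link targets are assumptions for illustration. In rvest 1.0.0 and later, `html_nodes()` is also available under the name `html_elements()`.

```r
library(rvest)

# Made-up page standing in for a real job board
page <- read_html('<html><body>
  <a class="job" href="/jobs/1" title="Data Scientist">Data Scientist (Toronto)</a>
  <a class="job" href="/jobs/2" title="Data Team Lead">Data Team Lead (Toronto)</a>
</body></html>')

links <- html_nodes(page, xpath = "//a[@class='job']")   # select nodes
html_text(links)                                         # visible text of each node
html_attr(links, "title")                                # value of the title attribute
html_attr(links, "href")                                 # link targets
```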

Example

library(rvest)
URL = "https://www.workopolis.com/jobsearch/find-jobs?ak=data+scientist&l=Toronto%2C+ON"
read_html(URL) %>% 
  html_nodes(xpath = "//*[@class='JobCard-title']") %>% 
  html_attr("title") %>% head()
## [1] "Data Scientist - AI/ML"                              
## [2] "Data Scientist"                                      
## [3] "Data Team Lead"                                      
## [4] "Data Scientist, for Advanced Analytics with Big Data"
## [5] "Research Scientist, Google AI"                       
## [6] "Research Scientist, Google Brain"

Navigating HTML

CSS Selectors

  • HTML documents have standard elements & structure
  • CSS selectors are an easier way to navigate them
    expression    selects
    body          all <body> elements
    .title        all elements with class = 'title'
    #fname        the element with id = 'fname'
    [title]       all elements with a title attribute

(https://www.w3schools.com/csSref/css_selectors.asp)
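
Each selector in the table can be checked against a small inline page. This is a sketch only: the page content is made up for illustration.

```r
library(rvest)

# Made-up page containing one match for each kind of selector
page <- read_html('<html><body>
  <h1 class="title">Openings</h1>
  <p class="title" title="tooltip">Data Scientist</p>
  <input id="fname" type="text">
</body></html>')

length(html_nodes(page, css = "body"))                  # element selector
html_text(html_nodes(page, css = ".title"))             # class selector
html_attr(html_nodes(page, css = "#fname"), "type")     # id selector
html_attr(html_nodes(page, css = "[title]"), "title")   # attribute selector
```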

Example

  • Same as before, using CSS selector
read_html(URL) %>% 
  html_nodes(css = ".JobCard-title") %>% 
  html_attr("title") %>% head()
## [1] "Data Scientist - AI/ML"                              
## [2] "Data Scientist"                                      
## [3] "Data Team Lead"                                      
## [4] "Data Scientist, for Advanced Analytics with Big Data"
## [5] "Research Scientist, Google AI"                       
## [6] "Research Scientist, Google Brain"