- Understand hierarchical/tree data
- Read data in XML format
- Scrape data from the web
- Read data from HTML documents
- Readings
Sample XML
```xml
<CEO>
  <COO>
    <Op_Manager> ... </Op_Manager>
  </COO>
  <CMO> ... </CMO>
  <CFO> ... </CFO>
</CEO>
```
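- Elements are delimited by matching start and end tags: `<element_name> ... </element_name>`
- A start tag can carry attributes as name="value" pairs: `<person gender="female">`
- Element content can be data (text) or other elements, which is what gives XML its tree structure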
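
To make the element/attribute/content distinction concrete, here is a minimal sketch that parses two tiny inline documents with `xml2`. The documents are hypothetical stand-ins (the CEO hierarchy with its elided parts dropped, and a `person` element with an invented `name` child), not files from the course.

```r
library(xml2)

# Hypothetical inline document mirroring the CEO hierarchy (elided parts omitted)
org <- read_xml('<CEO><COO><Op_Manager/></COO><CMO/><CFO/></CEO>')

xml_name(org)                 # name of the root element
xml_name(xml_children(org))   # names of its direct children
xml_length(org)               # number of child elements

# Attributes live in the start tag; content is text or nested elements.
# The <name> child and its value are invented for illustration.
person <- read_xml('<person gender="female"><name>Ada</name></person>')
xml_attr(person, "gender")            # attribute value
xml_text(xml_child(person, "name"))   # text content of a child element
```

The root has three children even though `Op_Manager` sits deeper in the tree, because `xml_children()` only goes one level down.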
```xml
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  ...
</bookstore>
```
The `xml2` package for working with XML data

```r
library(xml2)
bookstore <- read_xml("./data/bookstore.xml")
class(bookstore)
## [1] "xml_document" "xml_node"
bookstore
## {xml_document}
## <bookstore>
## [1] <book category="cooking">\n <title lang="en">Everyday Italian</titl ...
## [2] <book category="children">\n <title lang="en">Harry Potter</title>\ ...
## [3] <book category="web">\n <title lang="en">XQuery Kick Start</title>\ ...
## [4] <book category="web" cover="paperback">\n <title lang="en">Learning ...
```
XPath expression | selects |
---|---|
`/bookstore` | root bookstore element |
`//book` | all book elements, anywhere in document |
`/bookstore/book[last()-1]` | next-to-last book element |
`//@lang` | all lang attributes |
`//*[@*]` | any element with any attribute |
`//title[@lang='en']` | all title elements with attribute lang='en' |
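
As a rough check of how these expressions behave, the sketch below evaluates them against a miniature bookstore. The document is a hypothetical stand-in with placeholder titles and languages, not the course's `bookstore.xml`. Note that selecting any element with an attribute needs the leading `//`, so that the attribute test applies at every depth rather than only to the root.

```r
library(xml2)

# Hypothetical miniature bookstore (titles and languages are placeholders)
doc <- read_xml('<bookstore>
  <book category="cooking"><title lang="en">Book A</title></book>
  <book category="children"><title lang="fr">Book B</title></book>
  <book category="web"><title lang="en">Book C</title></book>
</bookstore>')

xml_name(xml_find_all(doc, "/bookstore"))                  # root element
length(xml_find_all(doc, "//book"))                        # all book elements
xml_text(xml_find_all(doc, "/bookstore/book[last()-1]"))   # next-to-last book
xml_text(xml_find_all(doc, "//@lang"))                     # values of all lang attributes
xml_text(xml_find_all(doc, "//title[@lang='en']"))         # titles with lang='en'
xml_name(xml_find_all(doc, "//*[@*]"))                     # elements with any attribute
```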
`xml2` functions
- `xml_structure()` shows the structure of an XML document
- `xml_find_first()` / `xml_find_all()` find the first / all nodes matching an XPath expression
- `xml_attr()` retrieves a node attribute
- `xml_text()` retrieves node content as text
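```r
library(magrittr)  # provides the %>% pipe

# Authors of all books priced below 40
bookstore %>%
  xml_find_all("//book[price<40]/author") %>%
  xml_text()
## [1] "Giada De Laurentiis" "J K. Rowling" "Erik T. Ray"

# Language attribute of the titles of all 'web' books
bookstore %>%
  xml_find_all("//book[@category='web']/title") %>%
  xml_attr("lang")
## [1] "en" "en"
```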
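
Since the goal is to read XML into a form suitable for analysis, one possible next step is to combine these functions into a data frame with one row per book. This is only a sketch under assumptions: the two-book document and the name `book_df` are hypothetical, and the column choice is illustrative rather than the course's own approach.

```r
library(xml2)

# Hypothetical two-book stand-in for ./data/bookstore.xml
doc <- read_xml('<bookstore>
  <book category="cooking"><title>Book A</title><price>30.00</price></book>
  <book category="web"><title>Book B</title><price>49.99</price></book>
</bookstore>')

books <- xml_find_all(doc, "//book")

# One row per <book>: attributes via xml_attr(), child text via xml_find_first()
book_df <- data.frame(
  category = xml_attr(books, "category"),
  title    = xml_text(xml_find_first(books, "./title")),
  price    = as.numeric(xml_text(xml_find_first(books, "./price"))),
  stringsAsFactors = FALSE
)
book_df
```

Using `xml_find_first()` relative to each book (`./title`) keeps the columns aligned, because it returns exactly one result per book even when a child is missing.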
```html
<!DOCTYPE html>
<html>
  <head>
    <title>STAA57 - Hierarchical & Web Data</title>
    <meta charset="utf-8">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    ...
```
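The `rvest` package offers functionality similar to `xml2` for HTML:
- `read_html()` reads and parses an HTML document
- `html_nodes()` selects nodes using an XPath expression or a CSS selector
- `html_text()` / `html_attr()` retrieve a node's text / one of its attributes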
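
Because live pages change or disappear, it can help to try these functions on an inline HTML string first. The page below is a hypothetical example (its tags, classes and job titles are invented), so the sketch runs without a network connection.

```r
library(rvest)

# Hypothetical inline page, so the sketch runs offline
page <- read_html('<html><head><title>Demo page</title></head>
<body>
  <h1 id="top">Jobs</h1>
  <a class="job" title="Data Scientist" href="/jobs/1">Apply</a>
  <a class="job" title="Data Analyst" href="/jobs/2">Apply</a>
</body></html>')

# Text content of the <title> element
page %>% html_nodes(xpath = "//title") %>% html_text()

# Attributes of every <a> element with class 'job'
page %>% html_nodes(xpath = "//a[@class='job']") %>% html_attr("title")
page %>% html_nodes(xpath = "//a[@class='job']") %>% html_attr("href")
```

In rvest 1.0.0 and later, `html_nodes()` is superseded by `html_elements()`; the older name still works and is kept here to match the rest of these notes.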
```r
library(rvest)
URL <- "https://www.workopolis.com/jobsearch/find-jobs?ak=data+scientist&l=Toronto%2C+ON"
read_html(URL) %>%
  html_nodes(xpath = "//*[@class='JobCard-title']") %>%
  html_attr("title") %>%
  head()
## [1] "Data Scientist - AI/ML"
## [2] "Data Scientist"
## [3] "Data Team Lead"
## [4] "Data Scientist, for Advanced Analytics with Big Data"
## [5] "Research Scientist, Google AI"
## [6] "Research Scientist, Google Brain"
```
CSS selector | selects |
---|---|
`body` | all `<body>` elements |
`.title` | all elements with class = 'title' |
`#fname` | all elements with id = 'fname' |
`[title]` | all elements with title attribute |
```r
read_html(URL) %>%
  html_nodes(css = ".JobCard-title") %>%
  html_attr("title") %>%
  head()
## [1] "Data Scientist - AI/ML"
## [2] "Data Scientist"
## [3] "Data Team Lead"
## [4] "Data Scientist, for Advanced Analytics with Big Data"
## [5] "Research Scientist, Google AI"
## [6] "Research Scientist, Google Brain"
```
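
The four selector types can also be tried offline. The sketch below applies each one to a small hypothetical page (element contents, the `fname` value and the `title` tooltip are invented for illustration).

```r
library(rvest)

# Hypothetical inline page covering element, class, id and attribute selectors
page <- read_html('<html><body>
  <p class="title">First</p>
  <p class="title" title="tooltip">Second</p>
  <input id="fname" value="Ada">
</body></html>')

length(html_nodes(page, css = "body"))                  # element selector
html_text(html_nodes(page, css = ".title"))             # class selector
html_attr(html_nodes(page, css = "#fname"), "value")    # id selector
html_text(html_nodes(page, css = "[title]"))            # attribute selector
```

Only the second paragraph matches `[title]`, since that selector tests for the presence of a `title` attribute, not for the class named `title`.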