- Understand hierarchical/tree data
- Read data in XML format
- Scrape data from the web
- Read data from HTML documents
- Readings
Sample XML
<CEO>
  <COO>
    <Op_Manager> ... </Op_Manager>
  </COO>
  <CMO> ... </CMO>
  <CFO> ... </CFO>
</CEO>
Elements: <element_name> ... </element_name>
Attributes: <person gender="female">
Element content can be data or other elements
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  ...
</bookstore>
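To make the element / attribute / content distinction concrete, here is a minimal sketch that parses a one-book version of the sample from a literal string. The inline document and the names `doc` and `book` are assumptions for illustration; they are not part of the course data.

```r
library(xml2)

# Hypothetical inline document mirroring the sample above (one book only)
doc <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>')

xml_name(doc)                       # name of the root element
book <- xml_child(doc, 1)           # first child element of the root
xml_name(xml_children(book))        # names of the elements nested inside <book>
xml_attr(book, "category")          # value of an attribute on the <book> tag
xml_text(xml_child(book, "title"))  # data content of the <title> element
```

Each call moves one step through the tree: root, child element, nested elements, then an attribute or the text content.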
xml2 package for working with XML data
library(xml2)
bookstore = read_xml("./data/bookstore.xml")
class( bookstore )
## [1] "xml_document" "xml_node"
bookstore
## {xml_document}
## <bookstore>
## [1] <book category="cooking">\n <title lang="en">Everyday Italian</titl ...
## [2] <book category="children">\n <title lang="en">Harry Potter</title>\ ...
## [3] <book category="web">\n <title lang="en">XQuery Kick Start</title>\ ...
## [4] <book category="web" cover="paperback">\n <title lang="en">Learning ...
XPath expressions
| expression | selects |
|---|---|
| `/bookstore` | root bookstore element |
| `//book` | all book elements, anywhere in document |
| `/bookstore/book[last()-1]` | next-to-last book element |
| `//@lang` | all lang attributes |
| `//*[@*]` | any element with any attribute |
| `//title[@lang='en']` | all title elements with attribute lang='en' |
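The expressions can be tried on a small document. This is a sketch under stated assumptions: the three-book document, its one-letter titles, and the name `doc` are made up for illustration.

```r
library(xml2)

# Hypothetical three-book document used only for illustration
doc <- read_xml('<bookstore>
  <book category="cooking"><title lang="en">A</title></book>
  <book category="children"><title lang="fr">B</title></book>
  <book category="web"><title lang="en">C</title></book>
</bookstore>')

# All book elements, anywhere in the document
n_books <- length(xml_find_all(doc, "//book"))

# Next-to-last book element (the second of three here)
second_last <- xml_attr(xml_find_all(doc, "/bookstore/book[last()-1]"), "category")

# All lang attributes; xml_text() returns their values
langs <- xml_text(xml_find_all(doc, "//@lang"))

# Titles whose lang attribute equals 'en'
en_titles <- xml_text(xml_find_all(doc, "//title[@lang='en']"))

# Every element that carries at least one attribute
with_attr <- xml_name(xml_find_all(doc, "//*[@*]"))
```

Results come back in document order, so `with_attr` alternates between `book` and `title`.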
xml2 Functions
xml_structure() shows the structure of an XML document
xml_find_first() / xml_find_all() find the first / all nodes matching an XPath expression
xml_attr() retrieves a node attribute
xml_text() retrieves node content as text
library(magrittr)  # provides %>%, which xml2 does not export
bookstore %>%
xml_find_all( "//book[price<40]/author") %>%
xml_text()
## [1] "Giada De Laurentiis" "J K. Rowling" "Erik T. Ray"
bookstore %>%
xml_find_all( "//book[@category='web']/title") %>%
xml_attr("lang")
## [1] "en" "en"
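The same functions can be combined to turn a node set into a rectangular table. This is a minimal sketch, not the course's own method: the inline document stands in for `./data/bookstore.xml`, the second book and its price are made up, and `books` and `book_table` are hypothetical names.

```r
library(xml2)

# Hypothetical stand-in for ./data/bookstore.xml; the second book is invented
bookstore <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="web">
    <title lang="en">Example Book</title>
    <price>45.00</price>
  </book>
</bookstore>')

books <- xml_find_all(bookstore, "//book")

# One row per <book>; xml_find_first() returns one result per input node,
# so every column stays aligned with `books`
book_table <- data.frame(
  category = xml_attr(books, "category"),
  title    = xml_text(xml_find_first(books, "./title")),
  price    = as.numeric(xml_text(xml_find_first(books, "./price")))
)
book_table
```

Using `xml_find_first()` relative to each book (the leading `./`) avoids the misalignment that can occur when a child element is missing from some books.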
<!DOCTYPE html>
<html>
  <head>
    <title>STAA57 - Hierarchical & Web Data</title>
    <meta charset="utf-8">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    ...
rvest package offers similar functionality to xml2 for HTML
read_html()
html_nodes()
html_text/attr()
library(rvest)
URL = "https://www.workopolis.com/jobsearch/find-jobs?ak=data+scientist&l=Toronto%2C+ON"
read_html(URL) %>%
html_nodes(xpath = "//*[@class='JobCard-title']") %>%
html_attr("title") %>% head()
## [1] "Data Scientist - AI/ML"
## [2] "Data Scientist"
## [3] "Data Team Lead"
## [4] "Data Scientist, for Advanced Analytics with Big Data"
## [5] "Research Scientist, Google AI"
## [6] "Research Scientist, Google Brain"
CSS selectors
| expression | selects |
|---|---|
| `body` | all <body> elements |
| `.title` | all elements with class = 'title' |
| `#fname` | all elements with id = 'fname' |
| `[title]` | all elements with title attribute |
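The four selectors can be tried offline on a literal HTML string. This is a sketch under stated assumptions: the page content and the name `page` are made up for illustration.

```r
library(rvest)

# Hypothetical page used only to exercise the selectors in the table
page <- read_html('<html><body>
  <h1 class="title">Job listings</h1>
  <input id="fname" type="text">
  <a class="title" title="Data Scientist" href="#">Apply</a>
</body></html>')

# Element selector: every <body> element
n_body <- page %>% html_nodes(css = "body") %>% length()

# Class selector: elements whose class is "title"
title_text <- page %>% html_nodes(css = ".title") %>% html_text()

# Id selector: the element with id "fname"
fname_type <- page %>% html_nodes(css = "#fname") %>% html_attr("type")

# Attribute selector: elements that have a title attribute
title_attr <- page %>% html_nodes(css = "[title]") %>% html_attr("title")
```

Note that `.title` matches on the class attribute while `[title]` matches on the title attribute, so they select different elements here.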
read_html(URL) %>%
html_nodes(css = ".JobCard-title") %>%
html_attr("title") %>% head()
## [1] "Data Scientist - AI/ML"
## [2] "Data Scientist"
## [3] "Data Team Lead"
## [4] "Data Scientist, for Advanced Analytics with Big Data"
## [5] "Research Scientist, Google AI"
## [6] "Research Scientist, Google Brain"
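The XPath and CSS versions return the same result on that page, but they are not equivalent in general. The sketch below, which uses a made-up page with hypothetical job cards, shows the difference when an element carries more than one class.

```r
library(rvest)

# Hypothetical job cards; the second one carries two classes
page <- read_html('<html><body>
  <a class="JobCard-title" title="Data Scientist">view</a>
  <a class="JobCard-title featured" title="Data Team Lead">view</a>
</body></html>')

# XPath compares the whole class string, so only the first card matches
by_xpath <- page %>%
  html_nodes(xpath = "//*[@class='JobCard-title']") %>%
  html_attr("title")

# The CSS class selector matches any element whose class list includes the name
by_css <- page %>%
  html_nodes(css = ".JobCard-title") %>%
  html_attr("title")
```

When a site attaches several classes to its elements, the CSS form is usually the more robust choice.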