Goal: Extract XML/HTML data from the web.
The City of Toronto’s Open Data portal provides data in various formats, both tabular (CSV, XLS) and hierarchical (XML, JSON). For the next questions, we will look at data on festivals & events that appear on the city’s calendar (see here for more details).
- Historical data on the city’s festivals and events are provided in the XML file: https://www.toronto.ca/ext/open_data/catalog/data_set_files/Festivals_and_Events_v9_fromArchivedb.xml
Use the `xml2` package to read the file into R, and use `xml_child()` and `xml_structure()` to print the structure of the first child element of the XML document.
The structure of the XML document is shown below:
The root node <viewentries> contains several <viewentry> elements, where each <viewentry> element represents a festival or event. Within each <viewentry> there are multiple <entrydata> elements whose name attribute gives the field/variable name, and within each <entrydata> there are one or more <text> elements containing the field value(s).
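To make the layout described above concrete, here is a minimal sketch on a small made-up document that mirrors it. The event names and locations in `toy_xml` are invented for illustration; for the real data, the file URL given earlier would be passed to `read_xml()` instead.

```r
library(xml2)

# Toy document (made up) that mirrors the described layout
toy_xml <- '<viewentries>
  <viewentry>
    <entrydata name="EventName"><text>Winter Market</text></entrydata>
    <entrydata name="Location"><text>Nathan Phillips Square</text></entrydata>
  </viewentry>
  <viewentry>
    <entrydata name="EventName"><text>Jazz Night</text></entrydata>
    <entrydata name="Location"><text>Harbourfront Centre</text></entrydata>
  </viewentry>
</viewentries>'

doc <- read_xml(toy_xml)            # parse the literal XML string
first_entry <- xml_child(doc, 1)    # first <viewentry> under the root
xml_structure(first_entry)          # print the nested tag outline
```

`xml_structure()` prints an indented outline of tags and attributes, which is usually enough to decide which XPath expressions to write next.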
- Extract all event names, i.e. all <text> values from <entrydata> elements with attribute name='EventName'.
- Extract all event names and locations (name='Location') and combine them in a single data frame. Then find the distinct names of all events whose location is 'Nathan Phillips Square'.
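One possible approach to the two extraction steps above, sketched on a made-up document with the same layout (the entries in `toy_xml` are invented). It assumes each `<entrydata>` holds a single `<text>` value of interest; where there are several, `xml_find_first()` keeps only the first one per event.

```r
library(xml2)

# Toy document (made up) with a repeated event at the same location
toy_xml <- '<viewentries>
  <viewentry>
    <entrydata name="EventName"><text>Winter Market</text></entrydata>
    <entrydata name="Location"><text>Nathan Phillips Square</text></entrydata>
  </viewentry>
  <viewentry>
    <entrydata name="EventName"><text>Jazz Night</text></entrydata>
    <entrydata name="Location"><text>Harbourfront Centre</text></entrydata>
  </viewentry>
  <viewentry>
    <entrydata name="EventName"><text>Winter Market</text></entrydata>
    <entrydata name="Location"><text>Nathan Phillips Square</text></entrydata>
  </viewentry>
</viewentries>'
doc <- read_xml(toy_xml)

# All event names in a single XPath query
event_names <- xml_text(
  xml_find_all(doc, "//entrydata[@name='EventName']/text")
)

# Per-entry lookups keep names and locations aligned row by row
entries <- xml_find_all(doc, "//viewentry")
events <- data.frame(
  name     = xml_text(xml_find_first(entries, "./entrydata[@name='EventName']/text")),
  location = xml_text(xml_find_first(entries, "./entrydata[@name='Location']/text")),
  stringsAsFactors = FALSE
)

# Distinct event names at one location; %in% treats missing locations as FALSE
nps_events <- unique(events$name[events$location %in% "Nathan Phillips Square"])
```

Querying within each `<viewentry>` rather than over the whole document avoids misaligned columns when an event lacks one of the two fields.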
We are now going to extract data on “Data Scientist” jobs from Workopolis. Notice that every result has a full description that can be presented on the right side of the page.
The HTML element for each job title contains a link to this description as an href attribute:
<a class="JobCard-titleLink" ... href="/jobsearch/viewjob/PNCqvtT4md..." >
... </a>
- Read the HTML document of all “Data Scientist” positions in Toronto, and extract the URL of each full job description; save the results in a character vector.
- Open and inspect the first link in your web browser; note that you have to add “https://www.workopolis.com” at the beginning of each href string. Then, read the HTML document of this link in R, and extract the job description text.
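The link extraction and URL completion described above could look roughly like the sketch below. It runs on made-up HTML strings so that it works offline: only the `JobCard-titleLink` class and the site prefix come from the worksheet, while the href values and the `job-description` class are hypothetical placeholders. Inspect the live page to find the element that actually holds the description.

```r
library(xml2)

# Toy search-results page (made up hrefs)
results_html <- '<html><body>
  <a class="JobCard-titleLink" href="/jobsearch/viewjob/abc123">Data Scientist</a>
  <a class="JobCard-titleLink" href="/jobsearch/viewjob/def456">Senior Data Scientist</a>
  <a class="other" href="/about">About</a>
</body></html>'

results_page <- read_html(results_html)  # on the live site, pass the search URL
title_links <- xml_find_all(results_page, "//a[contains(@class, 'JobCard-titleLink')]")
hrefs <- xml_attr(title_links, "href")   # character vector of relative links

# Relative links need the site prefix before they can be read
job_urls <- paste0("https://www.workopolis.com", hrefs)

# Toy job page; "job-description" is a hypothetical class name
job_html <- '<html><body><div class="job-description">
  <p>Build models in R.</p><p>Python is an asset.</p>
</div></body></html>'
job_page <- read_html(job_html)          # on the live site: read_html(job_urls[1])
description <- xml_text(xml_find_first(job_page, "//div[@class='job-description']"))
```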
- Go over all job links (use a `for` loop) and extract all job descriptions. Compare the number of times that “R” vs “Python” are mentioned in the descriptions to find the most popular language (use string functions & regular expressions).
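For the counting part of the step above, a minimal sketch with base R string functions follows. The `descriptions` vector is made up and stands in for the text scraped from each job link. The pattern `\\bR\\b` matches a standalone capital R, so words such as "Research" are not counted, although strings like "R&D" still would be.

```r
# Made-up descriptions standing in for the scraped text
descriptions <- c(
  "Experience with R and Python required. R Shiny is a plus.",
  "Strong Python skills; PYTHON scripting for ETL.",
  "Research background, SQL, Tableau."
)

r_count <- 0
python_count <- 0
for (txt in descriptions) {
  # Standalone capital R (case-sensitive, word boundaries on both sides)
  r_hits <- regmatches(txt, gregexpr("\\bR\\b", txt, perl = TRUE))[[1]]
  # "Python" in any capitalisation
  py_hits <- regmatches(
    txt, gregexpr("\\bpython\\b", txt, ignore.case = TRUE, perl = TRUE)
  )[[1]]
  r_count <- r_count + length(r_hits)
  python_count <- python_count + length(py_hits)
}

c(R = r_count, Python = python_count)
```

Counting every mention is one choice; counting the number of postings that mention each language at least once (with `grepl()`) is an equally reasonable reading of "most popular".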
- [EXTRA] Note that the webpage (https://www.workopolis.com/jobsearch/find-jobs?ak=data+scientist&l=Toronto%2C) only shows the first 29 search results, but there are more results pages that you can visit through the links at the bottom.
So, if you want to collect information for all job postings, you have to visit each results page separately. You can access different results pages by adding a simple string to the URL; e.g. to access the 3rd results page, add `&pn=3` at the end of the web address (https://www.workopolis.com/jobsearch/find-jobs?ak=data+scientist&l=Toronto%2C+ON&pn=3). Use this approach, together with a `while()` loop, to collect the job titles and companies of all search results.
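The paging logic described above can be sketched as a `while()` loop that stops at the first page without results. To stay runnable offline, the sketch reads made-up pages through a hypothetical helper `fetch_page()`; on the live site that helper would build the URL with `&pn=` and call `read_html()` on it. The `JobCard` and `JobCard-company` class names are assumptions, as is the premise that a page past the last one contains no job cards.

```r
library(xml2)

# Made-up results pages; the third one has no job cards
toy_pages <- list(
  '<html><body>
     <div class="JobCard"><a class="JobCard-titleLink" href="/jobsearch/viewjob/a1">Data Scientist</a><span class="JobCard-company">Acme</span></div>
     <div class="JobCard"><a class="JobCard-titleLink" href="/jobsearch/viewjob/a2">ML Engineer</a><span class="JobCard-company">Globex</span></div>
   </body></html>',
  '<html><body>
     <div class="JobCard"><a class="JobCard-titleLink" href="/jobsearch/viewjob/a3">Data Analyst</a><span class="JobCard-company">Initech</span></div>
   </body></html>',
  '<html><body><p>No results</p></body></html>'
)

# Hypothetical helper; on the live site use something like
# read_html(paste0(search_url, "&pn=", pn)) and pause between requests
fetch_page <- function(pn) read_html(toy_pages[[pn]])

all_jobs <- data.frame(title = character(), company = character(),
                       stringsAsFactors = FALSE)
max_pages <- 50  # arbitrary safety cap so the loop always terminates
pn <- 1
while (pn <= max_pages) {
  page <- fetch_page(pn)
  cards <- xml_find_all(page, "//div[@class='JobCard']")
  if (length(cards) == 0) break  # no more results: stop paging

  page_jobs <- data.frame(
    title = xml_text(
      xml_find_first(cards, ".//a[contains(@class, 'JobCard-titleLink')]"),
      trim = TRUE
    ),
    company = xml_text(
      xml_find_first(cards, ".//span[@class='JobCard-company']"),
      trim = TRUE
    ),
    stringsAsFactors = FALSE
  )
  all_jobs <- rbind(all_jobs, page_jobs)
  pn <- pn + 1
}
```

A `while()` loop fits here because the number of results pages is not known in advance; the cap on `pn` guards against a site that keeps returning results indefinitely.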