Help with the html_nodes() function of rvest

Hi, my name is Alejandro Pereira, research assistant at the Economic and Social Research Institute of the Universidad Católica Andrés Bello, Venezuela. I'm writing an algorithm in R to extract data from LinkedIn profiles, apply text mining, and identify the skills being developed for the labor market.

I am using the rvest library in R: I enter a keyword (for example: django) in LinkedIn, get the resulting search URL, and pass it to read_html(). Analyzing the HTML structure, I want to extract the information from the div#ember5 node.

When I pass the selector to the html_nodes() function, it does not find the node.

library(rvest)
library(xml2)

# Read the search results page and look for the div with id "ember5"
html <- read_html("https://www.linkedin.com/search/results/people/?keywords=django&origin=SWITCH_SEARCH_VERTICAL")
content <- html_nodes(html, "div#ember5")
content
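
Since the post mentions an XPath but the code above uses a CSS selector, for reference these two queries would target the same element (just a sketch of the two interchangeable selector arguments of html_nodes()):

content <- html_nodes(html, css = "div#ember5")
content <- html_nodes(html, xpath = "//div[@id = 'ember5']")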

When I list all the div nodes, I can see that the div#ember5 node is not among them.

html <- read_html("https://www.linkedin.com/search/results/people/?keywords=django&origin=SWITCH_SEARCH_VERTICAL")
content <- html_nodes(html, class = "div")  # note: html_nodes() has no class argument; see the correction below
content

I don't understand why; if anyone can help or explain, I'd appreciate it.

Hey Alejandro,

When I download the HTML at that URL with download_html() and manually search with Ctrl+F, I don't see any instances of the string 'ember-view' or 'ember5', so that might be why html_nodes(html, "div#ember5") isn't finding anything.
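
In case it's useful, that manual Ctrl+F check can also be reproduced from R; the tempfile and grepl() call below are just one way to do it:

library(xml2)

# Save the raw HTML to disk, then search it for the id, like a manual Ctrl+F
tmp <- tempfile(fileext = ".html")
download_html("https://www.linkedin.com/search/results/people/?keywords=django&origin=SWITCH_SEARCH_VERTICAL", file = tmp)
any(grepl("ember5", readLines(tmp, warn = FALSE), fixed = TRUE))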

Unfortunately, I'm not sure why you would see that in your browser's view-source but not in the downloaded HTML (someone more familiar with that LinkedIn page would probably be able to explain!).

Also, just in case this comes up, I think the syntax for selecting all div elements would be html_nodes(html, css = "div") instead of html_nodes(html, class = "div").
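
To illustrate the difference (a minimal sketch, reusing the html object read above):

# html_nodes() takes a css or xpath argument, not class:
divs <- html_nodes(html, css = "div")  # all <div> elements
# html_nodes(html, class = "div")      # error: unused argument (class)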

Hope this helps,


Hi Gabriel, I haven't been able to solve the problem yet.

You're right, div#ember5 doesn't appear in the HTML that gets downloaded; the information really is there in the nodes. Even so, when I inspect the nodes, some of them are hidden. I think it's definitely something in LinkedIn's code.
On the other hand, the code above does have an error. It should be html_nodes(html, css = "div").
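
One quick way to double-check that from R (a sketch; it just lists the id attributes of the div nodes that actually come back):

# List the ids of the divs really present in the downloaded HTML
ids <- html_attr(html_nodes(html, "div"), "id")
unique(ids[!is.na(ids)])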

I'm working on an algorithm that will let me scrape the web to study the skills being required in the labor market. When I get the first results, I will gladly share them with you.

Thank you for your help,

I think you need to be more cautious about what you are allowed to do when scraping.
It seems the path you want to scrape is not allowed to be scraped:

robotstxt::paths_allowed("https://www.linkedin.com/search/results/people/?keywords=django&origin=SWITCH_SEARCH_VERTICAL")
#>  www.linkedin.com
#> No encoding supplied: defaulting to UTF-8.
#> [1] FALSE

library(polite)
url <- "https://www.linkedin.com"
session <- bow(url)
#> No encoding supplied: defaulting to UTF-8.
session %>%
  nod(path = "search/results/people/?keywords=django&origin=SWITCH_SEARCH_VERTICAL")
#> <polite session> https://www.linkedin.com/search/results/people/?keywords=django&origin=SWITCH_SEARCH_VERTICAL
#>      User-agent: polite R package - https://github.com/dmi3kno/polite
#>      robots.txt: 1831 rules are defined for 33 bots
#>     Crawl delay: 5 sec
#>   The path is not scrapable for this user-agent

Created on 2019-08-31 by the reprex package (v0.3.0)

polite is a nice package that wraps rvest to do scraping in a responsible way. See the presentation at useR! 2019 (slides) and the website.
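
For comparison, this is roughly what the polite workflow looks like on a host whose robots.txt permits it (the URL and selector below are purely illustrative):

library(polite)
library(rvest)

# bow() reads robots.txt once; scrape() then honors it and the crawl delay
session <- bow("https://www.r-project.org")
page <- scrape(session)
html_text(html_nodes(page, "h1"))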

Other resources on this topic:

If you want to access LinkedIn data, I think you should use the API.

Hope it helps


Yes, you've helped me a lot. Thank you

I did not know this; it is very important to keep in mind in order to make responsible use of web scraping from R.

On the other hand, could you point me to tutorials or guides for using the LinkedIn API from R?

Thank you very much.

There is this CRAN package, but it is a little old.

You'll find a demo in the httr package that is more recent and shows how to connect. The source is here.

You'll find it under demo(package = "httr") in your R session. After getting the tokens, using the API by following the developer docs should be easy, I guess; just find the endpoints you want.
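
In case it helps, the demo has roughly this shape (the key and secret are placeholders you get by registering an app on the LinkedIn developer site, and the API path shown is illustrative, not taken from the demo):

library(httr)

# OAuth settings for LinkedIn ship with httr
endpoint <- oauth_endpoints("linkedin")

# Credentials from your registered LinkedIn app (placeholders)
app <- oauth_app("linkedin", key = "YOUR_CLIENT_ID", secret = "YOUR_CLIENT_SECRET")

# Opens a browser to authorize and caches the token locally
token <- oauth2.0_token(endpoint, app)

# Call an endpoint with the token
req <- GET("https://api.linkedin.com/v2/me", config(token = token))
stop_for_status(req)
content(req)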

