Hi everyone! First time posting here. I have a vector of URLs, and I'm hoping to scrape each of them and get a vector - not a list - of the results. Anyway, here's a small portion of the URL vector:
My end goal is to add a column of the results to a dataframe with those URLs, but for now I have just been playing around with a vector of the URLs to learn. Any insight is appreciated! Thanks!
Thanks so much for responding! That seems like a promising approach! I should have mentioned this in the previous post: not all of the URLs are going to have the ".file.size" class, but I'll still want to include a result for them in the new column. However, when I run this on the full links_short object...
... I get this error: "Error: Column content must be length 1 (the group size), not 0"
It works fine when I run it on just the element that has that class, but not those without. Is there a way to do this when not all URLs will have this feature, or is that going to be a hard roadblock? Or maybe there's a way to first filter out URLs without it? Anyway, again really appreciate the help!
html_node vs html_nodes
html_node is like [[: it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length, whereas the length of html_nodes might be longer or shorter.
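To make the difference concrete, here is a minimal sketch that runs offline. The two HTML snippets and the variable names are made up for illustration (they are not from your links_short), and I am assuming the selector is ".file-size":

```r
library(rvest)

# Two tiny in-memory pages (hypothetical markup): one has the class, one does not.
with_size    <- read_html('<html><body><p class="file-size">12 MB</p></body></html>')
without_size <- read_html('<html><body><p class="other">no size here</p></body></html>')

# html_nodes() returns every match, so a page without the class gives length 0.
n_all_with    <- length(html_nodes(with_size, ".file-size"))
n_all_without <- length(html_nodes(without_size, ".file-size"))

# html_node() always gives exactly one result per page; when nothing matches,
# the result is a missing node and html_text() turns it into NA.
txt_with    <- html_text(html_node(with_size, ".file-size"))
txt_without <- html_text(html_node(without_size, ".file-size"))
```

The length-0 result from html_nodes is most likely what triggers the "must be length 1 (the group size), not 0" error, since each row needs exactly one value.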
I assume in your case you can be sure that each HTML page has no more than one file-size node.
Each URL without the file-size class will produce an NA. All others will return the size.
I am not sure whether you prefer NAs or 0s for missing values. I added the conversion to 0 (and to double); remove the %>% replace_na(0) part to keep NAs.
I added an ungroup() in case you want to keep working with the data and are not familiar with rowwise tibbles.
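Putting the pieces together, the whole pipeline could look roughly like the sketch below. It uses literal HTML strings in place of real URLs so it runs without a network connection (read_html accepts URLs as well). The pages vector, the get_file_size helper, and the size column are hypothetical names, and I am assuming the node text is a plain number; real pages may need extra parsing:

```r
library(rvest)
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for links_short: literal HTML instead of live URLs.
pages <- c(
  '<html><body><span class="file-size">12.5</span></body></html>',
  '<html><body><span class="other">nothing</span></body></html>'
)

# Hypothetical helper: one page in, one double out (NA when the class is absent).
get_file_size <- function(page) {
  read_html(page) %>%
    html_node(".file-size") %>%  # exactly one result, possibly a missing node
    html_text() %>%              # a missing node becomes NA_character_
    as.numeric()                 # assumes the text is a bare number
}

result <- tibble(url = pages) %>%
  rowwise() %>%                          # evaluate mutate() one row at a time
  mutate(size = get_file_size(url)) %>%
  ungroup() %>%                          # back to an ordinary tibble
  mutate(size = replace_na(size, 0))     # drop this line to keep NAs
```

Because html_node always yields one value per page, size ends up as an ordinary double column (a vector, not a list) with one entry per URL.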
This looks even more promising. Can't wait to try it! Yeah, I was unfamiliar with rowwise, and clearly need to do some reading up on it. Many thanks again! This is great!