Loop FOR with Rvest for scraping

Pidroz · March 11, 2020, 9:43am

Hello folks,
I would like to write a for loop to scrape (with rvest) the H1 tag from a list of URLs.
I tried the following code, but it doesn't work.
Can somebody help me?

Thanks!
library(rvest)
library(readr)
library(tidyverse)
library(XML)
library(httr)

#URLs list loading
urls <- c("Coronavirus : Actualités, vidéos, images et infos en direct - 20 Minutes")

#I create an empty list
tbl <- list()

#I start the for loop
for (i in 1:length(urls)) {
  tbl[[i]] <- urls[[i]] %>% # tbl[[i]] assigns each H1 from urls as an element in the tbl list
    read_html() %>%
    html_nodes("h1") %>%
    html_text() %>%
    if (dim(tbl)[i] == 0){
      i = i+1
    }}
tbl

Error message in Console
Error in if (.) dim(tbl)[i] == 0 else { :
the argument cannot be interpreted as a logical value

pieterjanvc · March 11, 2020, 11:23am

Hi,

Your code looks confusing and I can't follow the process. Could you please provide us with a reprex? A reprex consists of the minimal code and data needed to recreate the issue or question you're having. You can find instructions on how to build and share one here:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

Good luck,
PJ

Pidroz · March 12, 2020, 8:31am

Hello @pieterjanvc,
Thank you for your reply.
I made some modifications to clarify my code.
Could you please tell me if you need anything else?

Thanks,
Pierre

andresrcs · March 13, 2020, 3:01am

There is no need to manually update the index (and the syntax is also wrong); if you remove that part, your code works. However, I would like to propose this solution instead of a for loop:

library(tidyverse)
library(rvest)

urls <- c("https://www.20minutes.fr/dossier/coronavirus","https://www.20minutes.fr/economie/")

tbl <- map(urls, ~ {
    .x %>%
        read_html() %>%
        html_nodes("h1") %>%
        html_text()
})

tbl
#> [[1]]
#> [1] "Coronavirus"
#> 
#> [[2]]
#> [1] "Économie"

Created on 2020-03-13 by the reprex package (v0.3.0.9001)
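
For completeness, here is a minimal sketch of what the for-loop version might look like once the manual index update and the trailing pipe into `if` are removed. The inline HTML strings in `pages` are hypothetical stand-ins for the real URLs (`read_html()` also accepts literal HTML), used only so the sketch runs without network access; with real URLs the loop body would stay the same.

```r
library(rvest)

# Hypothetical stand-ins for real URLs: read_html() also accepts literal HTML,
# so this sketch does not need network access.
pages <- c(
  "<html><body><h1>First page</h1></body></html>",
  "<html><body><h1>Second page</h1></body></html>",
  "<html><body><p>No heading here</p></body></html>"
)

# Pre-allocate the result list, one slot per page
tbl <- vector("list", length(pages))

# seq_along() advances i on its own; no manual i = i + 1 is needed
for (i in seq_along(pages)) {
  tbl[[i]] <- pages[[i]] %>%
    read_html() %>%
    html_nodes("h1") %>%
    html_text()
}

tbl
```

A page without any `h1` yields an empty character vector for its slot, so the loop does not need a special case to skip it.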

system · April 3, 2020, 3:01am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.