How to clean up the output of rvest?

Rekha_Verma · January 24, 2023, 4:55pm

I am trying to fetch the links from a website by web scraping in RStudio Cloud, using the code below. But the result in the console shows the website name twice, each time joined with NA. How do I remove this NA from the website name?

install.packages("rvest")
install.packages("dplyr")
library(rvest)
library(dplyr)

link = "https://tu■■■■a.info/"
page = read_html(link)
website_links = page %>% html_nodes("h1")%>% html_attr("href") %>% paste("http://www.tu■■■■a.info",.,sep="")
website_links


> website_links
[1] "http://www.tu■■■■a.infoNA" "http://www.tu■■■■a.infoNA"

scottyd22 · January 24, 2023, 9:52pm

Welcome to the community @Rekha_Verma! I am unable to see the website you are trying to scrape. I'm not sure if it is blurred intentionally, but can you share the link again?

It looks like page %>% html_nodes("h1") %>% html_attr("href") is returning a vector with two NA values, which is why NA is being added in your paste statement (as shown below).

c(NA, NA) %>% paste("http://www.tu■■■■a.info",.,sep="")
#> [1] "http://www.tu■■■■a.infoNA" "http://www.tu■■■■a.infoNA"
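
To make this concrete, here is a minimal base R sketch of one way to avoid the stray NA: drop the missing values before pasting. The names `hrefs` and `base_url` and the example.org address are placeholders I made up for illustration, not the real site.

```r
# Hypothetical values standing in for the result of html_attr("href")
hrefs <- c(NA, "/about-us/", NA)
base_url <- "http://example.org"  # placeholder, not the real site

# paste0() turns NA into the string "NA" before joining
with_na <- paste0(base_url, hrefs)

# Dropping missing values first avoids the stray "NA"
clean <- paste0(base_url, hrefs[!is.na(hrefs)])
clean
#> [1] "http://example.org/about-us/"
```

This only hides the symptom, though. If every value is NA, the selector is probably picking elements that carry no href at all.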

If you are able to share the link, then I/we can troubleshoot further.

Rekha_Verma · January 24, 2023, 11:27pm

Hi, thank you for the reply. I am trying to send you the website link: https://tu■■■■a.info/
but again, a few letters are not visible.

williaml · January 24, 2023, 11:44pm

Hi, are you after this?

page %>% 
  html_nodes("a") %>%
  html_attr("href") 


#  [1] "https://tu■■■■a.info"
#  [2] "https://www.facebook.com/Tu■■■■a.Meditation.Centre"
#  [3] "https://www.youtube.com/user/Tu■■■■aMcLeodGanj"
#  [4] "https://tu■■■■a.info/"
#  [5] "https://tu■■■■a.info/about-us/"
#  [6] "https://tu■■■■a.info/about-us/"
#  [7] "https://tu■■■■a.info/about-us/our-spiritual-guides/"
#  [8] "https://tu■■■■a.info/about-us/holy-objects-at-tu■■■■a/"
#  [9] "https://tu■■■■a.info/about-us/history-of-tu■■■■a/"
# [10] "https://tu■■■■a.info/about-us/board-of-directors/"

I assume the link is censored here because the domain name contains sh**.

Rekha_Verma · January 24, 2023, 11:48pm

Thanks; I am new to the R language. Can we still solve the actual problem with a censored link?

williaml · January 24, 2023, 11:49pm

I think the censoring is just on this forum. It isn't an issue within R itself.

Rekha_Verma · January 24, 2023, 11:57pm

OK. I am getting the website's name twice, with NA appended to it, because of the censoring on this forum. Can we still resolve it despite the forum issue?

williaml · January 24, 2023, 11:59pm

Your code

website_links = page %>% html_nodes("h1")%>% html_attr("href") %>% paste("http://www.tu■■■■a.info",.,sep="")
website_links

doesn't pull the right links. NA is not a link.

The h1 values are not links on the website:

page %>% 
 html_nodes("h1")

# {xml_nodeset (2)}
# [1] <h1>Tu■■■■a Meditation Centre</h1>
# [2] <h1>Tu■■■■a Meditation Centre</h1>
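
As a rough sketch of the difference, the snippet below parses a small inline HTML string instead of the live site. The markup, the example.org URL, and the variable names are all made up for illustration, and the real page's structure may differ.

```r
library(rvest)

# Hypothetical stand-in page; the real site's markup may differ
page <- read_html('<html><body>
  <h1>Example Centre</h1>
  <a href="https://example.org/about-us/">About us</a>
  <a>Anchor without a target</a>
</body></html>')

# <h1> has no href attribute, so html_attr() returns NA
h1_href <- page %>% html_nodes("h1") %>% html_attr("href")

# The heading text comes from html_text(), not html_attr()
h1_text <- page %>% html_nodes("h1") %>% html_text()

# Links live on <a> elements; a[href] skips anchors without a target
links <- page %>% html_nodes("a[href]") %>% html_attr("href")
```

Here `h1_href` is NA, `h1_text` holds the heading, and `links` holds only the one real URL. These hrefs are already absolute, so nothing needs to be pasted in front of them.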

Rekha_Verma · January 25, 2023, 12:09am

Great, it works. I used the code for the links instead of the heading text. Thank you so much.

technocrat · January 25, 2023, 9:28am

Or, if you have already pulled a lot of links:

link <- "http://www.tu■■■■a.infoNA"
gsub("NA","",link)
#> [1] "http://www.tu■■■■a.info"

Created on 2023-01-25 with reprex v2.0.2
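
One caveat, sketched below in base R: an unanchored pattern removes every "NA" in the string, including any that belong to the URL itself. Anchoring the pattern to the end of the string with `$` is a possible safer variant. The URL here is a hypothetical one chosen to show the difference.

```r
# Hypothetical URL that legitimately contains "NA" in its path
link <- "http://example.org/NASA-newsNA"

# Unanchored: strips every "NA", damaging the path
gsub("NA", "", link)
#> [1] "http://example.org/SA-news"

# Anchored with $: strips only a trailing "NA"
fixed <- sub("NA$", "", link)
fixed
#> [1] "http://example.org/NASA-news"
```

A URL that genuinely ends in "NA" would still be altered, so fixing the selector upstream remains the more reliable route.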