How to clean up the output of rvest?

I am trying to fetch the links from one website through web scraping, using RStudio Cloud with the code below. But the console prints the website address twice, each time with NA appended to it. How can I remove this NA from the website address?

install.packages("rvest")
install.packages("dplyr")
library(rvest)
library(dplyr)

link = "https://tu■■■■a.info/"
page = read_html(link)
website_links = page %>% html_nodes("h1")%>% html_attr("href") %>% paste("http://www.tu■■■■a.info",.,sep="")
website_links


> website_links
[1] "http://www.tu■■■■a.infoNA" "http://www.tu■■■■a.infoNA"

Welcome to the community @Rekha_Verma! I am unable to see the website you are trying to scrape. I'm not sure if it is blurred intentionally, but can you share the link again?

It looks like page %>% html_nodes("h1") %>% html_attr("href") is returning a vector with two NA values, which is why NA is being appended in your paste statement (as shown below).

c(NA, NA) %>% paste("http://www.tu■■■■a.info",.,sep="")
#> [1] "http://www.tu■■■■a.infoNA" "http://www.tu■■■■a.infoNA"
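
As a side note, paste() turns a missing value into the text "NA", so one way to avoid this is to drop the missing values before pasting. A minimal sketch, assuming a made-up vector of href values and a placeholder base URL (neither comes from the real site):

```r
# Hypothetical result of html_attr("href"); NA marks a node without an href
hrefs <- c("/about-us/", NA, "/contact/")

# Placeholder base URL, used only for illustration
base <- "https://example.com"

# Keep only the non-missing hrefs, then prepend the base URL
links <- paste0(base, hrefs[!is.na(hrefs)])
links
```

Filtering first means no "NA" text ever ends up in the pasted strings.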

If you are able to share the link, then I/we can troubleshoot further.


Hi, thank you for the reply. I am trying to send you the website link: https://tu■■■■a.info/
but again, a few letters are not visible.

Hi, are you after this?

page %>% 
  html_nodes("a") %>%
  html_attr("href") 


# [1] "https://tu■■■■a.info"                                                                                 
# [2] "https://www.facebook.com/Tu■■■■a.Meditation.Centre"                                                   
# [3] "https://www.youtube.com/user/Tu■■■■aMcLeodGanj"                                                       
# [4] "https://tu■■■■a.info/"                                                                                
# [5] "https://tu■■■■a.info/about-us/"                                                                       
# [6] "https://tu■■■■a.info/about-us/"                                                                       
# [7] "https://tu■■■■a.info/about-us/our-spiritual-guides/"                                                  
# [8] "https://tu■■■■a.info/about-us/holy-objects-at-tu■■■■a/"                                               
# [9] "https://tu■■■■a.info/about-us/history-of-tu■■■■a/"                                                    
# [10] "https://tu■■■■a.info/about-us/board-of-directors/"      
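
The output above contains repeated and external links. If only the site's own pages are wanted, the vector can be filtered afterwards. A rough sketch, assuming a stand-in vector and the placeholder domain https://example.com (not the real site):

```r
# Stand-in for the html_attr("href") result; all values are made up
hrefs <- c(
  "https://example.com",
  "https://www.facebook.com/example",
  "https://example.com/about-us/",
  "https://example.com/about-us/"
)

# Keep links that start with the site's own address, then drop duplicates
internal <- unique(hrefs[startsWith(hrefs, "https://example.com")])
internal
```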

I assume the link is censored here because the domain name contains "sh**".


Thanks; I am new to the R language. Can we still solve the actual problem with a censored link?

I think the censoring is just on this forum. It isn't an issue within R itself.

Ok. I am getting the website's name twice, with NA appended to it, and I thought this was because of the censoring on this forum. Can we still resolve it despite the forum issue?

Your code

website_links = page %>% html_nodes("h1")%>% html_attr("href") %>% paste("http://www.tu■■■■a.info",.,sep="")
website_links

doesn't pull the right links. NA is not a link.

The h1 elements on the website are headings, not links. They have no href attribute, so html_attr("href") returns NA for each of them:

page %>% 
 html_nodes("h1")

# {xml_nodeset (2)}
# [1] <h1>Tu■■■■a Meditation Centre</h1>
# [2] <h1>Tu■■■■a Meditation Centre</h1>
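
To illustrate the difference on a page you can inspect fully, here is a small sketch with made-up HTML (the markup is an assumption, not the real site). The selector "a[href]" only matches anchors that actually carry an href attribute:

```r
library(rvest)

# Hypothetical stand-in page, parsed from a string instead of a URL
html <- '<html><body>
  <h1>Example Centre</h1>
  <a href="/about-us/">About us</a>
  <a>Anchor without href</a>
</body></html>'
page <- read_html(html)

# The heading has no href attribute, so this is NA
h1_href <- page %>% html_nodes("h1") %>% html_attr("href")
h1_href

# Only anchors with an href are selected, so no NA appears
hrefs <- page %>% html_nodes("a[href]") %>% html_attr("href")
hrefs
```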

Great, it works. I now select the link elements instead of the heading text. Thank you so much.


Or, if you have already pulled a lot of links, you can strip the NA afterwards:

link <- "http://www.tu■■■■a.infoNA"
gsub("NA","",link)
#> [1] "http://www.tu■■■■a.info"

Created on 2023-01-25 with reprex v2.0.2
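
One caveat: gsub("NA", "", x) removes every "NA" in the string, including one inside a path. A slightly safer sketch, assuming made-up links, anchors the pattern to the end of the string with "NA$":

```r
# Made-up links; the second has "NA" inside its path and should stay intact
links <- c(
  "http://www.example.infoNA",
  "http://www.example.info/NASA-talk",
  "http://www.example.info/about/"
)

# "NA$" matches only a trailing "NA", so the path in link 2 is untouched
cleaned <- sub("NA$", "", links)
cleaned

# For comparison, the unanchored pattern also alters link 2
unanchored <- gsub("NA", "", links)
unanchored
```

This would still strip a genuine link that happens to end in "NA", so dropping the missing values before paste() remains the cleaner fix.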