How to get around Cloudflare JS challenges?

dootmikes · May 23, 2023, 12:32pm

I'm relatively new to coding in general, and web scraping in particular. I'm working on a project to scrape about 170 webpages and have figured out what I want to pull, but the website uses Cloudflare, which is blocking requests from my RStudio session. The VPN IP I'm using isn't blocked, since I can still load the page in Chrome. I can also download individual pages as .html files and read them directly into R to run my scraping.
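For reference, here is a minimal sketch of that manual fallback, assuming the pages were saved by hand into a single folder. The function name `scrape_saved_pages`, the folder argument, and the `h1` selector are placeholders for illustration, not details from the actual project.

```r
library(rvest)

# Parse every saved .html/.htm file in a folder and pull one element from each.
# `dir` and `css` are hypothetical: swap in the real folder and selector.
scrape_saved_pages <- function(dir, css = "h1") {
  files <- list.files(dir, pattern = "\\.html?$", full.names = TRUE)
  values <- vapply(files, function(path) {
    page <- read_html(path)                # reads from disk, no HTTP request
    html_text2(html_element(page, css))    # NA if the selector matches nothing
  }, character(1))
  data.frame(file = basename(files), value = unname(values),
             stringsAsFactors = FALSE)
}
```

Something like `scrape_saved_pages("saved_pages")` would then return one row per saved file. This only confirms the parsing side works; it does nothing about the 403 itself.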

The site does not require me to click a button agreeing to its terms and conditions and does not require logging in.

library(httr)
library(rvest)

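# ua holds the browser user-agent string shown in the verbose output below
urls <- c("(WEBSITE REDACTED)")  # the ~170 page URLs go here

for (url in urls) {
  resp <- httr::GET(url, user_agent(ua))
  Sys.sleep(10)  # wait 10 seconds between requests
}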

Output: Error in open.connection(x, "rb") : HTTP error 403.

My guess is that Cloudflare is serving a JS challenge that my scraper can't pass. Here's the output when I run GET with verbose():

-> GET /city/78/(WEBSITE REDACTED)/ HTTP/1.1
-> Host: (REDACTED)
-> User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> a: 1
-> b: 2
-> 
<- HTTP/1.1 403 Forbidden
<- Date: Sat, 20 May 2023 00:42:16 GMT
<- Content-Type: text/html; charset=UTF-8
<- Transfer-Encoding: chunked
<- Connection: close
<- Cross-Origin-Embedder-Policy: require-corp
<- Cross-Origin-Opener-Policy: same-origin
<- Cross-Origin-Resource-Policy: same-origin
<- Permissions-Policy: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
<- Referrer-Policy: same-origin
<- X-Frame-Options: SAMEORIGIN
<- cf-mitigated: challenge
<- Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
<- Expires: Thu, 01 Jan 1970 00:00:01 GMT
<- Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=Btlukxxdgp1Li12sZhKpBUgryZpsC8xeSxnFeiEb3W%2FpGOBhe94xkN5AbrzRnI%2BwWCAFWzi92HTs6qHgAhfGsi%2FqmsCP0P2FoHNVDlvnPA%2FiYBIzRFD4DLA3MuC66NFqDqY%3D"}],"group":"cf-nel","max_age":604800}
<- NEL: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
<- Vary: Accept-Encoding
<- Server: cloudflare
<- CF-RAY: 7ca089eee81f4871-DFW
<- Content-Encoding: gzip
<-
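The telling lines in that output are the 403 status and the `cf-mitigated: challenge` header. A small sketch of how a script could flag such responses before trying to parse them; `is_cf_challenge` is a made-up helper name, and it takes the status code and headers as plain arguments so it can be tried without a live request.

```r
# TRUE when a response looks like a Cloudflare challenge page:
# status 403 plus a `cf-mitigated: challenge` header, as in the output above.
# `is_cf_challenge` is a hypothetical helper, not part of httr.
is_cf_challenge <- function(status, headers) {
  names(headers) <- tolower(names(headers))  # header names are case-insensitive
  mitigated <- headers[["cf-mitigated"]]
  status == 403 && !is.null(mitigated) &&
    identical(tolower(mitigated), "challenge")
}

# With an httr response `resp` (assumed to come from httr::GET):
# is_cf_challenge(httr::status_code(resp), httr::headers(resp))
```

This only detects the challenge so the loop can skip or log the page; it does not solve it.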

Attempting to access the page via RSelenium gets my browser caught in an endless "Checking if the site connection is secure" loop.

I know of a few Python libraries (cfscrape, for example) that are built to pass the JS challenge, but I haven't found an equivalent R package. Is there something out there I can use to get access to these pages?
