Failed to connect to port 443: Connection timed out

Hello, I'm planning a short workshop on web scraping and want the students to be able to use RStudio Cloud. When trying to connect to the site of interest to check the robots.txt file I repeatedly get this error:

Error in curl::curl_fetch_memory(url, handle = handle) : Failed to connect to port 443: Connection timed out

The second line is the most important part, as it is common to several errors I've received when trying different ways to connect.

This doesn't happen when using RStudio desktop. Is it a proxy thing? If it is - what does that mean and what would a solution look like?
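If it did turn out to be a proxy issue, the usual fix in R is to tell curl (which robotstxt uses under the hood) about the proxy via environment variables before making any requests. A minimal sketch, assuming a hypothetical proxy at proxy.example.com:8080 (replace with whatever your environment actually provides):

    # Hypothetical proxy settings -- host and port are placeholders
    Sys.setenv(https_proxy = "http://proxy.example.com:8080",
               http_proxy  = "http://proxy.example.com:8080")

    # curl, and packages built on it like robotstxt, pick these up automatically
    library(robotstxt)
    rt <- robotstxt("example.com")

As it turns out below, though, a proxy wasn't the problem here.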

Thanks for reading!

Code to reproduce my error:

library(robotstxt)
rt <- robotstxt("")

I expect rt to be a list of 11 elements, which is what happens when I run this on my own computer.

Trying my own site doesn't give me an issue in the cloud:

rt_works <- robotstxt("")

Hi @lizab!

I was able to reproduce your issue. Investigating a bit further using a few different AWS zones and regions, this actually appears to be the site (or its parent host) blocking all or nearly all AWS-based IP addresses. I'm not exactly sure why they would block all of these addresses in this way, but unfortunately it isn't really something we can have much of an effect on, as the issue appears to be pervasive far beyond the few addresses we manage.

Sorry about that!


Thank you so much for the effort, @stevenolen.
What a shame; it's not obvious in their T&Cs or robots.txt. Sigh.

Cheers! :beers:

That sucks, because learning web scraping by scraping sounds fantastic :partying_face:

I suspect they are blocking AWS access to reduce the number of spam bots posting reviews.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.