Excellent advice. I scrape a few sites for deals and push them into a Slack channel every few minutes for a side hustle. The websites often change URLs, HTML tags, div names, etc. Trying to keep up with it all is a pain, but good error handling and logging make it way more manageable.
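To make the error handling point concrete, here is a minimal sketch in Python, not my actual code. The names `LayoutChanged`, `parse_price` and `scrape_all` are made up for illustration, and it assumes each site has a function that returns a dict of already-extracted fields.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("deal_scraper")  # hypothetical logger name


class LayoutChanged(Exception):
    """Raised when an expected field is missing, e.g. after a site redesign."""


def parse_price(fields, site):
    """Turn the scraped 'price' text into a float, failing loudly if it is gone."""
    raw = fields.get("price")
    if raw is None:
        raise LayoutChanged(f"{site}: no 'price' field found; selector may be stale")
    return float(raw.replace("$", "").replace(",", ""))


def scrape_all(sites):
    """Run every site's scraper; one broken site should not stop the others.

    `sites` maps a site name to a zero-argument function returning a dict
    of extracted fields (an assumption made for this sketch).
    """
    results, failures = {}, []
    for site, fetch_fields in sites.items():
        try:
            results[site] = parse_price(fetch_fields(), site)
        except LayoutChanged as exc:
            # Expected kind of breakage: log it clearly and move on.
            log.warning("layout change: %s", exc)
            failures.append(site)
        except Exception:
            # Anything else: keep the full traceback in the log.
            log.exception("unexpected failure scraping %s", site)
            failures.append(site)
    log.info("scraped %d sites, %d failed", len(results), len(failures))
    return results, failures
```

The idea is that a renamed div shows up as one clear warning line per site instead of killing the whole run, so you can tell at a glance which scraper needs fixing.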
For managing my web scrapers, I run them on a VM on Google Cloud. $300 credit for a year for each email, and Gmail accounts are free… Plus the free tier options mean you can start out using a lot and scale down to free pretty easily.
I use Jenkins pipelines for building, executing and logging the code that runs. Jenkins is usually used in web development for testing and deployment based on changes to source control, but it is actually just a super deluxe crontab app. It’s a lot better than plain cron or Task Scheduler on Windows because it stores the results of multiple builds (i.e. schedules, runs, executions, etc.) and automatically passes anything sent to stdout to the log file, so you don’t need a bunch of extra logging statements in your code. Best of all, it’s super easy to install and use on both Linux and Windows and has a boatload of documentation online.
For comparing old data versus the current data from the most recent scrape, I store results in a MySQL database and run an INSERT ... ON DUPLICATE KEY UPDATE ... statement.
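A rough sketch of what that upsert can look like from Python. The table and column names (`deals`, `url`, `title`, `price`) and both helper functions are invented for the example, and it assumes the key column has a PRIMARY KEY or UNIQUE index and that your driver uses `%s` placeholders, as mysql-connector-python and PyMySQL do.

```python
def build_upsert(table, key_cols, value_cols):
    """Build a parameterized INSERT ... ON DUPLICATE KEY UPDATE statement.

    Table and column names are interpolated directly, so they must come
    from your own code, never from scraped input. Row values go in later
    as query parameters.
    """
    cols = list(key_cols) + list(value_cols)
    placeholders = ", ".join(["%s"] * len(cols))
    # VALUES(col) refers to the value the INSERT tried to write; recent
    # MySQL 8.0 releases deprecate it in favor of a row alias.
    updates = ", ".join(f"{c} = VALUES({c})" for c in value_cols)
    return (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )


def describe_change(affected_rows):
    """Interpret MySQL's per-row affected-rows count for an upsert.

    By default MySQL reports 1 for a newly inserted row, 2 for an
    existing row that was updated, and 0 when nothing changed.
    """
    return {0: "unchanged", 1: "new", 2: "updated"}.get(affected_rows, "unknown")


sql = build_upsert("deals", ["url"], ["title", "price"])
print(sql)
```

You would pass `sql` plus a tuple like `(url, title, price)` to `cursor.execute`, then feed `cursor.rowcount` to `describe_change` to decide whether a deal is new or changed enough to post. That only works when upserting one row at a time, since the count is summed across rows otherwise.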