Automated web scraping with the targets package

Hi everyone,

Disclaimer: this question does not contain any reproducible code

I wrote a script to scrape this website: https://gasprices.aaa.com/ with the {targets} package. Since the data on the website is updated every day, I would like to schedule the script to run once a day with GitHub Actions. The issue I think I will face, however, is that since there will essentially be no change in my script when it is re-executed each day, no scraping will happen. Because I am using {targets}, I think I will just get the usual "skip pipeline" messages in the log.

Is my line of thinking correct? And if so, how can I solve this issue?

Here is a small reproducible example of my _targets.R file:

# Load packages ----

library(targets)
library(rvest)

# Scraping function ----

scrape_gas_price <- function(url){
  # Read the page, extract the price figure from the "p.numb" element,
  # and parse it as a number
  read_html(url) %>%
    html_element(css = "p.numb") %>%
    html_text() %>%
    stringr::str_squish() %>%
    readr::parse_number()
}

# Targets ----

list(
  # URL of the page to scrape
  tar_target(url, "https://gasprices.aaa.com/"),
  
  # Scrape the gas price from the page
  tar_target(gasprice, scrape_gas_price(url)),
  
  # Save the scraped value to a dated .rds file
  tar_target(save_gasprice, saveRDS(gasprice, paste0(Sys.Date(), ".rds")))
)

PS:

  • I know that one solution could be to delete the _targets/ directory in the project. This would force the whole pipeline to rerun every day; however, I see this solution as a hack.

This is really a question for GitHub Actions.

But if you are using GitHub's scheduled actions, my understanding is that they run as a cron-style job against the last commit on the main branch.

If you don't think it is running, can you get it to create a file with a timestamp in it?
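
For instance (a minimal sketch, assuming you add an R step to the workflow; the file name is arbitrary):

# Sketch: write the current time to a file so a later commit shows the job ran
writeLines(format(Sys.time(), tz = "UTC"), "last_run.txt")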

Maybe try tar_target(…, "your_url.com/file.csv", format = "url"). That target will check the file at the URL using the last-modified timestamp and the ETag (if available) and invalidate the target automatically if either changes. For running targets on GitHub Actions, the tar_github_actions() function generates a workflow file, and github.com/wlandau/targets-minimal is an example.
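
In pipeline form, a minimal sketch (the CSV URL is the hypothetical one from above, and readr::read_csv() stands in for whatever you do with the file):

library(targets)

list(
  # format = "url" tracks the ETag / last-modified metadata of the URL and
  # invalidates this target (and everything downstream) when either changes
  tar_target(raw_url, "https://your_url.com/file.csv", format = "url"),
  tar_target(data, readr::read_csv(raw_url))
)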

Thanks for your response, @wlandau, and especially for a great package.

Actually, the script is not downloading a file; it scrapes data from a webpage and stores the data locally (which is later pushed to a GitHub repo). So I am not quite sure how to use your suggestion, tar_target(…, "your_url.com/file.csv", format = "url"), in this case. Here is an example of a webpage that is scraped: https://gasprices.aaa.com/ (AAA Gas Prices).

Someone in a Slack channel I belong to suggested using a cue, which I had never heard of. Would this be a possible solution to explore?

Thank you.

Also, thank you for letting me know about tar_github_actions().

EDIT

I added the cue = tar_cue(mode = "always") argument to the main scraping target; however, it does not seem to run on GitHub Actions. It runs perfectly on my computer, though.
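
For reference, the scraping target now looks roughly like this (a sketch based on the _targets.R above):

# cue = tar_cue(mode = "always") forces this target to rerun on every
# tar_make(), even when nothing upstream has changed
tar_target(
  gasprice,
  scrape_gas_price(url),
  cue = tar_cue(mode = "always")
)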

EDIT 2

I modified the targets.yaml file a bit and now it works as intended. The yaml file produced by tar_github_actions() contains several lines that I do not understand, so I just rewrote it with simpler tasks (e.g. installing packages manually). This is what it looks like now:

# Hourly scraping
name: us_gas_prices_scraper

# Controls when the action will run.
on:
  push:
    branches:
      - main
      - master


jobs: 
  autoscrape:
    # The type of runner that the job will run on
    runs-on: macos-latest

    # Load repo and install R
    steps:
    - uses: actions/checkout@master
    - uses: r-lib/actions/setup-r@master

    # Set-up R
    - name: Install packages
      run: |
        R -e 'install.packages(c("targets", "rvest", "dplyr", "stringr", "purrr", "here", "glue"))'
    
    - name: Run scraper
      run: |
        Rscript _targets.R
        R -e 'targets::tar_make()'

    # Add new files in data folder, commit along with other modified files, push
    - name: Commit files
      run: |
        git config --local user.name github-actions
        git config --local user.email "actions@github.com"
        git add .
        git commit -am "US gas price data scraped on $(date)"
        git push origin master
      env:
        REPO_KEY: ${{secrets.GITHUB_TOKEN}}
        username: github-actions
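
Note: as posted, this workflow only triggers on pushes to main/master. For a truly periodic run, GitHub Actions also supports a cron-based schedule trigger in the on: block; a sketch (the time below is an arbitrary example):

on:
  push:
    branches:
      - main
      - master
  schedule:
    # run once a day at 06:00 UTC (example time; adjust as needed)
    - cron: "0 6 * * *"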

Thank you for your response, @CALUM_POLWART.

Would you please explain why you think this is a GHA question?

And yes, your idea of creating a file with a timestamp seems like something to look into. Thank you very much.


Great, sounds like you solved the issue. And yes, a cue is a good workaround. tar_cue(mode = "always") is great if you always want to scrape the data. Then, if the hash of that data did not change since the last run, the downstream targets may be skipped. tarchetypes::tar_change() is another way to go about this if you have some way of checking the modification time, etc., of the website you are scraping, but that may not be necessary in your case if the actual scraping step is computationally efficient.
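
For illustration, a sketch of the tar_change() approach. The httr call is just one hypothetical way to obtain something that changes when the page does, and it assumes the site sends a Last-Modified header:

library(targets)
library(tarchetypes)

list(
  tar_target(url, "https://gasprices.aaa.com/"),
  # tar_change() reruns gasprice whenever the value of `change` differs from
  # the previous run (here: the page's Last-Modified header, if provided)
  tar_change(
    gasprice,
    scrape_gas_price(url),
    change = httr::HEAD(url)$headers[["last-modified"]]
  )
)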


Well, because this isn't an R issue. It's a GitHub issue, unless I'm completely missing the point?

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.