Scraping the ubiquitous ArcGIS dashboards

I just had to modify my scraping code again because yet another website migrated its COVID-19 data to the ubiquitous ArcGIS template. I hate those dashboards, especially the map with bubbles on it. That has to be the worst possible way to illustrate the data. But anyway, as a service to the community, I thought I would document how I scrape these sites, to help anyone else who may be trying to figure it out. I don't claim this is the best way to do it; all I can claim is that it works for me. This data happens to be for the Texas prison system.

To discover the magical XPath expression that pulls out the desired part of the page, I use the built-in Inspect tool in Chrome or Firefox to highlight the relevant section of the page, then right-click it and choose Copy -> XPath. There are numerous online references on how to do this.
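As a minimal sketch of what a copied XPath does once you have it, here is the same kind of selection run against a tiny made-up HTML string instead of a live dashboard. The HTML and the text in it are invented for illustration; only the `//*[@id="ember194"]` pattern comes from the script in this post.

```r
library(xml2)
library(rvest)

# Made-up stand-in for a rendered page (the real dashboard is far larger)
toy_page <- '<html><body>
  <div id="ember100">Header</div>
  <div id="ember194">Beto Unit</div>
</body></html>'

# The XPath copied from the browser selects the element with that id
node <- html_nodes(read_html(toy_page), xpath = '//*[@id="ember194"]')

# html_text() drops the markup and keeps only the text
unit_text <- html_text(node)
print(unit_text)
```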

I run several of these every evening on a cron job.

library(tidyverse)
library(stringr)
library(xfun) # because RSelenium needs it internally

url <- "https://txdps.maps.arcgis.com/apps/opsdashboard/index.html#/dce4d7da662945178ad5fbf3981fa35c"

# start the server and browser in headless mode
rD <- RSelenium::rsDriver(browser="firefox",
               extraCapabilities = list("moz:firefoxOptions" = list(
                 args = list('--headless')))
)

driver <- rD$client

# navigate to a URL
driver$navigate(url)
Sys.sleep(9)

# get parsed page source
parsed_pagesource <- driver$getPageSource()[[1]]

#close the driver
driver$close()

#close the server
rD$server$stop()

#   Save in case the rest of the code crashes, like when they update the page on you
saveRDS(parsed_pagesource,paste0("/home/ajackson/Dropbox/Rprojects/Covid/DailyBackups/",lubridate::today(),"_ParsedPagePrisons.rds"))

#---------------------------------------------------------------------
#   Extract prison info
#---------------------------------------------------------------------

result <- xml2::read_html(parsed_pagesource) %>%
  # select out the part of the page you want to capture
  rvest::html_nodes(xpath='//*[@id="ember194"]') %>%
  # convert it to a really long string, getting rid of html
  rvest::html_text() %>% 
  # there are a lot of carriage returns in there, let's clean them out
  str_replace_all("\n"," ") %>% 
  # Split string on long strings of spaces, returning a list
  str_split("  +")

 
# get rid of title and extra line at end
result <- result[[1]][3:(length(result[[1]])-1)]

# every other element of list is a Unit, so let's combine the Unit name
# with the table it used to head, to get the first iteration of a data frame
res <- cbind.data.frame(split(result, 
                              rep(1:2, times=length(result)/2)), 
                        stringsAsFactors=FALSE)
#assign some better names
names(res) <- c("Unit", "foo") 

res <- res %>% 
  # add dash after numbers for later splitting
  mutate(foo=str_replace_all(foo, "(\\d) ", "\\1 -")) %>% 
  # remove all whitespace (some of it is tabs); "\\s+" avoids matching empty strings
  mutate(foo=str_remove_all(foo, "\\s+")) %>% 
  # remove commas from numbers
  mutate(foo=str_remove_all(foo, ",")) %>% 
  # split the field into 12 pieces
  separate(foo, letters[1:12], sep="-") %>% 
  # select out the numeric fields
  select(Unit, b,d,f,h,j,l) %>% 
  # make them numeric
  mutate_at(c("b","d","f","h","j","l"), as.numeric)

# give every field a bright, shiny new name
names(res) <- c("Unit", 
                "Offender Active Cases",
                "Offender Recovered",
                "Employee Active Cases",
                "Employee Recovered",
                "Medical Restriction",
                "Medical Isolation")


# add a field with today's date
res <- res %>% mutate(Date=lubridate::today()) 

# let's see what it looks like - this is for QC
res

#  now save or do whatever.....
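To make the string clean-up in the middle of the script easier to follow, here is a small sketch that runs the same three steps on a single made-up string. The input format ("label - number" pairs separated by spaces) is an assumption inferred from the regular expressions above, not something copied from the dashboard, and the numbers are invented.

```r
library(stringr)

# Assumed shape of one "foo" value: label - number pairs (made-up values)
raw <- "Offender Active Cases - 1,234 Offender Recovered - 56"

# 1. add a dash after each number that is followed by a space
step1 <- str_replace_all(raw, "(\\d) ", "\\1 -")
# 2. remove all whitespace
step2 <- str_remove_all(step1, "\\s+")
# 3. remove the thousands separators
step3 <- str_remove_all(step2, ",")

# Splitting on "-" now alternates label, number, label, number
pieces <- str_split(step3, "-")[[1]]
values <- as.numeric(pieces[c(2, 4)])
print(pieces)
print(values)
```

In the full script, `separate()` does the splitting across all rows at once, and the even-numbered pieces (b, d, f, ...) are the numeric ones.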

This post was exactly what I was looking for. They may have changed the page again, or I may not fully understand how it all goes together, but I couldn't get ember194 to work. When I ran it with ember199 in the xpath it seemed to work for me. Thank you for this. Would you happen to have this script running each night, with the results published somewhere to review online? I'm an epidemiologist and have been asked to group these numbers together and display the daily totals for each of my respective counties. It seems like duplicate work since they already have the dashboard published and anyone can view the numbers there, but that's what I've been asked to figure out for my local area. Thank you again for making this and for your help.


I should note that on my to-do list is using the prison data to clean up the county data, since at least some counties recently began adding the prison cases to the county totals, resulting in large jumps in some cases. I haven't yet decided what I am going to do; I need to do some experiments. Right now I'm finishing up a study on how to do piece-wise fitting of exponentials to the data, since it is clear that the doubling time changes based on the take-up of counter-measures. I'm pretty chuffed with it so far; it looks like it will help flag issues and improve the near-term predictions.
The prison data is being collected but not updated right now, because the website was so unstable for a while that I gave up and just capture the page and save it. But now I need to get that working so I can do the corrections to the full dataset.

Happy to help.

Yes, every time they rearrange the website, the "ember194" id in the xpath has to be changed, sadly.
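One possible way to soften that, which I have not tested against the real dashboard, is to look the element up by a label it contains rather than by its generated id. The sketch below uses a made-up page and a hypothetical helper called `find_widget_id`; it assumes the widget you want is the innermost "ember" element whose text contains a known label.

```r
library(xml2)
library(rvest)

# Made-up stand-in for the rendered dashboard
toy_page <- '<html><body><div id="ember5">
  <div id="ember12"><p>Some other widget</p></div>
  <div id="ember199"><h3>Unit list</h3><p>Beto Offender Active Cases 5</p></div>
</div></body></html>'

# Hypothetical helper: id of the innermost "ember" element containing label
find_widget_id <- function(page_source, label) {
  xp <- sprintf('//*[starts-with(@id, "ember")][contains(., "%s")]', label)
  nodes <- html_nodes(read_html(page_source), xpath = xp)
  if (length(nodes) == 0) stop("no element contains the label: ", label)
  # outer containers match too, so keep the match with the shortest text
  innermost <- nodes[[which.min(nchar(html_text(nodes)))]]
  html_attr(innermost, "id")
}

widget_id <- find_widget_id(toy_page, "Offender Active Cases")
print(widget_id)
```

The returned id could then be pasted into the existing xpath, or the script could at least stop with a clear message when the label is no longer found.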

All my code is on github, at https://github.com/alankjackson/Covid

Files beginning with "Update_" go scrape data from somewhere. I finally realized that it would make the process more robust to have a single scraper for each website, so if one failed, the rest were unaffected.
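I keep the scrapers in separate files, but the same isolation idea can be sketched inside a single R session with `tryCatch`. This is only an illustration: `run_scraper` and the two toy scraper functions are hypothetical names, not code from my repository.

```r
# Hypothetical wrapper: run one scraper, log the outcome, never abort the rest
run_scraper <- function(name, scraper) {
  tryCatch({
    result <- scraper()
    message(format(Sys.time()), " ", name, ": ok")
    result
  }, error = function(e) {
    message(format(Sys.time()), " ", name, ": FAILED - ", conditionMessage(e))
    NULL
  })
}

# Toy scrapers: one succeeds, one fails the way a rearranged page would
good <- run_scraper("tests", function() 42)
bad  <- run_scraper("prisons", function() stop("page layout changed"))
```

A failed scraper returns `NULL`, so later steps can check for that and skip the affected data set.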

I run on Linux, so I run the Update files on a cron, then I preprocess the data locally from a cron, and then I upload the data to my own website where shiny can access it. That is to relieve the shiny server of some of the CPU load - it was starting to time out during the preprocessing. There may be better ways of doing that, but it's what I have evolved to.

The cron looks like:
#--------------- State Health Department Covid data, download daily
30 18 * * * /usr/lib/R/bin/Rscript '/home/ajackson/Dropbox/Rprojects/Covid/UpdateData.R' >> '/home/ajackson/Dropbox/Rprojects/Covid/Retrieve.log' 2>&1
30 23 * * * /usr/lib/R/bin/Rscript '/home/ajackson/Dropbox/Rprojects/Covid/Update_Prison.R' >> '/home/ajackson/Dropbox/Rprojects/Covid/Retrieve.log' 2>&1
31 23 * * * /usr/lib/R/bin/Rscript '/home/ajackson/Dropbox/Rprojects/Covid/Update_Tests.R' >> '/home/ajackson/Dropbox/Rprojects/Covid/Retrieve.log' 2>&1
32 23 * * * /usr/lib/R/bin/Rscript '/home/ajackson/Dropbox/Rprojects/Covid/UpdateHarrisZipcodeData.R' >> '/home/ajackson/Dropbox/Rprojects/Covid/Retrieve.log' 2>&1
40 23 * * * /usr/bin/weex mylinux
#--------------- Universal calculator
35 23 * * * /usr/lib/R/bin/Rscript '/home/ajackson/Dropbox/Rprojects/Covid/Build_Covid_Files.R' >> '/home/ajackson/Dropbox/Rprojects/Covid/Build.log' 2>&1

The actual data is available at:
https://www.ajackson.org/Covid/
Today_County_calc.rds
Today_County_pop.rds
Today_TestingData.rds
Today_MSAs.rds
Today_MSA_raw.rds
Today_Prison_data.rds
Today_Prison_county.rds
Today_MappingData.rds
Today_MapLabels.rds

This all feeds https://ajackson.shinyapps.io/CovidTexas/


I was able to take the data for today and figure out how to total the number of cases across the units in each of the counties I'm interested in. Here's the code I used. It may not be the cleanest, but it got the job done, in case anyone is interested or finds it helpful.


#these are all the units I'm interested in
neth <- c("Michael", "Coffield", "Gurney", "Beto", "Johnston")

#splitting the units by the county they fall under 
Anderson <- c("Michael", "Coffield", "Gurney", "Beto")
Wood <- c("Johnston")

#take the "res" results from above and add a county column and label the appropriate ones
res2 <- res %>%
    filter(Unit %in% neth) %>%
    # %in% keeps working if more units are added to Wood later
    mutate(County = ifelse(Unit %in% Wood, "Wood", "Anderson"))%>%
    arrange(County) %>%
    group_by(County)

#alternative way to do this   
#res2$County <- if_else(res2$Unit %in% Wood, "Wood", "Anderson")  


#since there are multiple units in Anderson, total all the numbers by county and keep just the Anderson row
#(lubridate::today() is used for the date; a bare date() errors if lubridate is attached)
resA <- res2 %>%
    summarize(`Unit` = "Total_Units", 
        `Offender Active Cases` = sum(`Offender Active Cases`), 
        `Offender Recovered` = sum(`Offender Recovered`), 
        `Employee Active Cases` = sum(`Employee Active Cases`), 
        `Employee Recovered` = sum(`Employee Recovered`), 
        `Medical Restriction` = sum(`Medical Restriction`), 
        `Medical Isolation` = sum(`Medical Isolation`), 
        `Date` = lubridate::today()) %>%
    filter(County == "Anderson")

#then combines the new total column back into the first group, for those wanting to compare across Units 
resT <- rbind(resA, res2)
resT$Date <- lubridate::today()

#view it to see how it looks 
View(resT)

#transposing the data to see it a different way 
res3 <- as.data.frame(t(resT))

#view it in its different format
View(res3)

#saving to your working directory
write.csv(res3, "res3.csv")
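If more counties get added later, one alternative (a sketch, not what I actually ran) is to keep the unit-to-county mapping in a small lookup table and join it on. The `toy_res` data frame below uses invented numbers and only two of the columns from `res`; `unit_county` is a hypothetical name for the lookup table.

```r
library(dplyr)

# Lookup table: which county each unit of interest falls in
unit_county <- tibble::tribble(
  ~Unit,      ~County,
  "Michael",  "Anderson",
  "Coffield", "Anderson",
  "Gurney",   "Anderson",
  "Beto",     "Anderson",
  "Johnston", "Wood"
)

# Made-up stand-in for the scraped "res" data frame
toy_res <- tibble::tibble(
  Unit = c("Michael", "Coffield", "Johnston", "Other"),
  `Offender Active Cases` = c(1, 2, 3, 4),
  `Employee Active Cases` = c(10, 20, 30, 40)
)

# inner_join drops units that are not in the lookup table
county_totals <- toy_res %>%
  inner_join(unit_county, by = "Unit") %>%
  group_by(County) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")

print(county_totals)
```

This gives one row per county, and adding a county only means adding rows to the lookup table.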

Wow, you've done a lot of great work. That's awesome. I'll definitely have to explore it some more. Yeah, they've asked a lot of us to start reporting the prison data and regular counts together. I don't know if they are going to keep it that way forever though. It really doesn't make sense duplicating the work as it seems it would confuse people, especially since TDCJ is already reporting it, and they have their own system outside of the regular Texas data.

I'd be interested to see the study when you get it finished. Thanks again

Also see adelieresources.com. Back in April I did some analysis of prisons and testing.