Web scraping problem

I have used numerous methods to scrape this web page ... but none of them work, EXACTLY. If it is NOT exact, then I might as well copy/paste it ... the problem being there are about 50 of them.

Suggestions?

Thanks
R Watson

Oops

web page
https://water.sam.usace.army.mil/gage/acf/prob1-19.txt

Hi,

Welcome to the RStudio community!

The file is in a horrible format indeed, not adhering to any of the rules :slight_smile:
Assuming they all are in a similar format, this is my approach:

library(stringr)
library(purrr)

#Read per line
myTable = readLines("https://water.sam.usace.army.mil/gage/acf/prob1-19.txt")
#Extract the col names (assuming always 4th line)
tableColNames = str_extract_all(myTable[[4]], "[^\\s+]+") %>% unlist

#Merge all values
myTable = map(myTable[1:length(myTable)], function(x){
  #Extract the values, but also white space that's an NA (exactly 7 spaces)
  vals = as.numeric(str_match_all(x, "[\\s]{3}(\\s{7}|\\d+\\.?\\d*)")[[1]][,2])
  
  #Add NAs to end if last values are empty (not put in file as empty space)
  if(length(vals) < 12){
    vals = c(vals, rep(NA, 12 - length(vals)))
  }
  
  #Only return lines where there are values (discard the rest)
  if(is.na(vals[1])){
    c()
  } else {
    vals
  }

})

#Bind all rows together to make a df
myTable = as.data.frame(do.call(rbind, myTable))

#Put in row and column names
rownames(myTable) = c(1:(nrow(myTable) - 3), "AVG", "MAX", "MIN")
colnames(myTable) = tableColNames

RESULT

> tail(myTable)
        OCT     NOV     DEC     JAN     FEB     MAR     APR     MAY     JUN     JUL     AUG     SEP
29  1069.38 1070.33 1073.39 1072.58      NA 1072.52 1072.46 1070.85 1071.35 1070.82 1069.79 1067.81
30  1069.36 1070.30 1073.58 1072.39      NA 1072.49 1072.31 1070.83 1071.35 1070.77 1069.71 1067.73
31  1069.33      NA 1073.80 1072.12      NA 1072.47      NA 1070.79      NA 1070.72 1069.68      NA
AVG 1069.65 1070.24 1071.07 1073.14 1072.89 1074.05 1072.59 1071.50 1071.36 1071.13 1070.32 1068.69
MAX 1070.05 1071.52 1073.80 1074.69 1076.10 1075.99 1073.53 1072.15 1071.72 1071.42 1070.79 1069.64
MIN 1069.31 1069.32 1070.18 1072.12 1071.06 1072.47 1071.97 1070.79 1070.73 1070.72 1069.68 1067.73

Let me know if it works for the rest too...

Hope this helps,
PJ

Thank you very much.
This did work for all of them.
Bob

Actually, this is the line I needed:

myTable = readLines("https://water.sam.usace.army.mil/gage/acf/prob1-19.txt")

I swear (undoubtedly I didn't) I tried this previously. I needed to scrape about 50 files (plus/minus).
I put this line in a function with the filename actually being a parameter with a "driving for loop".
The speed with which it scraped the 50 sites was unbelieveable (about one second) ...
so fast I thought something must have been wrong (but it was not).

Once I got the 50 onto my PC, I parsed them with a Delphi program They were "enough different"
that your routine did not fully work ... also there are some negative numbers and apparently
your regis string did not like the negative sign ... and don't know enough of that stuff to change it.

BUT (all in all) everything is good now and I am still amazed at one second for scraping.

Thanks,
Bob

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.