Find a specific string in multiple txt files and create a dataframe with the extracted key words

learning1 · May 30, 2023, 1:59pm

Hello Posit Community,
I would like to know how to read multiple txt files that contain variables such as year, month, day, maximum temperature, minimum temperature, and precipitation, combine all the files into one dataframe, and finally add a new column with the name of each city.
The name of the city appears in each txt file after a fixed string ("Name of time serie"). In this example, I would extract the word "Barcelona" and fill a new column with this city name. Then I want to do the same for every other txt file.
Example of the txt structure:

#######
tex_file: 1
Name of time serie: Barcelona
text_line1
text_line2
text_line3
(ten or eleven lines of descriptive information; the number can differ between txt files)

year  month   day   PPTX   TX   TN  
1950   1      1       0     12.9  2.1
1950   1      2       0     14.9  3.1
1950   1      3       0     13.5  4.1
(data continue until 2022)...
#######

I am trying this code, but I don't know how to continue and create the new column with the name of the city for each txt file (60 in total):

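library(data.table)

files_to_read <- list.files("C:/Users/RStudio/project")

# Read only the data columns from one file
read_a_file <- function(x) {
  fread(
    file = file.path("C:/Users/RStudio/project", x),
    select = c("year", "month", "day", "PPTX", "TX", "TN")
  )
}

# Read every file and stack the results into one data frame
myresults <- purrr::map_dfr(files_to_read, read_a_file)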

scottyd22 · May 30, 2023, 6:39pm

Below is one approach, which uses stringr to parse the text to extract the city name.

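# full.names = TRUE returns complete paths, so no file.path() is needed later
files_to_read <- list.files("C:/Users/RStudio/project",
                            full.names = TRUE)

read_a_file <- function(x) {
  # Read only the data columns from the file
  out <- data.table::fread(
    file = x,
    select = c("year", "month", "day", "PPTX", "TX", "TN")
  )

  # read.csv() treats the first line as a header, so [1, 1] is the file's
  # second line, which holds the city name in the example layout
  out$city <- stringr::str_remove(read.csv(x)[1, 1], 'Name of time serie: ')

  out
}

# Read every file and stack the results into one data frame
purrr::map_dfr(files_to_read, read_a_file)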

FactOREO · May 30, 2023, 6:43pm

Your problem is the "non-standard" structure of your tabular data. You might want to use a command line tool for such a task (e.g. use awk within a Linux distribution) to deal with this issue. From within R, you could do the following:

library("data.table")

# read the data
data <- fread("./myFile.txt")
# read the city
city <- system(command = 'grep "Name of" myFile.txt | head -1 | cut -d":" -f2', intern = TRUE) |> trimws()
# combine
data[, city := city]

data

   year month day PPTX   TX  TN       city
1: 1950     1   1    0 12.9 2.1 Barcelona
2: 1950     1   2    0 14.9 3.1 Barcelona
3: 1950     1   3    0 13.5 4.1 Barcelona

This code runs a chain of command-line tools, started from R via system(), to get the city name. @scottyd22 provided a useful option if you really have to stay within R, but from my perspective that is overkill given how few command-line tools you actually need to solve this task. If you are running R on a Windows machine this might fail, however, since those command-line tools are not available there (or I don't know what they are called in the Windows command line).
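For anyone who cannot rely on grep, head and cut being installed (Windows, for example), the same three steps can probably be written in base R. This is only a sketch: the sample file is made up to match the layout in the question, and `path`, `lines` and `name_line` are names I chose for illustration.

```r
# Write a small sample file so the sketch runs on its own
# (made-up content that mirrors the layout shown in the question).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "tex_file: 1",
  "Name of time serie: Barcelona",
  "text_line1",
  "",
  "year  month   day   PPTX   TX   TN",
  "1950   1      1       0     12.9  2.1"
), path)

lines <- readLines(path)

# Equivalent of: grep "Name of" myFile.txt | head -1
name_line <- grep("Name of", lines, value = TRUE, fixed = TRUE)[1]

# Equivalent of: cut -d":" -f2, followed by trimws()
city <- trimws(strsplit(name_line, ":", fixed = TRUE)[[1]][2])

city
```

Like the cut call, this keeps only the text between the first and second colon, so a city label that itself contains a colon would be truncated.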

learning1 · May 31, 2023, 10:43am

@scottyd22, that was a nice approach to this file-reading problem. In my code, at least, the readLines() function, which returns the whole text as a character vector, worked better (I think it depends on the txt file, which I did not provide).
Then I had to look up the position of the string inside the brackets, as you indicated.

Thanks for your time and your help!

learning1 · May 31, 2023, 10:54am

Thank you very much for your help. Amazing and elegant approach with Linux commands; laconic code is well received in the eyes of any reader.
That works well for a single example file, so it could be successfully extended to a whole directory.
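In case it helps later readers, here is a rough sketch of how the per-file idea might be extended to a whole directory. It uses readLines() and base R only instead of shell tools, so it should not depend on the operating system. The helper `read_station()`, the demo directory, the second city ("Madrid") and all data values are made up for illustration; it also assumes the column-name line always starts with year, month, day.

```r
# Hypothetical helper: read one station file using base R only.
read_station <- function(path) {
  lines <- readLines(path)

  # City name: whatever follows the "Name of time serie:" label.
  name_line <- grep("Name of time serie:", lines, value = TRUE, fixed = TRUE)[1]
  city <- trimws(sub(".*Name of time serie:", "", name_line))

  # The table starts at the line that holds the column names, so the
  # number of descriptive lines above it does not matter.
  header_row <- grep("^\\s*year\\s+month\\s+day", lines)[1]
  dat <- read.table(text = lines[header_row:length(lines)], header = TRUE)

  dat$city <- city
  dat
}

# Demo directory with two made-up files so the sketch runs on its own.
demo_dir <- file.path(tempdir(), "stations_demo")
dir.create(demo_dir, showWarnings = FALSE)
make_file <- function(city, n_desc) {
  writeLines(c(
    "tex_file: 1",
    paste("Name of time serie:", city),
    paste0("text_line", seq_len(n_desc)),
    "",
    "year  month   day   PPTX   TX   TN",
    "1950   1      1       0     12.9  2.1",
    "1950   1      2       0     14.9  3.1"
  ), file.path(demo_dir, paste0(city, ".txt")))
}
make_file("Barcelona", 3)
make_file("Madrid", 5)

# Read every txt file in the directory and stack the results.
files_to_read <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
all_data <- do.call(rbind, lapply(files_to_read, read_station))

all_data
```

With the real data, `demo_dir` would be replaced by the project folder. For 60 long files, data.table::rbindlist() may be faster than do.call(rbind, ...), but the base version avoids extra packages.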
