Read selected columns from multiple CSV and create column with source ID

salvelinusbob · April 8, 2018, 4:42pm

We get groundwater depth data from data loggers as a .txt file. The format of this data not changeable. It has consistent column names except that the first column name contains the well id. For each well, the data is structured:

"well 02,Time,Current(feet),Serial Number
1,2018-02-11 11:18:44,-5.00,020013603
2,2018-02-11 17:18:44,7.30
3,2018-02-11 23:18:44,5.40
4,2018-02-12 05:18:44,0.80
5,2018-02-12 11:18:44,12.60..."

"Well 17,Time,Current(feet),Serial Number
1,2018-02-11 00:32:01,-5.00,000025390
2,2018-02-11 06:32:01,5.45
3,2018-02-11 12:32:01,5.40
4,2018-02-11 18:32:01,5.40..."

I am trying to create a script that merges csv .txt files from multiple into one long data frame. I only need column 2 and 3 from these files. I also need to add an id column noting which well the observation came from. In the end, I would like the to have a 3xn df with colnames = well_id, obs_date, observation

Note that the file name also contains the well id.

So far...

txt_list <- list.files(pattern = ".txt", full.names = TRUE)
txt_read_csv <- map(txt_list,
read_csv,
col_names = TRUE,
col_types = "iTdc"
)

...gets me a list of tibbles. But I cannot figure out how to select and mutate tibbles within a list to reduce the columns and assign an ID. I tried map_df, but since the first column name is different in each .txt the df explodes into a wide table with multiple columns for each different colname.

I can do this with a loop, but I'd really like to keep it tidy.

Thanks for any help.

mishabalyasin · April 8, 2018, 5:14pm

Hi, I think you are pretty much there with your approach, one small step left is to include ID inside of a dataframe. I would have included actual code, but since your problem is not reproducible I'll write what I think is close enough and you'll need to adapt it a bit to your requirements.

Main idea is that instead of using read_csv directly you wrap it in a function that ostensibly will do the same thing, but will also add all the info you need. It would look something like that:

extract_data <- function(txt_file){
  df <- readr::read_csv(txt_file, col_names = TRUE, col_types = "iTdc")
  well_name <- names(df) %>% `[[`(1)
  well_id <- stringi::stri_extract(well_name, regex = "[0-9]+")
  df <- df %>%
    dplyr::select(Time, `Current(feet)`)
    dplyr::mutate(well_id = well_id)
  df
}

With this function you can use purrr::map_dfr and you'll get the dataframe you need.