How to automate multiple csv files and how to work on them

Hi,

Please I need help to automate this process. I have multiple csv files for each pollutant. So for a folder called "C02", it has about 50 csv files ( each csv file corresponds to a year), the other pollutant have similar number of csv files. The attached is the code I have been using. I want to automate this ( for the csv files for each pollutant) since that is going to save time and resources.

#===== 1980 ======
hap1980 <- read.csv("daily_HAPS_1980.csv")

t80 <- hap1980[hap1980$State.Name %in% c("Michigan", "Ohio","Indiana","Pennsylvania") , c('State.Name','County.Code','County.Name', 'Longitude', 'Latitude','Arithmetic.Mean')]

cord <- c("Latitude", "Longitude")
t80$GrpHAP <- do.call(paste,c(t80[cord], sep=" , "))

tm80 <- t80 %>%
group_by(State.Name,County.Name,Latitude,Longitude, GrpHAP) %>%
summarise(Trimmed_mean= mean(Arithmetic.Mean, trim=0.1), Average = mean(Arithmetic.Mean))

write.csv(tm80,"path\Timmedmeans_1980.csv", row.names = FALSE)

#=========================== 1981 =================================

hap1981 <- read.csv("daily_HAPS_1981.csv")

t81 <- hap1981[hap1981$State.Name %in% c("Michigan", "Ohio","Indiana","Pennsylvania") , c('State.Name','County.Code','County.Name', 'Longitude', 'Latitude','Arithmetic.Mean')]

cord <- c("Latitude", "Longitude")
t81$GrpHAP <- do.call(paste,c(t81[cord], sep=" , "))

tm81 <- t81 %>%
group_by(State.Name,County.Name,Latitude,Longitude, GrpHAP) %>%
summarise(Trimmed_mean= mean(Arithmetic.Mean, trim=0.1), Average = mean(Arithmetic.Mean))

write.csv(tm81,"path\Timmedmeans_1981.csv", row.names = FALSE)

#=========================== 1982 =================================

hap1982 <- read.csv("daily_HAPS_1982.csv")

t82 <- hap1982[hap1982$State.Name %in% c("Michigan", "Ohio","Indiana","Pennsylvania") , c('State.Name','County.Code','County.Name', 'Longitude', 'Latitude','Arithmetic.Mean')]

cord <- c("Latitude", "Longitude")
t82$GrpHAP <- do.call(paste,c(t82[cord], sep=" , "))

tm82 <- t82 %>%
group_by(State.Name,County.Name,Latitude,Longitude, GrpHAP) %>%
summarise(Trimmed_mean= mean(Arithmetic.Mean, trim=0.1), Average = mean(Arithmetic.Mean))

write.csv(tm82,"path\Timmedmeans_1982.csv", row.names = FALSE)

#=========================== 1983 =================================

hap1983 <- read.csv("daily_HAPS_1983.csv")

t83 <- hap1983[hap1983$State.Name %in% c("Michigan", "Ohio","Indiana","Pennsylvania") , c('State.Name','County.Code','County.Name', 'Longitude', 'Latitude','Arithmetic.Mean')]

cord <- c("Latitude", "Longitude")
t83$GrpHAP <- do.call(paste,c(t83[cord], sep=" , "))

tm83 <- t83 %>%
group_by(State.Name,County.Name,Latitude,Longitude, GrpHAP) %>%
summarise(Trimmed_mean= mean(Arithmetic.Mean, trim=0.1), Average = mean(Arithmetic.Mean))

write.csv(tm83,"path\Timmedmeans_1983.csv", row.names = FALSE)

There are many ways to do it, I think the simplest is with an explicit for loop.

First, make a list of all the files that need processing:

all_files <- list.files("path",  pattern = "csv$")

for(current_file in all_files){
   hap <- read.csv(current_file)

 [...]

  write.csv(tm, paste0("path/Timmedmeans_", current_file, ".csv"), row.names = FALSE)
}

And you fill in the code for a single file within the for loop.

You can find a lot more explanations in r4ds.

3 Likes

Thank you AlexisW. I would give it a try and let you know how it goes

Hi ,

I tried the code but i keep getting errors. Please be patient if the error seems obvious, I'm very new to R and to for loops in general.

all_files <- list.files("path", pattern = "csv$")

results <- vector("list", length(current_file))
for(current_file in all_files){
hap <- read.csv(current_file)
count <- nrow(hap)
t82 <- hap[hap$State.Name %in% c("Michigan", "Ohio","Indiana","Pennsylvania") , c('State.Name','County.Code','County.Name', 'Longitude', 'Latitude','Arithmetic.Mean')]

cord <- c("latitude", "longitude")
t82$grphap <- do.call(paste,c(t82[cord], sep=" , "))

tm82 <- t82 %>%
group_by(state.name,county.name,latitude,longitude, grphap) %>%
summarise(trimmed_mean= mean(arithmetic.mean, trim=0.1), average = mean(arithmetic.mean))

write.csv(tm, paste0("path/timmedmeans_", current_file, ".csv"), row.names = FALSE)
results[current_file] <- count
}

Have a look at this perhaps?

Yes williaml I have . I tried it but it didn't get the right output

We don't have any of your files anyway, so can't reproduce your issue. It might help if you posted the error that you are having.

Thank you guys. It is working now. Yyaayyyyyyy :laughing: :smiley:. Thank you !!!!!!

1 Like

Here is what I do in a similar situation

file_template <- 
  "https://info.gesundheitsministerium.gv.at/data/archiv/COVID19_vaccination_doses_agegroups_xxxxxxxx.csv"

seq(from=as.Date("2021-10-29"),to=Sys.Date(),by="1 day") |>
  as.character() |>
  str_remove_all("-") ->
  date_seq

str_replace(file_template,"xxxxxxxx",date_seq) ->
  files_to_get

map(files_to_get,
    \(f)read_csv2(f)
    ) |>
  reduce(bind_rows) ->
  vac_age_doses_data 

Another approach is to store the list of csvs in a dataframe see example below.

library(tidyverse)
data.frame(co2_path = list.files(full.names=TRUE, pattern= ".csv")) %>%
mutate(co2_files = map(co2_path, read_csv))%>%
unnest(coa_files)%>%
filter('State Name' %in% c ("Michigan","Ohio", "Indiana", "Pennsylvania"))

This way, you will have one object to conduct analysis on. It will help you if you need to check your code later for one reason or the other.

Hi, for big files I use data.table (fread) and furrr (future_map_dfr) packages. Something like this:

library(data.table)
library(furrr)

csv_files_path <- list.files(path = "your_folder_path", pattern = "*.csv", full.names = TRUE)

plan(multisession)
df <- future_map_dfr(csv_files_path, ~fread(.x))

Does it help?

Javi