Iteration for several files

nchan08 · April 14, 2021, 4:37am

I am new to analyzing several files together. I have more than 1000 files. I wrote the following code:

files <- dir(".", pattern = ".csv$")  # To get the names of all csv files in current directory
for (i in 1:length(files)) {
  obj_name <- files %>% str_sub(end = -5)
  assign(obj_name[i], read_csv(files[i]))
}
# Concatenate the imported files into a list to manipulate them at once
command <- paste0("RawList <- list(", paste(obj_name, collapse = ","), ")")
eval(parse(text = command))
rm(i, obj_name, command, list = ls(pattern = "^a20"))
YMD <- files %>% str_sub(2, 9)

#To check for i== case 1
i <- 1
df <- RawList[[i]] %>% 
  pivot_longer(cols = -AA, names_to = "time_sec", values_to = "BB") %>% # change into the long format
  mutate(time_sec = paste(YMD[i], time_sec) %>% ymd_hms())%>% 
  mutate(minute = format(as.POSIXct(time_sec,format="%H:%M:%S"),"%M"))

#To run the whole dataset
Ref_com <- RawList
for (i in 1:length(RawList)) {
  df1 <- RawList[[i]] %>% 
    pivot_longer(cols = -AA, names_to = "time_sec", values_to = "BB") %>% # change into the long format
    mutate(time_sec = paste(YMD[i], time_sec) %>% ymd_hms())%>% 
    mutate(minute = format(as.POSIXct(time_sec,format="%H:%M:%S"),"%M")) 
}

After running this on the whole dataset, I only get the result for the last file.
Could you please advise where I need to change the code?
Thanks in advance!

cactusoxbird · April 14, 2021, 5:31am

It looks like the problem is that in your last for() loop you are overwriting the df1 object on every iteration, so after the loop it only holds the result for the last file. Something like this might help if you want a list as your output:

#To run the whole dataset
Ref_com <- RawList
df1 <- vector("list", length(RawList))  # one empty slot per file
for (i in seq_along(RawList)) {
  # store the result for file i in slot i instead of overwriting df1
  df1[[i]] <- RawList[[i]] %>% 
    pivot_longer(cols = -AA, names_to = "time_sec", values_to = "BB") %>% # change into the long format
    mutate(time_sec = paste(YMD[i], time_sec) %>% ymd_hms()) %>% 
    mutate(minute = format(as.POSIXct(time_sec,format="%H:%M:%S"),"%M")) 
}
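
As a sketch of the same idea without managing an index by hand, base R's Map() walks two inputs in parallel and returns a list with one element per input. The toy_list and toy_dates objects below are made-up stand-ins for RawList and YMD, not your data:

```r
# Hypothetical stand-ins for RawList and YMD
toy_list <- list(
  data.frame(AA = 1:2, value = c(10, 20)),
  data.frame(AA = 3:4, value = c(30, 40))
)
toy_dates <- c("20210413", "20210414")

# Map() calls the function once per (data frame, date) pair and
# collects every result, so nothing is overwritten
results <- Map(
  function(d, ymd) {
    d$date <- as.Date(ymd, format = "%Y%m%d")
    d
  },
  toy_list,
  toy_dates
)

length(results)        # one element per input
results[[2]]$date[1]   # date taken from the second entry of toy_dates
```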

However, I think you might be able to simplify your code a bit too. It's hard to say without knowing what your csv files look like, but here's an example of how you might iterate over many files more simply using the purrr package:

library(tidyverse)
# Work easily with dates and times
library(lubridate)

# This creates small example files to use for this demonstration
walk(.x = 1:100,
     .f = ~ tibble(AA = .x,
                   time_1 = Sys.time(),
                   time_2 = Sys.time()) %>%
       write_csv(x = .,
                 file = paste0("file_",
                               # Date part of the name
                               format(Sys.Date(), "%Y%m%d"),
                               "_",
                               .x,
                               ".csv")))

# Get list of all csv files
files <- dir("./", 
             # escape the dot so only names ending in ".csv" match
             pattern = "\\.csv$")

# Read the files into R as a list
file_list <- map(.x = files,
                 .f = ~ read_csv(.x))

# Perform the same operation on every file in the list
manipulated_files <- map2(.x = file_list,
                          # Create a list of dates to use
                          .y = str_sub(string = files, start = 6, end = 13),
                          .f = ~ .x %>%
                            pivot_longer(cols = -AA,
                                         names_to = "time_sec",
                                         values_to = "BB") %>%
                            mutate(date = ymd(.y),
                                   minute = minute(BB)))

# Example result:
manipulated_files[1]
#> [[1]]
#> # A tibble: 2 x 5
#>      AA time_sec BB                  date       minute
#>   <dbl> <chr>    <dttm>              <date>      <int>
#> 1     1 time_1   2021-04-14 05:28:31 2021-04-13     28
#> 2     1 time_2   2021-04-14 05:28:31 2021-04-13     28


# OR, if you'd like to create a single data frame from all of the files, you can use
# map2_df instead:
manipulated_files_df <- map2_df(.x = file_list,
                                # Create a list of dates to use
                                .y = str_sub(string = files, start = 6, end = 13),
                                .f = ~ .x %>%
                                  pivot_longer(cols = -AA,
                                               names_to = "time_sec",
                                               values_to = "BB") %>%
                                  mutate(date = ymd(.y),
                                         minute = minute(BB)))



# Example result:
head(manipulated_files_df)
#> # A tibble: 6 x 5
#>      AA time_sec BB                  date       minute
#>   <dbl> <chr>    <dttm>              <date>      <int>
#> 1     1 time_1   2021-04-14 05:28:31 2021-04-13     28
#> 2     1 time_2   2021-04-14 05:28:31 2021-04-13     28
#> 3    10 time_1   2021-04-14 05:28:31 2021-04-13     28
#> 4    10 time_2   2021-04-14 05:28:31 2021-04-13     28
#> 5   100 time_1   2021-04-14 05:28:31 2021-04-13     28
#> 6   100 time_2   2021-04-14 05:28:31 2021-04-13     28

Created on 2021-04-13 by the reprex package (v1.0.0)
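
If you also need to know which file each row came from, one option is to name the list before combining it. This is only a sketch under assumptions: the temporary directory, the file names, and the `source` column are invented for illustration and will not match your files.

```r
library(purrr)
library(readr)
library(dplyr)

# Write two small example files to a temporary directory (hypothetical data)
tmp <- file.path(tempdir(), "iteration_demo")
dir.create(tmp, showWarnings = FALSE)
write_csv(tibble(AA = 1:2, BB = c(0.5, 1.5)), file.path(tmp, "a20210413.csv"))
write_csv(tibble(AA = 3:4, BB = c(2.5, 3.5)), file.path(tmp, "a20210414.csv"))

# Full paths, so reading does not depend on the working directory
paths <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Name each path by its file name without the extension
names(paths) <- tools::file_path_sans_ext(basename(paths))

# Read every file, then stack them; `.id` stores the list names
combined <- paths %>%
  map(~ read_csv(.x, col_types = cols())) %>%
  bind_rows(.id = "source")

combined
```

Because the names travel with the list, the `source` column can be parsed afterwards (for example with str_sub()) instead of keeping a separate vector of dates in step with the list.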

nchan08 · April 14, 2021, 7:03am

Even though I used the list() function, the result is still being overwritten.

`summarise()` regrouping output by 'AA' (override with `.groups` argument)

When it runs, the message above appears.

Also, I can't use the following code:
df[[1]] <- RawList[[i]]
I think I need to rewrite the code.

cactusoxbird · April 14, 2021, 6:45pm

To store every object you iterate over, you need df[[i]], not df[[1]]. Using df[[1]] writes to position 1 of the list on every iteration, so each result replaces the previous one. Using i stores each result at its own index, so nothing is overwritten.
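
Here is a tiny sketch of the difference, using a made-up character vector in place of your files:

```r
inputs <- c("a", "b", "c")  # hypothetical stand-ins for the files

# Fixed index: every pass writes to slot 1, so the list never grows
fixed <- list()
for (i in seq_along(inputs)) {
  fixed[[1]] <- toupper(inputs[i])
}

# Loop index: pass i writes to slot i
by_index <- list()
for (i in seq_along(inputs)) {
  by_index[[i]] <- toupper(inputs[i])
}

length(fixed)      # 1
length(by_index)   # 3
```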

nchan08 · April 15, 2021, 5:24am

Thank you, @cactusoxbird,
When I use your code, the following error occurs:

Error: Assigned data `\`%>%\`(...)` must be compatible with existing data.
x Existing data has 743424 rows.
x Assigned data has 727040 rows.
i Only vectors of size 1 are recycled.

I don't know which part causes the error.
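
One possible cause, offered as an assumption because the thread never confirms it: if `df` still holds the data frame created by the earlier single-file check (`df <- RawList[[i]] %>% ...`), then `df[[i]] <- ...` tries to assign a column of that data frame instead of a list element, and the row counts have to match. A minimal base R sketch of that situation, with made-up objects:

```r
# `df` is still a data frame from an earlier step (hypothetical example)
df <- data.frame(x = 1:3)

# Assigning something with a different number of rows into df[[2]]
# is treated as adding a column, which fails
attempt <- tryCatch(
  {
    df[[2]] <- 1:2
    "ok"
  },
  error = function(e) "error"
)
attempt

# Starting from an empty list avoids this: each slot can hold
# an object of any size
out <- list()
out[[1]] <- data.frame(x = 1:3)
out[[2]] <- data.frame(x = 1:2)
length(out)
```

If that is what happened, re-creating the container with `df <- list()` (or using a different name for it) before the loop might be enough.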
