multiple datasets with for loop

xDAlexa · December 29, 2018, 6:38pm

I want to import 3 csv datasets using a for loop.
This is what i did already:
setwd("C:/Users/.../")

DATEIEN <- list.files(path="C:/Users/.../", pattern="*.csv")

for (i in 1:length(DATEIEN)){

DATEN <- read.table(file=paste0("C:/Users/.../", DATEIEN[i]),
fill = TRUE,
header=FALSE,
sep=",",
dec = ".",
stringsAsFactors = TRUE)

print(DATEIEN[i])
}

My problem is, that the environment only shows one dataset a as data called "DATEN". But i want 3 datasets called e.g. DATEN1, DATEN2, DATEN3.
How can i cahnge my loop to get this?

stewart.ross · December 30, 2018, 11:15am

It is simpler if you don't use a for loop but instead use one of the *apply functions to generate a list with all three files within it. That way you don't have to create three separate variables in your global environment when there is no need to do so.

As we don't have access to your data I need to set up some test data for a reproducible example:

df = read.table(text = 
                  "a,b,c
                   1,2,3
                   4,5,6", 
                header = TRUE, 
                sep = ",")

for(i in 1:3){
  write.csv(df, paste0("test", i, ".csv"), row.names = FALSE)
}

Now that there are some csv files created these can be read in one step using an anonymous function within sapply, a variant of lapply which I've used to retain the csv file names as the names of the individual list elements.

Note that I have left out the path components you've listed, as the stated path is specific to your circumstances and not directly reproducible. You would have to put the path arguments back in to make the example apply in your own case. I have also read in the csv files without using their headers, to keep the code as close to your example as possible.

DATEIEN <- list.files(pattern="*.csv")                

DATEN <- sapply(DATEIEN, 
                function(x)read.table(file=paste0(x),
                                      fill = TRUE,
                                      header=FALSE,
                                      sep=",",
                                      dec = ".",
                                      stringsAsFactors = TRUE), 
                USE.NAMES = TRUE, 
                simplify = FALSE)

Your DATEN variable is then a list with three elements, each of which is named as per the name of the csv file that was read in:

length(DATEN)
[1] 3
DATEN
$test1.csv
  V1 V2 V3
1  a  b  c
2  1  2  3
3  4  5  6

$test2.csv
  V1 V2 V3
1  a  b  c
2  1  2  3
3  4  5  6

$test3.csv
  V1 V2 V3
1  a  b  c
2  1  2  3
3  4  5  6

As with all lists, you can refer to the individual elements by element number:

DATEN[[1]]
  V1 V2 V3
1  a  b  c
2  1  2  3
3  4  5  6

In summary, you don't need to create separate variables to store the contents of each csv file. Instead, use a single list variable to do so. The elements of the list can be accessed individually for further processing.

xDAlexa · January 3, 2019, 7:52pm

Hi, thank you so much for your answer! I applied your solution to my data and it worked well.
It would be very kind if you can answer one more question:
How can i refer to all three DATEN files at the same time for further processing, e.g. to remove column V3? Or to calculate V2 - V1 and put the result in a new column V4?

cderv · January 3, 2019, 10:10pm

Here is another approach not using for loop, but using tidyverse tools.
These will allow you to continue processing further.

I took the example back for previous answer

df = read.table(text = 
                  "a,b,c
                   1,2,3
                   4,5,6", 
                header = TRUE, 
                sep = ",")

for(i in 1:3){
  write.csv(df, file.path(tempdir(), paste0("test", i, ".csv")), row.names = FALSE)
}

Using tidyverse

library(tidyverse)
#> Warning: le package 'purrr' a été compilé avec la version R 3.5.2

# get the file path
DATEIEN <- list.files(tempdir(), pattern = "test\\d\\.csv$", full.names = TRUE)
# use a name list for row bind later
DATEIEN <- set_names(DATEIEN, paste0("DATEN", 1:3))

DATEN <- DATEIEN %>%
  # read in loop and row bind
  map_dfr(read_csv, .id = 'dataset')
#> Parsed with column specification:
#> cols(
#>   a = col_double(),
#>   b = col_double(),
#>   c = col_double()
#> )
#> Parsed with column specification:
#> cols(
#>   a = col_double(),
#>   b = col_double(),
#>   c = col_double()
#> )
#> Parsed with column specification:
#> cols(
#>   a = col_double(),
#>   b = col_double(),
#>   c = col_double()
#> )
DATEN
#> # A tibble: 6 x 4
#>   dataset     a     b     c
#>   <chr>   <dbl> <dbl> <dbl>
#> 1 DATEN1      1     2     3
#> 2 DATEN1      4     5     6
#> 3 DATEN2      1     2     3
#> 4 DATEN2      4     5     6
#> 5 DATEN3      1     2     3
#> 6 DATEN3      4     5     6

you’ll then be able to apply some operation, by dataset using group_by

# sum of all column for each DATENi
DATEN %>%
  group_by(dataset) %>%
  summarise_all(sum)
#> # A tibble: 3 x 4
#>   dataset     a     b     c
#>   <chr>   <dbl> <dbl> <dbl>
#> 1 DATEN1      5     7     9
#> 2 DATEN2      5     7     9
#> 3 DATEN3      5     7     9

# remove c, and do b - a
DATEN %>%
  select(-c) %>%
  mutate(d = b - a)
#> # A tibble: 6 x 4
#>   dataset     a     b     d
#>   <chr>   <dbl> <dbl> <dbl>
#> 1 DATEN1      1     2     1
#> 2 DATEN1      4     5     1
#> 3 DATEN2      1     2     1
#> 4 DATEN2      4     5     1
#> 5 DATEN3      1     2     1
#> 6 DATEN3      4     5     1

you also can continue using list and purrr, i.e not row bind to a data.frame

DATEN_list <- DATEIEN %>%
  map(read_csv, col_types = "iii")
str(DATEN_list, give.attr = FALSE)
#> List of 3
#>  $ DATEN1:Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':   2 obs. of  3 variables:
#>   ..$ a: int [1:2] 1 4
#>   ..$ b: int [1:2] 2 5
#>   ..$ c: int [1:2] 3 6
#>  $ DATEN2:Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':   2 obs. of  3 variables:
#>   ..$ a: int [1:2] 1 4
#>   ..$ b: int [1:2] 2 5
#>   ..$ c: int [1:2] 3 6
#>  $ DATEN3:Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':   2 obs. of  3 variables:
#>   ..$ a: int [1:2] 1 4
#>   ..$ b: int [1:2] 2 5
#>   ..$ c: int [1:2] 3 6

# sum of all columns for each DATENi
DATEN_list %>%
  map(summarise_all, sum)
#> $DATEN1
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     5     7     9
#> 
#> $DATEN2
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     5     7     9
#> 
#> $DATEN3
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     5     7     9

# remove c, and do b - a
DATEN_list %>%
  map(~ select(.x, - c) %>% 
        mutate(d = b - a)
      )
#> $DATEN1
#> # A tibble: 2 x 3
#>       a     b     d
#>   <int> <int> <int>
#> 1     1     2     1
#> 2     4     5     1
#> 
#> $DATEN2
#> # A tibble: 2 x 3
#>       a     b     d
#>   <int> <int> <int>
#> 1     1     2     1
#> 2     4     5     1
#> 
#> $DATEN3
#> # A tibble: 2 x 3
#>       a     b     d
#>   <int> <int> <int>
#> 1     1     2     1
#> 2     4     5     1

^{Created on 2019-01-03 by the reprex package (v0.2.1)}

system · January 24, 2019, 10:10pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.