Extracting columns from many datasets into one main data frame. Implement the operation in a FOR-loop

Hi, all.

I would like to make seperate datasets for the years 2015, 2016 and 2017. The way I did it for one year is like the following:

Merging datasets:

df_2017<- data.frame(cars$2017, bikes$2017, motorcycle$2017, trucks$2017, cars_p$2017,
bikes_p$2017, motorcycle_p$2017, truck_p$2017)

cars, bikes, motorcycle etc are other small dataframes that I would like to extract information for only one particular year and gather the information from all the different datasets into one. So the gathered data.frame for 2017 would be df_2017. The name of the columns are the years (2015, 2016, 2017)

Does anyone know how I can implement this in a FOR-loop (or by vectorization) so I dont have to write the code over and over again?

Thanks!

You could use purrr:map_dfc() to achieve this.

Here I'm illustrating on 3 toy data frames a, b and c. Note that naming each element of df_list before calling map_dfc() ensures that each output column's name matches the tibble from where that column was extracted.

library(tidyverse)

a <- tibble(`2017` = seq(1, 10), `2018` = seq(11, 20))
b <- tibble(`2017` = seq(21, 30), `2018` = seq(31, 40))
c <- tibble(`2017` = seq(41, 50), `2018` = seq(51, 60))

df_list <- list(a = a, b = b, c = c)

map_dfc(df_list, ~ .x$`2017`)
#> # A tibble: 10 x 3
#>        a     b     c
#>    <int> <int> <int>
#>  1     1    21    41
#>  2     2    22    42
#>  3     3    23    43
#>  4     4    24    44
#>  5     5    25    45
#>  6     6    26    46
#>  7     7    27    47
#>  8     8    28    48
#>  9     9    29    49
#> 10    10    30    50

Created on 2020-04-21 by the reprex package (v0.3.0)

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

Thanks a lot! I will try it out!

Do you know how I can loop this sequence so that I make a tibble for each year?

I tried:

for (i in 2013:2017) {
df_i <- map_dfc(df_list, ~ .x$i)
}

But it didn't work.

Thanks

That worked perfectly! Thanks a lot!

Continuing the same example provided by Siddharth, you can do something like this:

library(purrr)
library(tibble)

a <- tibble(`2017` = 1:10, `2018` = 11:20)
b <- tibble(`2017` = 21:30, `2018` = 31:40)
c <- tibble(`2017` = 41:50, `2018` = 51:60)

df_list <- list(a = a, b = b, c = c)
year_list <- c(2017, 2018)

walk(.x = year_list,
     .f = function(year)
     {
         assign(x = paste("df", year,
                          sep = "_"),
                value = map_dfc(.x = df_list,
                                .f = ~ .x[[as.character(x = year)]]),
                envir = .GlobalEnv)
     })

df_2017
#> # A tibble: 10 x 3
#>        a     b     c
#>    <int> <int> <int>
#>  1     1    21    41
#>  2     2    22    42
#>  3     3    23    43
#>  4     4    24    44
#>  5     5    25    45
#>  6     6    26    46
#>  7     7    27    47
#>  8     8    28    48
#>  9     9    29    49
#> 10    10    30    50

df_2018
#> # A tibble: 10 x 3
#>        a     b     c
#>    <int> <int> <int>
#>  1    11    31    51
#>  2    12    32    52
#>  3    13    33    53
#>  4    14    34    54
#>  5    15    35    55
#>  6    16    36    56
#>  7    17    37    57
#>  8    18    38    58
#>  9    19    39    59
#> 10    20    40    60

There are two reasons for which your code fails.

First one is that df_i won't recognise the year, it'll simply use a variable df_i for all year, so data.frame for each year will replace the previous one. You don't want to do that.

Second, your code tries to access the column with name literally i, and not it's value. To use the value, `[[` is the better function than `$`. Also, you have to convert i to a character, as column names are character and not integer.

Hope this helps.

2 Likes

Okey, thanks! Especially on the comment on converting i to a character. That was VERY useful.
After doing that and going from $ to [[, my loop went through, but as you said it only made one data.frame. Is there a way to store the previous years?

An what should I do if I want to change the names of the columns in an effective way?