Simplest way to modify the same column in multiple dataframes in a list

Hello, hello,

while working on a personal project of mine, I confronted the following problem: I have a list_of_dataframes containing multiple dataframes, which have columns with the same names and same classes, except for one column (called m in my example below). Because of I/O issues (Excel blues...), m has different classes in different dataframes. My goal is to merge list_of_dataframes to a single dataframe using bind_rows(). To do that, I need to convert the m column of each dataframe to the same class (integer), otherwise bind_rows(list_of_dataframes) fails.

Question: which is the simplest/more readable way to convert the same variable in multiple dataframes to the same class? I show my solution below using dplyr and purrr: it's not bad, but I wonder if there's a simpler way. Note: the solution should work with a list_of_dataframes of arbitrary length, and of course with an arbitrary column name.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)

# generate sample data
n <- 100
df1 <- data.frame(x = runif(n), y = rnorm(n), m = seq_len(n))
df2 <- data.frame(x = runif(n), y = rnorm(n), m = as.character(seq_len(n)))
df3 <- data.frame(x = runif(n), y = rnorm(n), m = as.factor(seq_len(n)))
list_of_dataframes <- list(df1 = df1, df2 = df2, df3 = df3)

# directly merging the databases would fail, because column m has a different class in
# different dataframes
# dataframes <- bind_rows(list_of_dataframes) # NOT RUN

# Thus, we need to convert all columns named m to the same class
convert_all_columns_to_integer <- function(list_of_dataframes, varname){
  
  varnames <- rep_len(varname, length(list_of_dataframes))
  
  convert_to_integer <- function(dataframe, variable){
    dataframe[[variable]] <- as.integer(dataframe[[variable]])
    return(dataframe)
  }
  
  list_of_dataframes <- map2(list_of_dataframes, varnames, convert_to_integer)
  return(list_of_dataframes)
}

list_of_dataframes <- convert_all_columns_to_integer(list_of_dataframes, "m")

# now binding will be successful
dataframes <- bind_rows(list_of_dataframes)

Created on 2018-08-23 by the reprex package (v0.2.0).

1 Like

You can do this with purrr::map_dfr (which performs the bind_rows for you) and a mutate call within the map_dfr function. Looks like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)

# generate sample data
n <- 100
df1 <- data.frame(x = runif(n), y = rnorm(n), m = seq_len(n))
df2 <- data.frame(x = runif(n), y = rnorm(n), m = as.character(seq_len(n)))
df3 <- data.frame(x = runif(n), y = rnorm(n), m = as.factor(seq_len(n)))
list_of_dataframes <- list(df1 = df1, df2 = df2, df3 = df3)


new_df <- map_dfr(list_of_dataframes, ~{
  .x %>% 
    mutate(m = as.integer(m))
})

as_tibble(new_df)
#> # A tibble: 300 x 3
#>        x       y     m
#>    <dbl>   <dbl> <int>
#>  1 0.417  0.717      1
#>  2 0.716 -0.244      2
#>  3 0.446 -1.27       3
#>  4 0.268  1.97       4
#>  5 0.123 -0.110      5
#>  6 0.788 -0.413      6
#>  7 0.446 -0.832      7
#>  8 0.483 -1.35       8
#>  9 0.846  0.0452     9
#> 10 0.149  0.611     10
#> # ... with 290 more rows

Created on 2018-08-23 by the reprex package (v0.2.0).

5 Likes

Really nice! The only issue is that the variable name m is hard-coded, while I wanted it to be an argument. However, I guess a bit of tidyeval should fix that. I just want to double check that I understood this syntax correctly:

~{
  .x %>% 
    mutate(m = as.integer(m))
}

This is an anonymous function, right? I'm not sure about .x. Is this a reference to the element of list_of_dataframes which map_dfr is passing to the anonymous function?

2 Likes

Yes, your interpretation of that bit of code is exactly correct, as far as I understand it! The ~{.x} notation is handy when you want to write more complicated anonymous functions but don't want to separate it into its own function. If you use map2 you would use .x for the first argument and .y for the second.

You could do it with tidyeval like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

library(purrr)

# generate sample data
n <- 100
df1 <- data.frame(x = runif(n), y = rnorm(n), m = seq_len(n))
df2 <- data.frame(x = runif(n), y = rnorm(n), m = as.character(seq_len(n)))
df3 <- data.frame(x = runif(n), y = rnorm(n), m = as.factor(seq_len(n)))
list_of_dataframes <- list(df1 = df1, df2 = df2, df3 = df3)

my_func <- function(data, my_col){
  my_col <- enexpr(my_col)
  
  output <- data %>% 
    mutate(!!my_col := as.integer(!!my_col))
}

new_df <- map_dfr(list_of_dataframes, ~my_func(.x, m))

as_tibble(new_df)
#> # A tibble: 300 x 3
#>         x      y     m
#>     <dbl>  <dbl> <int>
#>  1 0.0728  0.652     1
#>  2 0.0534 -1.26      2
#>  3 0.735  -1.05      3
#>  4 0.305  -0.245     4
#>  5 0.746  -0.362     5
#>  6 0.101  -0.615     6
#>  7 0.0868  0.255     7
#>  8 0.865  -0.523     8
#>  9 0.818  -1.29      9
#> 10 0.0190 -1.28     10
#> # ... with 290 more rows

check_df <- map_dfr(list_of_dataframes, ~{
  .x %>% 
    mutate(m = as.integer(m))
})

identical(new_df, check_df)
#> [1] TRUE

Created on 2018-08-23 by the reprex package (v0.2.0).

8 Likes