Select multiple variables in a data frame dynamically

Hello guys.
I am a bit lost here and cannot find the solution.
Please see the code below.
I have a data frame and I want to select specific columns.
Naturally this is dummy code, but in my actual problem, I get the variables from an operation, and they get stored in a variable.

How do I make the data frame "fastDummies_example" return only the columns named in the variable "variables_I_want"?

library(dplyr)

fastDummies_example <- data.frame(numbers = 1:3,
                                  gender  = c("male", "male", "female"),
                                  animals = c("dog", "dog", "cat"),
                                  owner   = c("Fernandes", "Eric", "Ivanov"),
                                  dates   = as.Date(c("2012-01-01", "2011-12-31",
                                                      "2012-01-01")),
                                  stringsAsFactors = FALSE)

fastDummies_example


variables_I_want <- c("animals", "owner")

fastDummies_example %>% select(colnames() %in% variables_I_want)

select() takes the column names directly, so your code should be as follows:

fastDummies_example %>% select(variables_I_want)

vinaychuri's solution will work. However, I would recommend using all_of() when subsetting a data frame using variable names stored as strings. This is more robust and avoids problems such as unexpected data masking.

library(dplyr, warn.conflicts = FALSE)

fastDummies_example <- data.frame(
  numbers = 1:3,
  gender = c("male", "male", "female"),
  animals = c("dog", "dog", "cat"),
  owner = c("Fernandes", "Eric", "Ivanov"),
  dates = as.Date(c("2012-01-01", "2011-12-31", "2012-01-01")),
  stringsAsFactors = FALSE
)

fastDummies_example
#>   numbers gender animals     owner      dates
#> 1       1   male     dog Fernandes 2012-01-01
#> 2       2   male     dog      Eric 2011-12-31
#> 3       3 female     cat    Ivanov 2012-01-01

variables_I_want <- c("animals", "owner")

select(fastDummies_example, all_of(variables_I_want))
#>   animals     owner
#> 1     dog Fernandes
#> 2     dog      Eric
#> 3     cat    Ivanov

Created on 2020-10-05 by the reprex package (v0.3.0)
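
Since the names in variables_I_want come from an earlier operation, it may also help to know what happens when one of them is not in the data. The sketch below is only an illustration and assumes dplyr 1.0.0 or later. The vector wanted_with_typo and the column name "colour" are made up for the example. As far as I know, all_of() is strict about missing names, while any_of() skips them.

```r
library(dplyr, warn.conflicts = FALSE)

fastDummies_example <- data.frame(
  numbers = 1:3,
  animals = c("dog", "dog", "cat"),
  owner = c("Fernandes", "Eric", "Ivanov"),
  stringsAsFactors = FALSE
)

# Hypothetical vector: "colour" is not a column in the data frame.
wanted_with_typo <- c("animals", "owner", "colour")

# all_of() is strict: it raises an error when any name is missing.
# The error is caught here so the script keeps running.
strict_result <- tryCatch(
  select(fastDummies_example, all_of(wanted_with_typo)),
  error = function(e) "error"
)

# any_of() is lenient: it skips names that are not present.
lenient_result <- select(fastDummies_example, any_of(wanted_with_typo))
names(lenient_result)
```

Which one to use depends on whether a missing column should stop the script or be ignored.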

Sure Anirban. I'll post here as the OP may find it useful.

The issue is that data variables always have priority and can end up masking environment variables if they have the same name. Consider the example below.

my_mtcars <- mtcars[1:4, ]

vars <- c("cyl", "am", "vs")

# This works (with a note).
dplyr::select(my_mtcars, vars)
#> Note: Using an external vector in selections is ambiguous.
#> i Use `all_of(vars)` instead of `vars` to silence this message.
#> i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#>                cyl am vs
#> Mazda RX4        6  1  0
#> Mazda RX4 Wag    6  1  0
#> Datsun 710       4  1  1
#> Hornet 4 Drive   6  0  1

# But let's say my_mtcars contains a column named vars.
my_mtcars$vars <- 1:4

# This gives a different result now because the data variable vars masks the
# environment variable vars.
dplyr::select(my_mtcars, vars)
#>                vars
#> Mazda RX4         1
#> Mazda RX4 Wag     2
#> Datsun 710        3
#> Hornet 4 Drive    4

# To disambiguate and force the environment variable, use all_of().
dplyr::select(my_mtcars, all_of(vars))
#>                cyl am vs
#> Mazda RX4        6  1  0
#> Mazda RX4 Wag    6  1  0
#> Datsun 710       4  1  1
#> Hornet 4 Drive   6  0  1

Created on 2020-10-06 by the reprex package (v0.3.0)

This ambiguity is usually not a problem in interactive data analysis, where you know what variables your data contains, but it matters a lot in package development, where you have no idea what variables will be present in the data a user passes in.
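
To make the package development case concrete, here is a minimal sketch of a function that selects columns by name. The helper keep_columns() and its arguments are hypothetical, made up for this illustration. The point is that the function cannot know whether the user's data has a column with the same name as its own argument.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: keep only the columns named in `cols`.
keep_columns <- function(data, cols) {
  # all_of() marks `cols` as a character vector from the function's
  # environment, so a data column called "cols" cannot mask it.
  select(data, all_of(cols))
}

my_mtcars <- mtcars[1:4, ]
my_mtcars$cols <- 1:4  # a column that shares its name with the argument

result <- keep_columns(my_mtcars, c("cyl", "am", "vs"))
names(result)
```

Without all_of(), select(data, cols) inside the function would pick the data column named cols instead of the columns the caller asked for.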

The tidyverse maintainers have indicated that this approach of supplying strings to selections without explicitly specifying whether you are referring to data or environment variables will be deprecated at some point in the future, so it is a good idea to start using all_of() in these situations.
