referring to variables by name in selecting functions

There was a change in tidyselect some time ago regarding using variables in selecting statements (see: Using an external vector in selections is ambiguous).

The old method to select a column using a variable was to use:
some_df %>% select(var_col_name)

The new approach is to use:
some_df %>% select(all_of(var_col_name))

I used to convert variable column names to static names with something like:
some_df %>% select(new_col_name = var_col_name)

What is the correct way to rename variable columns if I can't refer to them directly anymore?

1 Like

Just to clarify, is var_col_name in your example a data variable in df or a character vector of variable names?

suppressPackageStartupMessages({
  library(dplyr)
})

# still good
mtcars %>% select(mpg)
#>                      mpg
#> Mazda RX4           21.0
#> Mazda RX4 Wag       21.0
#> Datsun 710          22.8
#> Hornet 4 Drive      21.4
#> Hornet Sportabout   18.7
#> Valiant             18.1
#> Duster 360          14.3
#> Merc 240D           24.4
#> Merc 230            22.8
#> Merc 280            19.2
#> Merc 280C           17.8
#> Merc 450SE          16.4
#> Merc 450SL          17.3
#> Merc 450SLC         15.2
#> Cadillac Fleetwood  10.4
#> Lincoln Continental 10.4
#> Chrysler Imperial   14.7
#> Fiat 128            32.4
#> Honda Civic         30.4
#> Toyota Corolla      33.9
#> Toyota Corona       21.5
#> Dodge Challenger    15.5
#> AMC Javelin         15.2
#> Camaro Z28          13.3
#> Pontiac Firebird    19.2
#> Fiat X1-9           27.3
#> Porsche 914-2       26.0
#> Lotus Europa        30.4
#> Ford Pantera L      15.8
#> Ferrari Dino        19.7
#> Maserati Bora       15.0
#> Volvo 142E          21.4

# needs quotation
mtcars %>% select(all_of("mpg"))
#>                      mpg
#> Mazda RX4           21.0
#> Mazda RX4 Wag       21.0
#> Datsun 710          22.8
#> Hornet 4 Drive      21.4
#> Hornet Sportabout   18.7
#> Valiant             18.1
#> Duster 360          14.3
#> Merc 240D           24.4
#> Merc 230            22.8
#> Merc 280            19.2
#> Merc 280C           17.8
#> Merc 450SE          16.4
#> Merc 450SL          17.3
#> Merc 450SLC         15.2
#> Cadillac Fleetwood  10.4
#> Lincoln Continental 10.4
#> Chrysler Imperial   14.7
#> Fiat 128            32.4
#> Honda Civic         30.4
#> Toyota Corolla      33.9
#> Toyota Corona       21.5
#> Dodge Challenger    15.5
#> AMC Javelin         15.2
#> Camaro Z28          13.3
#> Pontiac Firebird    19.2
#> Fiat X1-9           27.3
#> Porsche 914-2       26.0
#> Lotus Europa        30.4
#> Ford Pantera L      15.8
#> Ferrari Dino        19.7
#> Maserati Bora       15.0
#> Volvo 142E          21.4
                  
# old way
mtcars[,"mpg"]
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
           
# will work

mtcars %>% select(mileage = mpg)      
#>                     mileage
#> Mazda RX4              21.0
#> Mazda RX4 Wag          21.0
#> Datsun 710             22.8
#> Hornet 4 Drive         21.4
#> Hornet Sportabout      18.7
#> Valiant                18.1
#> Duster 360             14.3
#> Merc 240D              24.4
#> Merc 230               22.8
#> Merc 280               19.2
#> Merc 280C              17.8
#> Merc 450SE             16.4
#> Merc 450SL             17.3
#> Merc 450SLC            15.2
#> Cadillac Fleetwood     10.4
#> Lincoln Continental    10.4
#> Chrysler Imperial      14.7
#> Fiat 128               32.4
#> Honda Civic            30.4
#> Toyota Corolla         33.9
#> Toyota Corona          21.5
#> Dodge Challenger       15.5
#> AMC Javelin            15.2
#> Camaro Z28             13.3
#> Pontiac Firebird       19.2
#> Fiat X1-9              27.3
#> Porsche 914-2          26.0
#> Lotus Europa           30.4
#> Ford Pantera L         15.8
#> Ferrari Dino           19.7
#> Maserati Bora          15.0
#> Volvo 142E             21.4

That works, but what I was trying to do is:

var_col_name = "mpg"
mtcars %>% select(mileage = var_col_name)

That causes a warning now.

Hello!

I thought you needed the bang bang operator to do what you want to do:

var_col_name = "mpg"
mtcars %>% select(mileage = !!var_col_name)
suppressPackageStartupMessages({
  library(dplyr)
})
var_col_name = "mpg"
mtcars %>% select(mileage = var_col_name)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(var_col_name)` instead of `var_col_name` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#>                     mileage
#> Mazda RX4              21.0
#> Mazda RX4 Wag          21.0
#> Datsun 710             22.8
#> Hornet 4 Drive         21.4
#> Hornet Sportabout      18.7
#> Valiant                18.1
#> Duster 360             14.3
#> Merc 240D              24.4
#> Merc 230               22.8
#> Merc 280               19.2
#> Merc 280C              17.8
#> Merc 450SE             16.4
#> Merc 450SL             17.3
#> Merc 450SLC            15.2
#> Cadillac Fleetwood     10.4
#> Lincoln Continental    10.4
#> Chrysler Imperial      14.7
#> Fiat 128               32.4
#> Honda Civic            30.4
#> Toyota Corolla         33.9
#> Toyota Corona          21.5
#> Dodge Challenger       15.5
#> AMC Javelin            15.2
#> Camaro Z28             13.3
#> Pontiac Firebird       19.2
#> Fiat X1-9              27.3
#> Porsche 914-2          26.0
#> Lotus Europa           30.4
#> Ford Pantera L         15.8
#> Ferrari Dino           19.7
#> Maserati Bora          15.0
#> Volvo 142E             21.4

You're right! That works without any extra messages/warnings.

It seems curly-curly also works:

mtcars %>% select(mileage = {{var_col_name}})

If you have string inputs, as was the case in this example, apparently you should be using the .data pronoun (source):

mtcars %>% select(mileage = .data[[var_col_name]])

I just realized that you were changing the name of the column in the select() function. I have to admit that I would never do this in select() because that is why you have mutate() and rename(). However, I would only do this if I was creating a final table and no more changes to the data were taking place but the users needed different column names. Nice to know that this can be done.

Since you brought up the .data pronoun, do you know the difference between .data and .? The above could have been written as:

mtcars %>% select(mileage = .[[var_col_name]])

.data takes the data.frame from the previous pipe, where as . takes the original data set that begins the pipe. This matters if you have several filters, mutations, and summaries as you pipe your data along. Just an FYI if you see the . in code.

1 Like

This understanding is incorrect. . refers to the object on the LHS of the "current stage" of the pipe. It does not refer to the original data set at the start of the pipeline. You can verify this with the following code.

library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.0.4

# If . referred to the original data.frame, this code should work. But it does not.
iris %>% 
  select(-Species) %>% 
  rename(Species_new = .[[Species]])
#> Error: object 'Species' not found

Created on 2021-03-21 by the reprex package (v1.0.0)

.data is a pronoun; a special construct which is used to disambiguate between data variables and environment variables. See this post for a detailed write-up on this topic.

2 Likes

I was trying to simplify the process (one command instead of two). The warning happens either way, though.

I think you are confusing how tidyverse selects variables with non-standard evaluation and the how base R's subsetting operator []. This operator takes an index or a character value, the column name. Please refer to AdvancedR for a much more thorough explanation.

Your example fails because the operator searches the global environment for an object names Species, not in the dataframe that is passed along. Using your example, lets see what happens when you supply the argument that it expects, and the issue of the .:

When you pass a character to the [] operator using the ., it does what you would "not" expect since you de-selected the Species column. That is because it references the original data frame.

> iris %>% 
+   select(-Species) %>% 
+   rename(Species_new = .[["Species"]]) %>% head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4
> 

When you pass a character to the [] operator using the .data, it does what you "would" expect since you de-selected the Species column. That is because it references the piped data frame.

> iris %>% 
+   select(-Species) %>% 
+   rename(Species_new = .data[["Species"]]) %>% head()
Error: Column `Species` not found in `.data`
Run `rlang::last_error()` to see where the error occurred.

If we add an object named Species <- "Species", your code would then work even though you de-selected it in the pipeline. That can be quite dangerous if some of you objects have the same name as your column names.

> Species <- "Species"
> iris %>% 
+   select(-Species) %>% 
+   rename(Species_new = .[[Species]]) %>% head()
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

I hope this helps. I know when I first started using R, I thought the same, that .data and . where the same thing. But then I started getting results that I did not expect and with further research and experience I discovered the above.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.