Accessing Newly created variables in a dplyr function

dplyr

#1

I am writing a function to calculate odds ratios for a table of counts that requires NSE evaluation with dplyr and tidyr:

OR_tab <- function(dat, strat, grp, decision ){
  strat <- enquo(strat)
  grp <- enquo(grp)
  decision <- enquo(decision)

  tab <- dat %>% count(!!strat, !!grp, !!decision) %>% unite(cat, c(!!grp,
   !!decision)) %>%  spread(cat, n, fill = 0)
  nm <- names(tab)[2:5]
  tab %>% mutate(OR = (tab$`!!`(nm[1]) * tab$`!!`(nm[4])) / (tab$`!!`(nm[2]) * (tab$`!!`(nm[3])))) %>%   print(n = Inf)
}

For example, with a dataframe ‘selections’:

# A tibble: 4 x 3
   strata      group         select
    <chr>    <chr>          <chr>
1 Manager A_Group    Chosen
2  Worker   A_Group    Chosen
3 Manager B_Group    Not_Chosen
4  Worker   B_Group    Chosen
5 ...

calling my function with `OR_tab(selections, strata, group, select)``` yields:

# A tibble: 2 x 6
  strata  A_Group_Chosen A_Group_Not_Chosen B_Group_Chosen B_Group_Not_Chosen    OR
  <chr>            <dbl>              <dbl>          <dbl>              <dbl> <dbl>
1 Manager         1.00               9.00           1.00               3.00 0.333
2 Worker            1.00              11.0            1.00               3.00 0.273

Question: Is there a better way to capture the new variable names created with tidyr’s unite and spread functions? Grabbing the names and using the back tick stuff seems really kludgy.

BTW, I posted this unsuccessfully on Stack Overflow (https://stackoverflow.com/questions/47063003/accessing-new-variables-in-a-dplyr-function). Thanks.


#2

Hi @jbannon, This will be much easier to troubleshoot if you can turn it into reprex (short for minimal reproducible example). (If for no other reason than it being helpful for following the formatting visually).

If you’ve never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don’ts, and the post I’m linking to below are also useful.

Thanks!


#3
library(tidyverse)
#> -- Attaching packages ---------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 2.2.1     v purrr   0.2.4
#> v tibble  1.4.2     v dplyr   0.7.4
#> v tidyr   0.8.0     v stringr 1.3.0
#> v readr   1.1.1     v forcats 0.2.0
#> -- Conflicts ------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
options(tibble.width = Inf)

OR_tab <- function(dat, strat, grp, decision ){
  strat <- enquo(strat)
  grp <- enquo(grp)
  decision <- enquo(decision)
  tab <- dat %>% count(!!strat, !!grp, !!decision) %>% unite(cat, c(!!grp, !!decision)) %>% 
    spread(cat, n, fill = 0)
  nm <- names(tab)[2:5]
  tab <- tab %>% mutate(OR = (tab$`!!`(nm[1]) * tab$`!!`(nm[4])) / (tab$`!!`(nm[2]) * (tab$`!!`(nm[3]))))
}

df <- structure(list(strata = c("Manager", "Worker", "Manager", "Manager", 
                                "Worker", "Manager", "Manager", "Manager", "Worker", "Worker", 
                                "Worker", "Worker", "Worker", "Worker", "Manager", "Worker", 
                                "Worker", "Manager", "Manager", "Manager", "Worker", "Worker", 
                                "Manager", "Manager", "Manager", "Manager", "Worker", "Worker", 
                                "Worker", "Worker"), group = c("A_Group", "A_Group", "A_Group", 
                                "A_Group", "B_Group", "A_Group", "B_Group", "A_Group", "A_Group", 
                                "A_Group", "A_Group", "A_Group", "B_Group", "B_Group", "A_Group", 
                                "A_Group", "A_Group", "A_Group", "A_Group", "B_Group", "A_Group", 
                                "A_Group", "B_Group", "B_Group", "A_Group", "A_Group", "B_Group", 
                                "A_Group", "A_Group", "A_Group"), select = c("Chosen", "Chosen", 
                                "Not_Chosen", "Not_Chosen", "Not_Chosen", "Not_Chosen", "Not_Chosen", 
                                "Not_Chosen", "Not_Chosen", "Not_Chosen", "Not_Chosen", "Not_Chosen", 
                                "Not_Chosen", "Not_Chosen", "Not_Chosen", "Not_Chosen", "Not_Chosen", 
                                "Not_Chosen", "Not_Chosen", "Not_Chosen", "Not_Chosen", "Not_Chosen", 
                                "Not_Chosen", "Chosen", "Not_Chosen", "Not_Chosen", "Chosen", 
                                "Not_Chosen", "Not_Chosen", "Not_Chosen")), 
                .Names = c("strata", "group", "select"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L))

result <- OR_tab(df,strata, group, select)
result
#> # A tibble: 2 x 6
#>   strata  A_Group_Chosen A_Group_Not_Chosen B_Group_Chosen
#>   <chr>            <dbl>              <dbl>          <dbl>
#> 1 Manager           1.00               9.00           1.00
#> 2 Worker            1.00              11.0            1.00
#>   B_Group_Not_Chosen    OR
#>                <dbl> <dbl>
#> 1               3.00 0.333
#> 2               3.00 0.273
Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  tz       America/New_York            
#>  date     2018-02-22
#> Packages -----------------------------------------------------------------
#>  package    * version date       source        
#>  assertthat   0.2.0   2017-04-11 CRAN (R 3.4.1)
#>  backports    1.1.1   2017-09-25 CRAN (R 3.4.1)
#>  base       * 3.4.3   2017-12-06 local         
#>  bindr        0.1     2016-11-13 CRAN (R 3.4.1)
#>  bindrcpp   * 0.2     2017-06-17 CRAN (R 3.4.1)
#>  broom        0.4.3   2017-11-20 CRAN (R 3.4.3)
#>  cellranger   1.1.0   2016-07-27 CRAN (R 3.4.1)
#>  cli          1.0.0   2017-11-05 CRAN (R 3.4.2)
#>  colorspace   1.3-2   2016-12-14 CRAN (R 3.4.1)
#>  compiler     3.4.3   2017-12-06 local         
#>  crayon       1.3.4   2017-09-16 CRAN (R 3.4.1)
#>  datasets   * 3.4.3   2017-12-06 local         
#>  devtools     1.13.5  2018-02-18 CRAN (R 3.4.3)
#>  digest       0.6.12  2017-01-27 CRAN (R 3.4.1)
#>  dplyr      * 0.7.4   2017-09-28 CRAN (R 3.4.2)
#>  evaluate     0.10.1  2017-06-24 CRAN (R 3.4.1)
#>  forcats    * 0.2.0   2017-01-23 CRAN (R 3.4.1)
#>  foreign      0.8-69  2017-06-22 CRAN (R 3.4.3)
#>  ggplot2    * 2.2.1   2016-12-30 CRAN (R 3.4.1)
#>  glue         1.2.0   2017-10-29 CRAN (R 3.4.3)
#>  graphics   * 3.4.3   2017-12-06 local         
#>  grDevices  * 3.4.3   2017-12-06 local         
#>  grid         3.4.3   2017-12-06 local         
#>  gtable       0.2.0   2016-02-26 CRAN (R 3.4.1)
#>  haven        1.1.1   2018-01-18 CRAN (R 3.4.3)
#>  hms          0.3     2016-11-22 CRAN (R 3.4.1)
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.4.1)
#>  httr         1.3.1   2017-08-20 CRAN (R 3.4.1)
#>  jsonlite     1.5     2017-06-01 CRAN (R 3.4.1)
#>  knitr        1.20    2018-02-20 CRAN (R 3.4.3)
#>  lattice      0.20-35 2017-03-25 CRAN (R 3.4.3)
#>  lazyeval     0.2.1   2017-10-29 CRAN (R 3.4.2)
#>  lubridate    1.7.2   2018-02-06 CRAN (R 3.4.3)
#>  magrittr     1.5     2014-11-22 CRAN (R 3.4.1)
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.4.1)
#>  methods    * 3.4.3   2017-12-06 local         
#>  mnormt       1.5-5   2016-10-15 CRAN (R 3.4.1)
#>  modelr       0.1.1   2017-07-24 CRAN (R 3.4.1)
#>  munsell      0.4.3   2016-02-13 CRAN (R 3.4.1)
#>  nlme         3.1-131 2017-02-06 CRAN (R 3.4.3)
#>  parallel     3.4.3   2017-12-06 local         
#>  pillar       1.1.0   2018-01-14 CRAN (R 3.4.3)
#>  pkgconfig    2.0.1   2017-03-21 CRAN (R 3.4.1)
#>  plyr         1.8.4   2016-06-08 CRAN (R 3.4.1)
#>  psych        1.7.8   2017-09-09 CRAN (R 3.4.1)
#>  purrr      * 0.2.4   2017-10-18 CRAN (R 3.4.2)
#>  R6           2.2.2   2017-06-17 CRAN (R 3.4.1)
#>  Rcpp         0.12.15 2018-01-20 CRAN (R 3.4.3)
#>  readr      * 1.1.1   2017-05-16 CRAN (R 3.4.1)
#>  readxl       1.0.0   2017-04-18 CRAN (R 3.4.1)
#>  reshape2     1.4.3   2017-12-11 CRAN (R 3.4.3)
#>  rlang        0.2.0   2018-02-20 CRAN (R 3.4.3)
#>  rmarkdown    1.8     2017-11-17 CRAN (R 3.4.3)
#>  rprojroot    1.2     2017-01-16 CRAN (R 3.4.1)
#>  rstudioapi   0.7     2017-09-07 CRAN (R 3.4.1)
#>  rvest        0.3.2   2016-06-17 CRAN (R 3.4.1)
#>  scales       0.5.0   2017-08-24 CRAN (R 3.4.1)
#>  stats      * 3.4.3   2017-12-06 local         
#>  stringi      1.1.6   2017-11-17 CRAN (R 3.4.2)
#>  stringr    * 1.3.0   2018-02-19 CRAN (R 3.4.3)
#>  tibble     * 1.4.2   2018-01-22 CRAN (R 3.4.3)
#>  tidyr      * 0.8.0   2018-01-29 CRAN (R 3.4.3)
#>  tidyselect   0.2.3   2017-11-06 CRAN (R 3.4.2)
#>  tidyverse  * 1.2.1   2017-11-14 CRAN (R 3.4.3)
#>  tools        3.4.3   2017-12-06 local         
#>  utf8         1.1.3   2018-01-03 CRAN (R 3.4.3)
#>  utils      * 3.4.3   2017-12-06 local         
#>  withr        2.0.0   2017-07-28 CRAN (R 3.4.1)
#>  xml2         1.2.0   2018-01-24 CRAN (R 3.4.3)
#>  yaml         2.1.16  2017-12-12 CRAN (R 3.4.3)

#4

You can avoid the need for unquoting after spread by doing the odds-ratio calculation on the long data:

OR_tab2 <- function(dat, strat, grp, decision ){

  strat <- enquo(strat)
  grp <- enquo(grp)
  decision <- enquo(decision)
  
  dat %>% 
    count(!!strat, !!grp, !!decision) %>% 
    group_by(!!strat) %>% 
    mutate(OR = (n[1]*n[4])/(n[2]*n[3])) %>% 
    unite(cat, c(!!grp, !!decision)) %>% 
    spread(cat, n)
    
}

OR_tab2(df,strata, group, select)

##   strata     OR A_Group_Chosen A_Group_Not_Chosen B_Group_Chosen B_Group_Not_Chosen
##  <chr>   <dbl>          <int>              <int>          <int>              <int>
## 1 Manager 0.333              1                  9              1                  3
## 2 Worker  0.273              1                 11              1                  3

As with your original code, this will work for any data frame where the group and select arguments each have only two levels, but which pairs of levels are in the numerator or denominator will depend on the ordering of the levels in each column. For example, note that for the recoded data frame df2 below, the odds ratios are inverted relative to the those of the original data frame df.

df2 = df %>% 
  mutate(experimental_groups = recode(group, 
                        "A_Group"="Control",
                        "B_Group"="Treatment"),
         flavor = recode(select, 
                         "Chosen"="Vanilla",
                         "Not_Chosen"="Chocolate"))

OR_tab2(df2, strata, experimental_groups, flavor)

##   strata     OR Control_Chocolate Control_Vanilla Treatment_Chocolate Treatment_Vanilla
##   <chr>   <dbl>             <int>           <int>               <int>             <int>
## 1 Manager  3.00                 9               1                   3                 1
## 2 Worker   3.67                11               1                   3                 1

OR_tab(df2, strata, experimental_groups, flavor)

##   strata  Control_Chocolate Control_Vanilla Treatment_Chocolate Treatment_Vanilla    OR
##   <chr>               <dbl>           <dbl>               <dbl>             <dbl> <dbl>
## 1 Manager                9.              1.                  3.                1.  3.00
## 2 Worker                11.              1.                  3.                1.  3.67

#5

Thanks for the re-write, @joels. It improves my function considerably. I added "%>% ungroup()" so the function returns a clean tibble.

Your note about the ordering of the decision variable determining which version of the OR is calculated is interesting. I'm thinking that as ordinary character variables, the ordering is alphabetical. Perhaps I should consider factors with explicit ordering.


#6

Yes, by default the ordering will be alphabetic unless you convert to factors with a custom order.