Passing in column name as a function parameter into map function

jarvis · March 1, 2019, 3:51pm

I'm new to packages like purrr and rlang. I'm trying to write a function that takes a dataset, a column name to group by, and a column name to get quantiles for. This is what I have so far:

library(tidyverse)

create_quantile_dfs <- function(data, group_col, metric_col, quantile_vector = c(0.01, seq(.05, .95, .05), .99)) {

  # get the number of groups
  # (nrow() on the distinct values; length() of a data frame counts columns, not groups)
  num_variants <- data %>% distinct(!! group_col) %>% nrow()
  
  df_quantiles <- data %>%
    # a sort of groupby that allows functional programming on the other column???
    nest(- !!group_col) %>%
    # get the quantiles then revert back to the dataframe we're used to
    mutate(quantiles = map(data, ~ quantile(.$conc, na.rm=TRUE,
                                            probs = quantile_vector)),
           quantiles = map(quantiles, ~ bind_rows(.) %>% gather())) %>%
    unnest(quantiles)
  
  # label quantile values with the tau value
  quantile_key <- as.character(quantile_vector)
  df_quantiles$quantile_key <- rep(quantile_key, num_variants)
  
  return(df_quantiles)
}

create_quantile_dfs(CO2, quo(Treatment), quo(conc))

But I can't find a way to get rid of the explicit column name in the map call, map(data, ~ quantile(.$conc, ...)). I'd like to use the function parameter metric_col instead. I don't understand what quo and !! do under the hood well enough to see why they won't play well with .$. Please help if possible, and thank you!

joels · March 1, 2019, 4:21pm

This looks like a case where you'd want to use group_by (rather than map) to operate by group.

In the code below:*

The ... allows you to enter any number of grouping columns (or none) rather than just one.
The calls to enquo and enquos are how you capture unevaluated arguments in the tidyeval system. These arguments are later evaluated with the !! for a single argument (e.g., !!value.col) or !!! for multiple arguments (e.g., !!!group.cols). enquos(...) captures all of the grouping variables as a list of quosures.
The quantiles are generated by group within summarise.
Within summarise:
- quantile(!!value.col, probs=probs) generates the quantiles for whatever column was entered as value.col.
- enframe converts the named vector returned by quantile to a data frame, and its name and value arguments let us choose the names of the two resulting columns.
- The whole thing is wrapped in list which results in a nested data frame. Then we unnest to return the final desired data frame.
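
To see the capture-and-unquote pattern from the list above in isolation, here is a minimal sketch, separate from the full function that follows. The function name mean_by and the output column mean_value are made up for illustration; only enquo, enquos, !! and !!! come from the explanation above.

```r
library(dplyr)

# Hypothetical helper, for illustration only:
# mean of one column, grouped by any number of columns (or none).
mean_by <- function(data, value_col, ...) {
  value_col  <- enquo(value_col)  # capture a single unevaluated column name
  group_cols <- enquos(...)       # capture zero or more grouping columns as a list

  data %>%
    group_by(!!!group_cols) %>%   # splice the list of captured columns
    summarise(mean_value = mean(!!value_col, na.rm = TRUE))  # unquote the single column
}

mean_by(mtcars, mpg, cyl)  # one row per value of cyl
mean_by(mtcars, mpg)       # no grouping: a single overall mean
```
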

quantiles_by_group = function(data, value.col, ..., probs=c(0.01, seq(.05, .95, .05), .99)) {
  
  value.col=enquo(value.col)
  group.cols=enquos(...)
  
  data %>% 
    group_by(!!!group.cols) %>% 
    summarise(!!value.col := list(enframe(quantile(!!value.col, probs=probs), name="quantile", value=quo_text(value.col)))) %>% 
    unnest
}

quantiles_by_group(CO2, conc, Treatment)
quantiles_by_group(CO2, conc) # No grouping variables
quantiles_by_group(CO2, conc, Treatment, Type, Plant, probs=c(0.25, 0.75)) # Multiple grouping variables
quantiles_by_group(mtcars, mpg, cyl)
quantiles_by_group(iris, Petal.Width, Species)

When working with tidy evaluation, I usually feel like I'm walking around blindfolded, so I can't guarantee that this approach is the "right" way to do it, but at least it works.
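
If you would rather keep the map-based structure from the question, one option is to turn the captured column into a string and index with [[ instead of $. The reason .$ and !! don't mix is that !! is only understood by functions that support tidy evaluation (such as the dplyr verbs), whereas .$conc inside a purrr lambda is ordinary R code, and $ always takes its right-hand side literally. The sketch below is an illustration under assumptions, not part of the function above: the name quantiles_via_map and the output columns quantile and value are invented here, and it uses split() rather than nest().

```r
library(dplyr)
library(purrr)
library(rlang)

# Hypothetical helper: per-group quantiles of one column, using split + map.
quantiles_via_map <- function(data, group_col, metric_col,
                              probs = c(0.25, 0.5, 0.75)) {
  # Convert the captured bare column names to plain strings
  group_name  <- as_name(enquo(group_col))
  metric_name <- as_name(enquo(metric_col))

  data %>%
    split(.[[group_name]]) %>%   # one data frame per group
    map_dfr(
      # [[ accepts a string, so no unquoting is needed inside the lambda
      ~ tibble(
        quantile = probs,
        value    = unname(quantile(.x[[metric_name]], probs = probs, na.rm = TRUE))
      ),
      .id = group_name           # label each block of rows with its group
    )
}

quantiles_via_map(CO2, Treatment, conc)
```

Note that with this approach the grouping column comes back as character, because it is rebuilt from the names of the split list.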

The function can be generalized further to summarize all numeric columns using summarise_if instead of summarise. Note below that to extract the name of each numeric column we use quo_text(quo(.)). I originally thought the appropriate incantation would be quo_text(enquo(.)), but due to my limited understanding of tidyeval, I'm not sure why one works and the other doesn't.

quantiles_by_group2 = function(data, ..., probs=c(0.25, 0.75)) {
  
  group.cols=enquos(...)
  
  data %>% 
    group_by(!!!group.cols) %>% 
    # Get quantiles for all numeric columns
    summarise_if(is.numeric, 
                 funs(
                   list(
                     enframe(
                       quantile(., probs=probs), 
                       name="quantile", 
                       value=quo_text(quo(.))
                     )
                   )
                 )
    ) %>% 
    unnest %>% 
    # Remove the repeated quantile columns
    select(-matches("quantile."))
}

quantiles_by_group2(iris, Species)

  Species    quantile Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>      <chr>           <dbl>       <dbl>        <dbl>       <dbl>
1 setosa     25%              4.8         3.2          1.4          0.2
2 setosa     75%              5.2         3.68         1.58         0.3
3 versicolor 25%              5.6         2.52         4            1.2
4 versicolor 75%              6.3         3            4.6          1.5
5 virginica  25%              6.22        2.8          5.1          1.8
6 virginica  75%              6.9         3.18         5.88         2.3
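
Since funs() has been deprecated and summarise_if() superseded in later dplyr releases, a similar result can probably be obtained with across(). The following is only a sketch, assuming dplyr 1.1.0 or later (for reframe()); the function name quantiles_by_group3 is made up here, and the quantile labels are built by hand rather than taken from the names that quantile() returns.

```r
library(dplyr)

# Hypothetical variant: quantiles of every numeric column, by group.
# Assumes dplyr >= 1.1.0, where reframe() allows several rows per group.
quantiles_by_group3 <- function(data, ..., probs = c(0.25, 0.75)) {
  data %>%
    group_by(...) %>%
    reframe(
      # One label per requested probability, e.g. "25%"
      quantile = paste0(probs * 100, "%"),
      # Apply quantile() to each non-grouping numeric column
      across(where(is.numeric),
             ~ unname(stats::quantile(.x, probs = probs, na.rm = TRUE)))
    )
}

quantiles_by_group3(iris, Species)
```

Because across() returns one column per input column, there are no repeated quantile columns to remove afterwards.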

* Which I've adapted from an answer I wrote on Stack Overflow a while back.

jarvis · March 1, 2019, 4:32pm

Awesome! I love how you explained the different parts of what your function does. I need to read up on these packages to feel more comfortable going forward haha... thanks again for your help!
