Using pipe inside a function

Hello,
I'm learning about how to use the pipe within a custom function.

I want to place this code inside a function:

library(tidyverse, quietly=TRUE)

data("iris")

iris_col_sd <- sd(iris$Petal.Length)
thres_min <- mean(iris$Petal.Length) - (1.5 * iris_col_sd)
thres_max <- mean(iris$Petal.Length) + (1.5 * iris_col_sd)
iris_outliers <- iris %>% 
  mutate(outliers = if_else((Petal.Length > thres_max | Petal.Length < thres_min),"outlier",""))

head(iris_outliers, 15)

But it's not working properly:

return_outliers <- function (df_input, column_input, multiplier_input) {
  iris_col_sd <- sd(df_input$column_input)
  thres_min <- mean(df_input$column_input) - (multiplier_input * iris_col_sd)
  thres_max <- mean(df_input$column_input) + (multiplier_input * iris_col_sd)
  
  df_input %>% 
    mutate(outliers = if_else(({{ column_input }} > thres_max | {{ column_input }} < thres_min), "outlier",""))
}

results <- return_outliers(iris, Petal.Length, 3)

head(results, 15)

I referenced this page:
https://dplyr.tidyverse.org/articles/programming.html

but I still don't understand what I'm doing wrong. I think the problem has to do with incorrect placement of the data-variable inputs within double braces?

Thanks

Let's start with the first line:

iris_col_sd <- sd(iris$Petal.Length)

becomes

iris_col_sd <- sd(df_input$column_input)

Here you might already notice a problem. Outside the function, you give the column name explicitly, without quotes, after the $. Inside the function it works the same way: you are telling R to look for a column literally called column_input, whereas what you really want is to tell R to look for a column whose name is the content of the variable column_input. That's actually easy to do with just base R:

df_input[[column_input]]

Note that I didn't put quotes around column_input: if column_input <- "Petal.Length", this is equivalent to writing:

df_input[["Petal.Length"]]
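To make the difference between $ and [[ ]] concrete, here is a small sketch using the built-in iris data frame. It assumes a plain data.frame (a tibble behaves slightly differently and warns about the unknown column).

```r
column_input <- "Petal.Length"

# `$` takes the name literally: it looks for a column called "column_input",
# which does not exist, so a plain data.frame gives back NULL
iris$column_input

# `[[ ]]` evaluates column_input first, then looks up the column "Petal.Length"
head(iris[[column_input]])

# Same column as writing the name out by hand
identical(iris[[column_input]], iris$Petal.Length)
```

This is why the sd() and mean() calls in the original function never see the petal lengths at all.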

The df_input on the other hand works well, the data frame given as input gets used inside the function:

print_df <- function(df_input){
  df_input
}
print_df(iris)

So we now know how to take the sd:

print_df_sd <- function(df_input, column_input){
  sd(df_input[[column_input]])
}
print_df_sd(iris, "Petal.Length")

This works, and it is the classic base R style of function. But, as you can see, when calling the function you need to provide the column name "Petal.Length" in quotes. In the tidyverse, many functions take column names unquoted. This relies on a special mechanism called non-standard evaluation. I would strongly recommend not trying to do that in your own functions: most of the time it just makes things a lot more complicated, for limited benefit. But if you really want to, see the "Tidy selection" section of your link. A simple way is to pass the column on to another dplyr function that already handles it:

print_df_sd <- function(df_input, ...){
  sd(pull(df_input, ...))
}
print_df_sd(iris, Petal.Length)

Or if you really want to go all the way (but don't ask me to explain):

print_df_sd <- function(df_input, column_input){
  quo_column_input <- rlang::enquo(column_input)
  sd(pull(df_input, !!quo_column_input))
}
print_df_sd(iris, Petal.Length)
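For what it's worth, here is a rough sketch of what those two steps do, as far as I understand it: enquo() captures the expression the caller typed instead of evaluating it, and !! hands that captured expression on to pull(). The function name show_capture is made up for this illustration, and rlang::as_label() is only used to display what was captured.

```r
library(dplyr)

# Hypothetical helper, only meant to show what enquo() captures
show_capture <- function(df_input, column_input) {
  quo_column_input <- rlang::enquo(column_input)

  # The quosure holds the expression the caller wrote, not the column values
  message("Captured: ", rlang::as_label(quo_column_input))

  # !! unquotes the captured expression so pull() sees the column name
  sd(pull(df_input, !!quo_column_input))
}

show_capture(iris, Petal.Length)
```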

So now we have a way to rewrite the first 3 lines. I would recommend using a standard string for the column name (but if you really want to, you can use the enquo pattern):

return_outliers_params <- function (df_input, column_input, multiplier_input) {
  iris_col_sd <- sd(df_input[[column_input]])
  thres_min <- mean(df_input[[column_input]]) - (multiplier_input * iris_col_sd)
  thres_max <- mean(df_input[[column_input]]) + (multiplier_input * iris_col_sd)
  
  list(iris_col_sd, thres_min, thres_max)
}

return_outliers_params(iris, "Petal.Length", 3)

So we're left with the mutate(). If we're passing the column name as a character, we need to use the pronoun .data as described in the "data masking" section of your link:

return_outliers <- function (df_input, column_input, multiplier_input) {
  iris_col_sd <- sd(df_input[[column_input]])
  thres_min <- mean(df_input[[column_input]]) - (multiplier_input * iris_col_sd)
  thres_max <- mean(df_input[[column_input]]) + (multiplier_input * iris_col_sd)
  
  df_input %>% 
    mutate(outliers = if_else( .data[[column_input]] > thres_max | .data[[column_input]] < thres_min,
                               "outlier",""))
}

return_outliers(iris, "Petal.Length", 1.5)

Now, if you're passing the column name unquoted, it gets more complicated. But because I'm lazy, I notice an interesting pattern: all these sd() and mean() calls use the contents of the same column. So it's easier to extract these contents once at the beginning, which is a lot less text:

return_outliers_params <- function (df_input, column_input, multiplier_input) {
  quo_column_input <- rlang::enquo(column_input)
  col_data <- pull(df_input, !!quo_column_input)
  
  col_sd <- sd(col_data)
  thres_min <- mean(col_data) - (multiplier_input * col_sd)
  thres_max <- mean(col_data) + (multiplier_input * col_sd)
  
  list(col_sd, thres_min, thres_max)
}

return_outliers_params(iris, Petal.Length, 3)

And we're left with the mutate(), and you got it perfectly right with the {{ }}!
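Putting the pieces together, a complete version could look like the sketch below. It assumes the enquo pattern from above for the thresholds and keeps your {{ }} inside mutate() unchanged; I also renamed iris_col_sd to col_sd since the function is not tied to iris.

```r
library(dplyr)

return_outliers <- function(df_input, column_input, multiplier_input) {
  # Extract the column contents once, from the unquoted column name
  quo_column_input <- rlang::enquo(column_input)
  col_data <- pull(df_input, !!quo_column_input)

  col_sd <- sd(col_data)
  thres_min <- mean(col_data) - (multiplier_input * col_sd)
  thres_max <- mean(col_data) + (multiplier_input * col_sd)

  # mutate() is a data-masking function, so {{ }} works here
  df_input %>%
    mutate(outliers = if_else(
      {{ column_input }} > thres_max | {{ column_input }} < thres_min,
      "outlier", ""
    ))
}

results <- return_outliers(iris, Petal.Length, 1.5)
head(results, 15)
```

With a multiplier of 1.5 this should flag the same rows as your original code outside the function.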

AlexisW,
Thank you so much for taking the time to break this down step-by-step! It enables me to actually understand how this works. Much appreciated!

What would that function look like if the curly-curly operator ({{ }}) were used instead of the two steps with enquo and !!quo?

The curly-curly operator only makes sense if it's used in a data-masking function (like mutate()) or in a tidy-selecting function (like select()), as explained in the Programming article.

Both mean() and sd() are standard base R functions, so I don't think you can use the double curly braces directly in them. Instead, you have to wrap them in a data-masking function:


return_outliers_params <- function (df_input, column_input, multiplier_input) {
  # summarize() needs the data frame as its first argument
  df_input %>% 
    summarize(col_sd = sd({{ column_input }}),
              thres_min = mean({{ column_input }}) - (multiplier_input * col_sd),
              thres_max = mean({{ column_input }}) + (multiplier_input * col_sd)
    )
}

return_outliers_params(iris, Petal.Length, 3)

EDIT -- Or maybe simpler, you can also use a selecting function such as pull(), and just replace the enquo and !! pair with {{ }}:

return_outliers_params <- function (df_input, column_input, multiplier_input) {
  col_data <- pull(df_input, {{column_input}})
  
  col_sd <- sd(col_data)
  thres_min <- mean(col_data) - (multiplier_input * col_sd)
  thres_max <- mean(col_data) + (multiplier_input * col_sd)
  
  list(col_sd, thres_min, thres_max)
}

return_outliers_params(iris, Petal.Length, 3)

Thank you very much indeed, exactly what I wanted.
