Writing a reusable function for recoding survey questions


I usually work with Likert-type surveys whose answers have to be recoded into numerical data so that I can calculate dimension scores or total sums. I currently use case_when(), but I would like to write a function that adapts to questions with different numbers of possible answers and different associated numerical values.
Examples of the questions I work with:
Question 1
“How often do you have a drink containing alcohol?”
Never = 0
Monthly or less = 1
2-4 times a month = 2
2-3 times a week = 3
4 or more times a week = 4
Question 9
“Have you or someone else been injured because of your drinking?”
No = 0
Yes but not in the last year = 2
Yes, during the last year = 4
I have thought of a function that takes two vectors as arguments: the first with the qualitative values and the second with the corresponding numerical values.
Could you advise me on the best approach using the tidyverse to write this function?




I would recommend using recode(). A fully reusable function that covers a varying number of questions, answers and numerical values would probably save only a few lines of code while being more prone to bugs.


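library(dplyr)   # mutate(), recode(), %>%
library(tibble)  # tribble()

# Small example survey: one row per respondent
survey <- tribble(
  ~id, ~alcohol, ~injured,
  1, "Never", "No",
  2, "Monthly or less", "Yes but not in the last year",
  3, "2-4 times a month", "Yes",
  4, "2-3 times a week", "No"
)

# Recode the alcohol answers to their numerical scores
survey %>%
  mutate(alcohol = recode(
    alcohol,
    "Never" = 0,
    "Monthly or less" = 1,
    "2-4 times a month" = 2,
    "2-3 times a week" = 3
  ))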
#> # A tibble: 4 x 3
#>      id alcohol injured                     
#>   <dbl>   <dbl> <chr>                       
#> 1     1       0 No                          
#> 2     2       1 Yes but not in the last year
#> 3     3       2 Yes                         
#> 4     4       3 No

One thing you can do is to store the levels and their numerical equivalents in a named numeric vector and use the splice operator from rlang, !!!, to break the vector into named arguments for recode(), like this:

injured_levels <- c(
  "No" = 0,
  "Yes but not in the last year" = 2,
  "Yes" = 5
)

survey %>% 
  mutate(injured = recode(injured, !!!injured_levels))
#> # A tibble: 4 x 3
#>      id alcohol           injured
#>   <dbl> <chr>               <dbl>
#> 1     1 Never                   0
#> 2     2 Monthly or less         2
#> 3     3 2-4 times a month       5
#> 4     4 2-3 times a week        0

That's pretty compact, easily readable and easy to update and maintain.
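As a rough sketch of how this scales to several questions and a total score (the goal mentioned in the question), one named vector per question can be spliced into its own recode() call. The names alcohol_levels, scored and total are illustrative assumptions, and the numerical values are simply copied from the examples above.

```r
library(dplyr)
library(tibble)

survey <- tribble(
  ~id, ~alcohol, ~injured,
  1, "Never", "No",
  2, "Monthly or less", "Yes but not in the last year",
  3, "2-4 times a month", "Yes",
  4, "2-3 times a week", "No"
)

# One lookup vector per question: names are the answers, values the scores
alcohol_levels <- c(
  "Never" = 0,
  "Monthly or less" = 1,
  "2-4 times a month" = 2,
  "2-3 times a week" = 3,
  "4 or more times a week" = 4
)
injured_levels <- c(
  "No" = 0,
  "Yes but not in the last year" = 2,
  "Yes" = 5
)

# Recode both columns, then sum them into a total score
scored <- survey %>%
  mutate(
    alcohol = recode(alcohol, !!!alcohol_levels),
    injured = recode(injured, !!!injured_levels),
    total = alcohol + injured
  )

scored
```

Keeping the lookup vectors next to each other at the top of a script means a changed scoring scheme only has to be edited in one place.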

Created on 2019-02-06 by the reprex package (v0.2.1)



Thanks for your answer. It is really helpful.

Could I ask you to elaborate on your comment about the possible disadvantages of a reusable function?

What is the purpose of the splice operator?




Could I ask you to elaborate on your comment about the possible disadvantages of a reusable function?

It depends on how re-usable you want the function to be. If you want to create something for a fixed workflow to simply de-duplicate your work, then by all means go ahead.

But if you want to write a function that takes any data frame and any column with any number of recodings, then you'll basically just end up re-inventing mutate() and recode(). If this is the case, I'd recommend just calling mutate() and recode() as I described, for two reasons. First, dplyr is well-tested and well-written, so you can rely on community code being less buggy than your personal implementation. Second, by using standard functions and workflows, your code will be easier to read and debug in the future, and easier to share with others.
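For the narrower "fixed workflow" case, a thin wrapper over recode() is probably all that is needed. The sketch below follows the two-vector idea from the question; recode_likert, labels and scores are hypothetical names chosen for illustration, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: `labels` and `scores` are parallel vectors,
# so labels[i] is recoded to scores[i].
recode_likert <- function(x, labels, scores) {
  stopifnot(length(labels) == length(scores))
  lookup <- setNames(scores, labels)  # named vector: answer -> score
  recode(x, !!!lookup)                # recode() still does the real work
}

alcohol <- c("Never", "Monthly or less", "2-4 times a month", "2-3 times a week")

alcohol_scores <- recode_likert(
  alcohol,
  labels = c("Never", "Monthly or less", "2-4 times a month",
             "2-3 times a week", "4 or more times a week"),
  scores = 0:4
)

alcohol_scores
```

Because the helper only builds the named vector and hands it to recode(), it inherits recode()'s behaviour for answers that are not listed, rather than re-implementing it.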

What is the purpose of the splice operator?

In this case, it's the same as writing

recode(injured, "No" = 0, "Yes but not in the last year" = 2, "Yes" = 5)

Although I have to admit that splicing is a bit of magic that I don't completely understand, in functions that use tidy evaluation, !!! typically unpacks a named list or vector so that each element becomes a separate named argument of the call.
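A quick way to convince yourself of that equivalence is to run both forms on the same input and compare the results. This is only a small check on a plain character vector (the variable names spliced and written_out are made up for the example), using the same values as above.

```r
library(dplyr)

injured <- c("No", "Yes but not in the last year", "Yes", "No")

injured_levels <- c(
  "No" = 0,
  "Yes but not in the last year" = 2,
  "Yes" = 5
)

# Spliced form: each element of the named vector becomes a named argument
spliced <- recode(injured, !!!injured_levels)

# Same call with the arguments written out by hand
written_out <- recode(
  injured,
  "No" = 0,
  "Yes but not in the last year" = 2,
  "Yes" = 5
)

# Compare the two results element by element
all(spliced == written_out)
```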



Thanks, again. A very clear explanation. I hope we get more opinions.

