Using case_when() over multiple columns

bragks · October 29, 2018, 9:48am

library(dplyr, warn = FALSE)

set.seed(1)

values <- c("1+1", "1+2", "1+3", "1+4", "1+5", "1+6")
Df <- tibble(ab = sample(values, 10, replace = TRUE), 
             ba = sample(values, 10, replace = TRUE), 
             cd = sample(values, 10, replace = TRUE),
             dc = sample(values, 10, replace = TRUE), 
             de = sample(values, 10, replace = TRUE),
             ed = sample(values, 10, replace = TRUE))

Df
#> # A tibble: 10 x 6
#>    ab    ba    cd    dc    de    ed   
#>    <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 1+2   1+2   1+6   1+3   1+5   1+3  
#>  2 1+3   1+2   1+2   1+4   1+4   1+6  
#>  3 1+4   1+5   1+4   1+3   1+5   1+3  
#>  4 1+6   1+3   1+1   1+2   1+4   1+2  
#>  5 1+2   1+5   1+2   1+5   1+4   1+1  
#>  6 1+6   1+3   1+3   1+5   1+5   1+1  
#>  7 1+6   1+5   1+1   1+5   1+1   1+2  
#>  8 1+4   1+6   1+3   1+1   1+3   1+4  
#>  9 1+4   1+3   1+6   1+5   1+5   1+4  
#> 10 1+1   1+5   1+3   1+3   1+5   1+3

^{Created on 2018-10-29 by the reprex package (v0.2.1)}

I need to change values in some data like this based on logical conditions. I could use a separate case_when for each separate column, but the real dataset has alot of columns so I'd prefer not to...The output should be numeric, example for column ab:

Df %>% 
  select(ab) %>% 
  mutate(ab = case_when(
    ab == "1+1" | ab == "1+2" ~ 1,
    ab == "1+3" | ab == "1+4" ~ 2,
    ab == "1+5" | ab == "1+6" ~ 3
  ))

^{Created on 2018-10-29 by the reprex package (v0.2.1)}

I guess this is a job for purrr (or some scoped dplyr verb??), but I can't really figure out where the case_when() would fit in this context, especially since the logicals are different because of different column names. Any help is greatly appreciated!

rensa · October 29, 2018, 10:26am

Hey there, @bragks! Thankfully, you don't need to use case_when() separately for each column, and you don't need to use purrr to iterate over the columns. Instead, as you suspected, you can use the scoped variants of mutate: mutate_if() and mutate_at()

mutate_if() lets you mutate all columns that return TRUE for a predicate function that you supply;
mutate_at() lets you specify the columns you want to mutate using helpers like starts_with() or one_of(); and
mutate_all() lets you mutate... all columns

When you use these scoped variants, you wrap the mutation function inside funs(), and then you have access to . to specify the original column. For example:

Df %>% 
  mutate_at(
    vars(one_of('ab', 'de')),
    funs(case_when(
    . == "1+1" | . == "1+2" ~ 1,
    . == "1+3" | . == "1+4" ~ 2,
    . == "1+5" | . == "1+6" ~ 3)))
#> # A tibble: 10 x 6
#>       ab ba    cd    dc       de ed   
#>    <dbl> <chr> <chr> <chr> <dbl> <chr>
#>  1     1 1+2   1+6   1+3       3 1+3  
#>  2     2 1+2   1+2   1+4       2 1+6  
#>  3     2 1+5   1+4   1+3       3 1+3  
#>  4     3 1+3   1+1   1+2       2 1+2  
#>  5     1 1+5   1+2   1+5       2 1+1  
#>  6     3 1+3   1+3   1+5       3 1+1  
#>  7     3 1+5   1+1   1+5       1 1+2  
#>  8     2 1+6   1+3   1+1       2 1+4  
#>  9     2 1+3   1+6   1+5       3 1+4  
#> 10     1 1+5   1+3   1+3       3 1+3  

Df %>% 
  mutate_all(
    funs(case_when(
    . == "1+1" | . == "1+2" ~ 1,
    . == "1+3" | . == "1+4" ~ 2,
    . == "1+5" | . == "1+6" ~ 3)))
#> # A tibble: 10 x 6
#>       ab    ba    cd    dc    de    ed
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1     1     3     2     3     2
#>  2     2     1     1     2     2     3
#>  3     2     3     2     2     3     2
#>  4     3     2     1     1     2     1
#>  5     1     3     1     3     2     1
#>  6     3     2     2     3     3     1
#>  7     3     3     1     3     1     1
#>  8     2     3     2     1     2     2
#>  9     2     2     3     3     3     2
#> 10     1     3     2     2     3     2

I hope that helps!

mishabalyasin · October 29, 2018, 10:29am

Just a heads-up about this: funs() are going to be deprecated in near future - soft deprecate funs() by romainfrancois · Pull Request #3930 · tidyverse/dplyr · GitHub. You can still use exactly the same approach, but your functions need to be wrapped into list() instead.

bragks · October 29, 2018, 12:24pm

Thank you, @rensa! That . keeps dropping out of my memory.

I see that I forgot one part of my question:
After changing the values from each column, I need to add a new column containing the column NAME of the max value(s) for each observation. E.g. from your tibble this would be c("cd", "de") for obs1 and "ab" for obs3. Any ideas? I suspect this might not be as straight forward your first solution.

rensa · October 29, 2018, 1:05pm

That's okay! Do you mean to say that you need a single new column containing the row-wise maxima (each element of the new column contains the maximum of that row)? If so, here's a bit of a wild way to do it (basically lifted from this SO answer, although I don't know whether I'm making things less safe by turning it into a one-liner with names(.)):

Df %>% 
  mutate_all(
    funs(case_when(
    . == "1+1" | . == "1+2" ~ 1,
    . == "1+3" | . == "1+4" ~ 2,
    . == "1+5" | . == "1+6" ~ 3))) %>%
mutate(rowmax = pmax(!!!rlang::syms(names(.))))
#> # A tibble: 10 x 7
#>       ab    ba    cd    dc    de    ed rowmax
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     1     1     3     2     3     2      3
#>  2     2     1     1     2     2     3      3
#>  3     2     3     2     2     3     2      3
#>  4     3     2     1     1     2     1      3
#>  5     1     3     1     3     2     1      3
#>  6     3     2     2     3     3     1      3
#>  7     3     3     1     3     1     1      3
#>  8     2     3     2     1     2     2      3
#>  9     2     2     3     3     3     2      3
#> 10     1     3     2     2     3     2      3

Ideally you'd replace rlang::syms(names(.)) with something else that turns the aforementioned scoping functions into a list that can be unspliced in the same way, so that you can use this even if you have other columns that you want to exclude. But my tidyeval knowledge isn't quiiiiiiite there yet

bragks · October 29, 2018, 2:29pm

Thank you again! Almost that, but not exactly, I had a look at that SO answer, but it's not exactly what I need. I don't know if this method is "correct", but I've almost manged to get what I need with dplyr, stringr and tidyr.

library(dplyr, warn = FALSE)
library(tidyr)
library(stringr)
set.seed(1)

values <- c("1+1", "1+2", "1+3", "1+4", "1+5", "1+6")
Df <- tibble(
             ab = sample(values, 10, replace = TRUE), 
             ba = sample(values, 10, replace = TRUE), 
             cd = sample(values, 10, replace = TRUE),
             dc = sample(values, 10, replace = TRUE), 
             de = sample(values, 10, replace = TRUE),
             ed = sample(values, 10, replace = TRUE))

Df <- Df %>% 
  mutate_all(
    funs(case_when(
      . == "1+1" | . == "1+2" ~ 1,
      . == "1+3" | . == "1+4" ~ 2,
      . == "1+5" | . == "1+6" ~ 3))) %>% 
  mutate(id = 1:10)


Df %>% gather(key = "varname", value = "values", ab:ed) %>% 
  mutate(has_letter = case_when(
    str_detect(varname, "a") ~ "has_a",
    str_detect(varname, "c") ~ "has_c",
    str_detect(varname, "e") ~ "has_e"
  )) %>% 
  group_by(id, has_letter) %>% 
  summarise(maxval = max(values)) %>% 
  spread(key = has_letter, value = maxval) %>% 
  #Is it possible optimize the following case_when so I don't have to write out all possible conditions? 
  #I need the NAME of the column with the highest value, if more than one, I can combine, e.g "ac", "ce".
  #The original dataset also has NA in some cols, so I would need a lot of different conditions with this method (I think...)
  mutate(maxcolname = case_when(
    has_a == has_c & has_a == has_e ~ "equal",
    has_a > has_c & has_a > has_e ~ "a"
  ))

I've added a few comments in the reprex where I'm stuck...

bragks · October 30, 2018, 12:37pm

I posted a simplified (and probably much clearer) version of my second question on SO if anyone is interested.

mara · October 30, 2018, 1:14pm

Looks like you got an answer there, as well:

system · November 6, 2018, 1:14pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.