Case_when...why not?

Why, or when, would I use if/else instead of case_when?


Saying if/else is somewhat ambiguous, as there are three potential options: if_else, ifelse, and if(...) {} else {}. I'll work in that order, from most similar to least similar.

if_else(): Like case_when, this is vectorized -- the condition is evaluated for each element, and the output is assembled element by element from the two given output vectors. If your case_when statement takes on two potential values (and the second condition is of the form TRUE ~ ...), then it is interchangeable with if_else. In this case, go with if_else unless you believe that case_when is more readable, because (at least in basic testing) if_else is faster (with a preview of part of why to avoid ifelse()):

suppressPackageStartupMessages(library(tidyverse))

microbenchmark::microbenchmark(
  case_when(1:1000 < 100 ~ "low", TRUE ~ "high"),
  if_else(1:1000 < 3, "low", "high"),
  ifelse(1:1000 < 3, "low", "high")
)
#> Unit: microseconds
#>                                            expr     min      lq     mean
#>  case_when(1:1000 < 100 ~ "low", TRUE ~ "high") 384.786 418.629 953.4921
#>              if_else(1:1000 < 3, "low", "high")  61.943  67.686 128.9811
#>               ifelse(1:1000 < 3, "low", "high") 256.797 264.796 391.7180
#>    median       uq       max neval
#>  631.9420 708.4480 33149.364   100
#>   90.0435 127.9885  2496.182   100
#>  327.9695 460.8810  2354.246   100

ifelse(): Not only is this slower than if_else (see above), but it also takes the attributes of its result from the test rather than from the TRUE and FALSE vectors, so it silently coerces mismatched types and can strip classes such as factor. The if_else documentation points this out:

suppressPackageStartupMessages(library(tidyverse))

# Unlike ifelse, if_else preserves types
x <- factor(sample(letters[1:5], 10, replace = TRUE))
ifelse(x %in% c("a", "b", "c"), x, factor(NA))
#>  [1] NA NA  2 NA  2 NA  1 NA  1 NA
if_else(x %in% c("a", "b", "c"), x, factor(NA))
#>  [1] <NA> <NA> c    <NA> c    <NA> b    <NA> b    <NA>
#> Levels: b c d e
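
The flip side of preserving types is that if_else is strict about them. As a rough sketch (the inputs here are made up for illustration, not taken from the documentation), mixing a character and a numeric output shows the difference: ifelse quietly coerces, while if_else is expected to stop with an error.

```r
library(dplyr)

# Base ifelse() silently coerces mixed outputs to a common type (here, character).
coerced <- ifelse(c(TRUE, FALSE), "a", 1)
coerced
#> [1] "a" "1"

# if_else() refuses to combine character and numeric outputs.
# The error is caught and replaced with a marker string so the script keeps running.
strict <- tryCatch(
  if_else(c(TRUE, FALSE), "a", 1),
  error = function(e) "error"
)
strict
#> [1] "error"
```

The exact wording of the error depends on the dplyr version, which is why the sketch only records that one occurred.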

if(cond) cons.expr else alt.expr: This has a completely different intent from ifelse and if_else, in that cond is treated as a scalar. In fact, if the length is greater than 1, older versions of R use only the first element (with a warning), and since R 4.2.0 it is an error. Because the condition is a single value, only one of the output expressions is evaluated, as you can see if you run the code:

if (FALSE) {Sys.sleep(10); print("Slow")} else print("Fast")
#> [1] "Fast"

(As an aside, if is just a function with some built-in alternative syntax, so x <- if (FALSE) {Sys.sleep(10); "Slow"} else "Fast" is valid code.)
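
If what you have is a vector but what you want is a single decision, one option is to collapse the vector to a scalar first. A minimal sketch, with a made-up vector x:

```r
x <- c(3, 12, 7)

# any() and all() each reduce a logical vector to a single TRUE/FALSE,
# which is what if () expects.
size_note <- if (any(x > 10)) "at least one large value" else "all values small"
size_note
#> [1] "at least one large value"

# isTRUE() is a defensive wrapper for conditions that might turn out to be NA.
sign_note <- if (isTRUE(all(x > 0))) "all positive" else "not all positive"
sign_note
#> [1] "all positive"
```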

case_when does not have the single-path evaluation of if, as both expressions will be evaluated regardless:

case_when(FALSE ~ {Sys.sleep(10); print("Slow")}, TRUE ~ print("Fast"))
#> [1] "Slow"
#> [1] "Fast"
#> [1] "Fast"
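
This matters beyond speed. As a sketch (x is a made-up vector), guarding a computation with a condition does not stop the computation from running on the elements the condition excludes, so warnings can still show up:

```r
library(dplyr)

x <- c(4, -9, 16)

# sqrt(x) is computed for the whole vector, including -9, before case_when
# picks values, so a "NaNs produced" warning is raised even though the
# negative element ends up as NA in the result.
warned <- FALSE
roots <- withCallingHandlers(
  case_when(x >= 0 ~ sqrt(x), TRUE ~ NA_real_),
  warning = function(w) {
    warned <<- TRUE                 # record that a warning happened
    invokeRestart("muffleWarning")  # then silence it
  }
)
roots
#> [1]  2 NA  4
warned
#> [1] TRUE
```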

In summary, if you are testing a scalar, use if(). If you are testing a vector against a single condition, use dplyr::if_else. If you are testing a vector against multiple conditions, use case_when.


ifelse/dplyr::if_else is more or less a vectorized if. Caveats: ifelse drops attributes, so

ifelse(TRUE, Sys.Date(), Sys.Date())
#> [1] 17482

returns the underlying number of days rather than a Date. if_else maintains some attributes, and so handles dates and factors better:

dplyr::if_else(TRUE, Sys.Date(), Sys.Date())
#> [1] "2017-11-12"

and is more type-safe, but still drops some attributes like dim:

dplyr::if_else(as.logical(diag(2)), diag(2), diag(2))
#> [1] 1 0 0 1

and gets very unhappy if you try to return a more complicated object like a model (unless wrapped in a list, anyway). Since if is not vectorized, it can return any object, which is helpful for working with objects more complicated than vectors:

if (TRUE) lm(mpg ~ wt, mtcars)
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>      37.285       -5.344

or just running arbitrary code depending on a condition:

flips <- 0

if (rnorm(1) > 0) {
    Sys.sleep(1)
    flips <- flips + 1
    'heads'
} else {
    Sys.sleep(1)
    flips <- flips + 1
    'tails'
}
#> [1] "tails"

flips
#> [1] 1

...but since if is not vectorized, using it to reproduce an ifelse call would require iterating over the elements, which is frequently not the best approach.
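
To make the iteration concrete, here is a small sketch (x and the labels are made up) that applies a scalar if to each element with vapply and compares the result to the vectorized call:

```r
x <- c(-2, 0, 3)

# Scalar if applied one element at a time; vapply() checks that each
# result is a single string.
sign_loop <- vapply(
  x,
  function(xi) if (xi > 0) "positive" else "non-positive",
  character(1)
)

# The vectorized equivalent handles the whole vector in one call.
sign_vec <- ifelse(x > 0, "positive", "non-positive")

identical(sign_loop, sign_vec)
#> [1] TRUE
```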

In practice, ifelse/if_else tends to be used a lot in dplyr code because you can't assign to a subset of a column inside a pipeline, so people write

library(dplyr)

mtcars %>% head() %>% mutate(mpg = if_else(mpg > 20, 20, mpg))
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 20.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 2 20.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 3 20.0   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 4 20.0   6  258 110 3.08 3.215 19.44  1  0    3    1
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

instead of

# Work on a copy so the built-in mtcars stays intact for the later examples
mtcars_head <- head(mtcars)
mtcars_head[mtcars_head$mpg > 20, 'mpg'] <- 20

mtcars_head
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         20.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     20.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        20.0   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    20.0   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

A lot of words have been written on the subject, but the tidyverse idiom has settled on if_else.

case_when can reproduce the behavior of if_else, but requires a condition for each return value. It's a lot more useful for its fallback evaluation, wherein the first condition that returns TRUE determines the return value selected. Before it existed, such cases were not infrequently handled by heinous nested ifelses:

mtcars %>% 
    mutate(mpg_level = ifelse(mpg < 15, 
                              'low',
                              ifelse(mpg < 20, 
                                     'medium-low',
                                     ifelse(mpg < 25, 
                                            'medium-high', 
                                            'high')))) %>%
    sample_n(6)
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb   mpg_level
#> 19 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2        high
#> 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 medium-high
#> 12 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3  medium-low
#> 31 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8  medium-low
#> 21 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1 medium-high
#> 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 medium-high

which can now be written as the more svelte

mtcars %>% 
    mutate(mpg_level = case_when(mpg < 15 ~ 'low',
                                 mpg < 20 ~ 'medium-low',
                                 mpg < 25 ~ 'medium-high',
                                 TRUE ~ 'high')) %>% 
    sample_n(6)
#>     mpg cyl  disp  hp drat    wt  qsec vs am gear carb  mpg_level
#> 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4        low
#> 16 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4        low
#> 30 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6 medium-low
#> 28 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2       high
#> 23 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2 medium-low
#> 11 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 medium-low

It's likely to be quite a bit less efficient than a findInterval approach, but it's more flexible and arguably easier to write.
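
For reference, a findInterval version of the same binning might look like the sketch below. The breaks and labels vectors are my own names; the thresholds are the ones from the case_when call.

```r
# Thresholds separating the four bins, matching the case_when conditions.
breaks <- c(15, 20, 25)
labels <- c("low", "medium-low", "medium-high", "high")

# findInterval() returns 0 for values below the first break, 1 for values in
# [15, 20), and so on, so adding 1 turns it into an index into labels.
mpg_level <- labels[findInterval(mtcars$mpg, breaks) + 1]

head(data.frame(mpg = mtcars$mpg, mpg_level), 3)
```

This only works because every condition is a contiguous range of a single variable; as soon as the conditions involve several columns or overlap in other ways, case_when is the easier tool.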


Because people got themselves in such knots with nested ifelses, I used to suggest rewriting such statements as slightly slower but logically much clearer multi-pass ifelses:

mtcars %>% 
  mutate(mpg_level = "high",
         mpg_level = ifelse(mpg < 25, "medium-high", mpg_level),
         mpg_level = ifelse(mpg < 20, "medium-low", mpg_level),
         mpg_level = ifelse(mpg < 15, "low", mpg_level)
  ) %>%
  sample_n(6)
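
One thing to watch with the multi-pass version is the order: the passes run from the loosest threshold to the tightest, the reverse of case_when, so that later passes overwrite earlier ones. A quick base R sketch (variable names are mine) checking that it agrees with the nested form:

```r
mpg <- mtcars$mpg

# Nested form: the first matching condition wins.
nested <- ifelse(mpg < 15, "low",
                 ifelse(mpg < 20, "medium-low",
                        ifelse(mpg < 25, "medium-high", "high")))

# Multi-pass form: start from the default, then overwrite with
# progressively tighter thresholds.
multi <- rep("high", length(mpg))
multi <- ifelse(mpg < 25, "medium-high", multi)
multi <- ifelse(mpg < 20, "medium-low", multi)
multi <- ifelse(mpg < 15, "low", multi)

identical(nested, multi)
#> [1] TRUE
```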

I've used this strategy for so long I am still adjusting to telling people to use case_when instead.