if_else not working as expected inside dplyr::mutate

tlg265 · September 16, 2019, 9:50am

You can copy the following code to an R script file and run it:

library(dplyr)

preprocess_brand_version = function(dataset) {
  dataset$brand_version = gsub("^([0-9]+)(\\.[0-9]+)?.*$", "\\1\\2", dataset$brand_version, perl = TRUE)
  dataset = dataset %>% mutate(
    brand_version = ifelse(!(is.na(brand) || is.na(brand_version)), paste(substr(brand, 1, 3), ", ", brand_version, sep = ""), NA)
  )
  dataset$brand_version = as.factor(dataset$brand_version)
  return (dataset)
}

a = data.frame(brand = c("Samsung", "Motorola"), brand_version = c("1.4.3", "6.3"))
b = a
b[1,2] = NA
a
b
preprocess_brand_version(b)

My problem is that when I run that, I get:

> a
     brand brand_version
1  Samsung         1.4.3
2 Motorola           6.3

> b
     brand brand_version
1  Samsung          <NA>
2 Motorola           6.3

> preprocess_brand_version(b)
     brand brand_version
1  Samsung          <NA>
2 Motorola          <NA>

I was expecting to get "Mot, 6.3" as the new brand_version value on the Motorola row.

Any idea why if_else is not working as I would expect?

Thanks!

valeri · September 16, 2019, 1:29pm

I think you need a very small modification here (replace || with |):

dataset = dataset %>% mutate(
    brand_version = ifelse(!(is.na(brand) | is.na(brand_version)), paste(substr(brand, 1, 3), ", ", brand_version, sep = ""), NA)
  )
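
For completeness, here is a sketch of the whole function with that one change applied, run on the data from the question. It assumes dplyr is installed; the name result is only there for illustration.

```r
library(dplyr)

preprocess_brand_version <- function(dataset) {
  # Keep only the leading "major.minor" part of the version string
  dataset$brand_version <- gsub("^([0-9]+)(\\.[0-9]+)?.*$", "\\1\\2",
                                dataset$brand_version, perl = TRUE)
  dataset <- dataset %>% mutate(
    # `|` is elementwise, so the test has one value per row
    brand_version = ifelse(!(is.na(brand) | is.na(brand_version)),
                           paste(substr(brand, 1, 3), ", ", brand_version, sep = ""),
                           NA)
  )
  dataset$brand_version <- as.factor(dataset$brand_version)
  dataset
}

b <- data.frame(brand = c("Samsung", "Motorola"),
                brand_version = c(NA, "6.3"))
result <- preprocess_brand_version(b)
result
# The Samsung row stays NA, the Motorola row becomes "Mot, 6.3"
```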

Yarnabrina · September 16, 2019, 2:09pm

Since the OP asked why this happens, and not only how to fix it, let me add a short explanation to @valeri's answer.

@tlg265, if you look at the documentation of the logical operators (help("Logic", package = "base")), you'll see the following:

& and && indicate logical AND and | and || indicate logical OR. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right examining only the first element of each vector. Evaluation proceeds only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in if clauses.

In a typical if statement, you want the condition to give a single TRUE or FALSE. With ifelse, however, you typically want test to be a vector with one element per row. If test has length one, ifelse returns a single value, the first element of yes or no depending on test, and that is exactly what happens here. That single value is then recycled to match the number of rows.

To illustrate, here's a small example:

> a <- data.frame(p = 1L:5L,
+                 q = 15L:11L,
+                 r = 6L:10L)
> within(data = a,
+        expr =
+          {
+            s1 = ifelse(test = (p > 3L) | (q > 11L), # checks every row
+                        yes = r,
+                        no = -r)
+            s2 = ifelse(test = (p > 3L) || (q > 11L), # checks only the 1st row
+                        yes = r,
+                        no = -r)
+          })
  p  q  r s2 s1
1 1 15  6  6  6
2 2 14  7  6  7
3 3 13  8  6  8
4 4 12  9  6  9
5 5 11 10  6 10
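
The same mechanism can be seen without a data frame. This is a minimal base R sketch; the vectors r and test are made up for illustration, and the length-one test is written out explicitly as test[1] rather than produced with ||.

```r
r <- 6:10
test <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

# Elementwise test: one result per element
elementwise <- ifelse(test, r, -r)
elementwise
# 6 -7 8 -9 10

# Length-one test: ifelse() returns a single value,
# taken from the first element of `yes`
single <- ifelse(test[1], r, -r)
single
# 6
length(single)
# 1
```

As far as I know, from R 4.3.0 onwards || and && raise an error when given a vector longer than one, so the mistake in the question would now fail loudly instead of silently recycling one value.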

A short note: if_else and ifelse are not the same. Your title says if_else, which is in dplyr and which you're not using at all. It has stricter type requirements than ifelse, which is in base: if_else forces true and false to be of the same type.
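
To make the type difference concrete, here is a small sketch, assuming dplyr is installed. The vector test and the name mixed are made up for illustration, and the exact error message depends on the dplyr version, so the error is only caught and replaced with a marker string.

```r
library(dplyr)

test <- c(TRUE, FALSE)

# base::ifelse() silently coerces mixed types to a common one
coerced <- ifelse(test, "yes", 0)
coerced
# "yes" "0"

# dplyr::if_else() refuses to mix character and numeric
mixed <- tryCatch(if_else(test, "yes", 0), error = function(e) "error")
mixed
# "error"

# With matching types (NA_character_ instead of a bare NA) it works
strict <- if_else(test, "yes", NA_character_)
strict
# "yes" NA
```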

tlg265 · September 16, 2019, 2:09pm

@valeri you were right!

tlg265 · September 16, 2019, 2:10pm

@Yarnabrina, thank you very much for the explanation!
