I have a logical, tmpFlag, defined outside of the tibble ttmp.
Could somebody explain why my first version of the ifelse() call gives the wrong output, and why adding rowwise() in the second version fixes it? The third version gives the correct output, as expected.
Thanks.
Ha
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.1.3
#> Warning: package 'ggplot2' was built under R version 4.1.3
#> Warning: package 'tibble' was built under R version 4.1.3
#> Warning: package 'tidyr' was built under R version 4.1.2
#> Warning: package 'readr' was built under R version 4.1.2
#> Warning: package 'purrr' was built under R version 4.1.2
#> Warning: package 'dplyr' was built under R version 4.1.3
#> Warning: package 'stringr' was built under R version 4.1.2
#> Warning: package 'forcats' was built under R version 4.1.2
#tidyverse: version 1.3.1
tmpFlag = FALSE
ttmp = tibble(x=c(1:4, NA))
# Wrong answer
ttmp %>%
  mutate(y = ifelse(tmpFlag, NA, x))
#> # A tibble: 5 x 2
#> x y
#> <int> <int>
#> 1 1 1
#> 2 2 1
#> 3 3 1
#> 4 4 1
#> 5 NA 1
# Correct answer by adding rowwise()
ttmp %>%
  rowwise() %>%
  mutate(y = ifelse(tmpFlag, NA, x))
#> # A tibble: 5 x 2
#> # Rowwise:
#> x y
#> <int> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 NA NA
# This has a correct output as expected
ttmp %>%
  mutate(Flag = tmpFlag,
         y = ifelse(Flag, NA, x))
#> # A tibble: 5 x 3
#> x Flag y
#> <int> <lgl> <int>
#> 1 1 FALSE 1
#> 2 2 FALSE 2
#> 3 3 FALSE 3
#> 4 4 FALSE 4
#> 5 NA FALSE NA
The seemingly unexpected result from your first code comes down to two things:
A. When you create a new column in a data frame/tibble and provide only a single value for it, that value is automatically repeated enough times to populate the entire column.
For example, to add a status column containing the word "new" to the mtcars dataset, you only need to supply "new" once; there is no need to repeat it for every row.
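The mtcars example described above is not shown in the post, so here is a minimal sketch of what it might look like. The object name cars_with_status is made up for illustration.

```r
library(dplyr)

# A single value supplied to mutate() is recycled to every row of the data.
cars_with_status <- mtcars %>%
  mutate(status = "new")

# The status column has one "new" per row, even though we typed it once.
head(cars_with_status$status)
length(cars_with_status$status) == nrow(mtcars)
```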
B. ifelse() is a vectorized function: it works element by element on the condition vector. More specifically, its result has the same length as the condition, so it produces one value per element of the condition, however long the yes and no arguments are.
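To see point B in isolation, here is a small base R sketch, outside of any tibble. The names one and all_rows are only for illustration.

```r
x <- c(1:4, NA)

# Length-1 condition: the result has length 1, so only x[1] is used.
one <- ifelse(FALSE, NA, x)
one
#> [1] 1

# Condition as long as x: one result per element.
all_rows <- ifelse(rep(FALSE, length(x)), NA, x)
all_rows
#> [1]  1  2  3  4 NA
```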
Here is your first code - the code which gives you the result you don't want:
ttmp %>%
  mutate(y = ifelse(tmpFlag, NA, x))
tmpFlag is a vector of length 1 containing a single FALSE. ifelse() returns a result the same length as its condition, so here it returns just one value. Since the condition is FALSE, that value comes from the "no" argument, x, and only its first element, 1, is used. This is where point A kicks in: the value 1 is repeated enough times to populate the entire y column.
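The two steps can be pulled apart to make this visible. This is only a sketch of the reasoning, not how dplyr works internally; inner and step2 are illustrative names.

```r
library(dplyr)

tmpFlag <- FALSE
ttmp <- tibble(x = c(1:4, NA))

# Step 1: ifelse() sees a length-1 condition and returns a length-1 result.
inner <- ifelse(tmpFlag, NA, ttmp$x)
inner
#> [1] 1

# Step 2: mutate() recycles that single value down the whole column.
step2 <- ttmp %>%
  mutate(y = inner)
step2
```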
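The 3rd code works as expected because the condition is Flag, a column with as many elements as there are rows in ttmp. The 2nd code works for a related reason: rowwise() makes mutate() evaluate the expression one row at a time, so x itself has length 1 in each call, and the single value that ifelse() returns is the right one for that row.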
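If you would rather not add a helper column or use rowwise(), two possible alternatives are sketched below. Neither comes from the original post; opt1 and opt2 are illustrative names.

```r
library(dplyr)

tmpFlag <- FALSE
ttmp <- tibble(x = c(1:4, NA))

# Option 1: stretch the flag to one element per row before calling ifelse().
opt1 <- ttmp %>%
  mutate(y = ifelse(rep(tmpFlag, n()), NA, x))

# Option 2: a scalar flag suits plain if/else, which picks a whole vector.
opt2 <- ttmp %>%
  mutate(y = if (tmpFlag) NA_integer_ else x)

opt1
opt2
```

Option 2 reads more directly when the flag is a single setting for the whole table rather than something that varies by row.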