Why does tidyr's fill work with NAs but not NaNs?

efg · March 6, 2019, 8:44pm

library(tidyverse)
df <- data.frame(Month = 1:5, 
                 Year  = c(2000,  rep(NA,  4)),
                 Year2 = c(2000,  rep(NaN, 4)))
df

[Image: tidyr-fill-NaN]

df %>% fill(Year, Year2)

[Image: tidyr-fill-NaN-example]

rpaul · March 6, 2019, 9:06pm

NaN (Not a Number) is not the same as a null or missing value. It usually shows up in calculated fields as the result of an undefined operation such as 0/0.
NA (Not Available) is what R treats as a missing value, and hence fill() works on it.
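
A quick base R check of the distinction described above. This is a minimal sketch; the vector `x` is illustrative and not taken from the thread:

```r
# Illustrative vector (not from the thread): one number, one NA, one NaN
x <- c(1, NA, 0 / 0)

x[3]        # 0 / 0 evaluates to NaN
is.nan(x)   # flags only the NaN element
is.na(x)    # flags both the NA and the NaN elements
```

Note that `is.na()` treats NaN as missing, while `is.nan()` does not treat NA as NaN, so the two tests are not symmetric.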

efg · March 6, 2019, 11:33pm

Totally understand the definition of NaNs. It's all in What Every Computer Scientist Should Know About Floating-Point Arithmetic.

But I have NaNs in a dataset, so I now need to convert them to NAs before I can use fill? In a world of dirty data, there are times when there is no difference between a NaN and an NA.

EconomiCurtis · March 6, 2019, 11:40pm

One option: before applying fill, convert your NaNs into NAs via something like dplyr's mutate (or mutate_at or mutate_if for many columns).

library(tidyverse)
df <- data.frame(Month = 1:5, 
                 Year  = c(2000,  rep(NA,  4)),
                 Year2 = c(2000,  rep(NaN, 4)))

df %>% 
  mutate(
    Year2 = ifelse(is.nan(Year2), NA, Year2)
  )
#>   Month Year Year2
#> 1     1 2000  2000
#> 2     2   NA    NA
#> 3     3   NA    NA
#> 4     4   NA    NA
#> 5     5   NA    NA

Created on 2019-03-06 by the reprex package (v0.2.1)

RuReady · March 6, 2019, 11:53pm

Why didn't you use if_else()?

efg · March 7, 2019, 4:39am

Thanks for your suggestion. I got the following to work. This is a rather ugly mutate applied over about 40 columns:

 mutate_if(is.numeric, funs(ifelse(is.nan(.), NA, .)))  %>%
  fill(-patient)         %>%

This works, but sadly gives a "soft" deprecation message about funs. The suggested list replacement did not work for me.
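
For what it's worth, one way to avoid funs() is the formula shorthand, which mutate_if accepts directly. The sketch below is an assumption about what the pipeline might look like, not the original code: the data frame `vitals` and the columns `hr` and `temp` are made up, and only `patient` comes from the snippet above.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: `vitals`, `hr` and `temp` are made-up names
vitals <- data.frame(
  patient = c("a", "a", "a"),
  hr      = c(60, NaN, NaN),
  temp    = c(NaN, 37, NaN),
  stringsAsFactors = FALSE
)

filled <- vitals %>%
  # Formula shorthand instead of funs(); `.` stands for the current column
  mutate_if(is.numeric, ~ ifelse(is.nan(.), NA, .)) %>%
  # Fill every column except patient downwards
  fill(-patient)

filled
```

The leading NaN in `temp` becomes NA and stays NA, because fill() has no earlier value to carry down.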

I believe many use NaN and NA somewhat interchangeably. You can't do math with either, but any replace or fill function should work for both NAs and NaNs IMHO.

The fix in this case is to address the problem at the source. I was happy purrr let me read 5000 files with only a few statements. To get NAs, I'll need to add na = "NaN" to the read_delim function used to process the 5000 MIT PhysioNet files that look like this:
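
A minimal sketch of that import-time fix, assuming a small stand-in file. The column names, the "|" delimiter, and the file contents are assumptions for illustration, not the real PhysioNet layout:

```r
library(readr)

# Hypothetical stand-in for one data file; names and delimiter are assumptions
path <- tempfile(fileext = ".psv")
writeLines(c("HR|Temp", "60|NaN", "NaN|37.1"), path)

# Treat the literal string "NaN" as missing, alongside readr's default NA strings
vitals <- read_delim(
  path,
  delim     = "|",
  na        = c("", "NA", "NaN"),
  col_types = cols(.default = col_double())
)

vitals
```

Passing `na = c("", "NA", "NaN")` rather than `na = "NaN"` alone keeps the usual empty-string and "NA" handling as well.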

[Image: Capture]

If the folks at MIT can use NaNs for missing values, I think anyone else can, too.

efg · March 7, 2019, 4:50am

If I use if_else in my mutate_if statement I get this error:

Error: `false` must be a logical vector, not a double vector
Call `rlang::last_error()` to see a backtrace

cderv · March 7, 2019, 7:08am

if_else is strict about types: the TRUE and FALSE replacements must be of the same type. So when replacing with NA, you need to choose between NA_integer_, NA_real_, and so on, because a plain NA is logical by default. This strict rule also applies to recode and coalesce.
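
As a small illustration of the typed-NA point, here is if_else with a double NA. The vector `year2` is illustrative and simply mirrors the Year2 column from the question:

```r
library(dplyr)

# Illustrative vector mirroring the Year2 column
year2 <- c(2000, NaN, NaN)

# NA_real_ is a double NA, so `true` and `false` share the same type
cleaned <- if_else(is.nan(year2), NA_real_, year2)

cleaned
```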

Maybe this will change in the future with {vctrs}.

If you work column by column, you can control the NA type with any of these functions.

case_when can handle the different NA types.

See the following examples:

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2
#> Warning: package 'stringr' was built under R version 3.5.2
#> Warning: package 'forcats' was built under R version 3.5.2
df <- data.frame(Month = 1:5, 
                 Year  = c(2000,  rep(NA,  4)),
                 Year2 = c(2000,  rep(NaN, 4)))
df
#>   Month Year Year2
#> 1     1 2000  2000
#> 2     2   NA   NaN
#> 3     3   NA   NaN
#> 4     4   NA   NaN
#> 5     5   NA   NaN

# column by column specifying NA type
df %>%
  mutate(Year2 = coalesce(Year2, NA_real_))
#>   Month Year Year2
#> 1     1 2000  2000
#> 2     2   NA    NA
#> 3     3   NA    NA
#> 4     4   NA    NA
#> 5     5   NA    NA

# all at once
df %>%
  mutate_all( ~ case_when(!is.nan(.x) ~ .x))
#>   Month Year Year2
#> 1     1 2000  2000
#> 2     2   NA    NA
#> 3     3   NA    NA
#> 4     4   NA    NA
#> 5     5   NA    NA

Created on 2019-03-07 by the reprex package (v0.2.1)

Also, I believe a feature request could be sent to tidyr so that fill also applies to NaN, especially because

is.na(NaN)
#> [1] TRUE

efg · March 7, 2019, 5:20pm

Thanks much for the great examples and introduction to the coalesce function -- I need that function very soon in some numerical experiments.

In the spirit of Felienne's talk at RStudioConf in which she talked about vocalizing syntax to teach programming, could you tell me in words how you vocalize your mutate_all statement? I know it works, but the first ~ and the ! are not exactly obvious to me in what they're doing.

grrrck · March 7, 2019, 6:24pm

I worked with PhysioNet data a few years ago and am fairly confident it's completely appropriate to interpret "NaN" in the source data as NA during import with read_delim().

cderv · March 7, 2019, 10:13pm

mutate_all means "take df and apply a function to each column in the data.frame".
This function is defined with the tidyverse syntax for anonymous functions, ~ f(.x), which is equivalent to custom_fun <- function(x) f(x). .x will be replaced by a column of df.
case_when(LHS ~ RHS) means "create a vector from RHS depending on the condition in LHS".
!is.nan(.x) ~ .x means "where a value in the column is not NaN, keep it". Everything else becomes NA, as no other case is provided in case_when.
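
To make the explanation above concrete, the formula can be written out as a named function and compared with the shorthand. This is only a sketch: `nan_to_na` is a hypothetical name, and `df` is rebuilt from the question so the block runs on its own.

```r
library(dplyr)

df <- data.frame(Month = 1:5,
                 Year  = c(2000, rep(NA, 4)),
                 Year2 = c(2000, rep(NaN, 4)))

# Named equivalent of ~ case_when(!is.nan(.x) ~ .x); `nan_to_na` is a hypothetical name
nan_to_na <- function(x) {
  # Keep values that are not NaN; anything unmatched falls through to NA
  case_when(!is.nan(x) ~ x)
}

with_formula  <- df %>% mutate_all(~ case_when(!is.nan(.x) ~ .x))
with_function <- df %>% mutate_all(nan_to_na)

# Compare the two results
identical(with_formula, with_function)
```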

I hope this explains the code a bit better. Sorry for not being precise enough at the beginning.

efg · March 8, 2019, 3:16am

Excellent explanation. Thanks.
