library(tidyverse)
df <- data.frame(Month = 1:5,
Year = c(2000, rep(NA, 4)),
Year2 = c(2000, rep(NaN, 4)))
df
df %>% fill(Year, Year2)
NaN (Not a Number) is not the same as a null or missing value. It usually shows up in calculated fields, for example when dividing 0 by 0.
NA (Not Available) is what R treats as a missing value, which is why fill() works on it.
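To make the distinction concrete, here is a small base R sketch of where NaN tends to come from. The variable name `results` is only for illustration.

```r
# NaN is the result of undefined floating-point arithmetic.
results <- c(
  zero_over_zero = 0 / 0,      # undefined, so NaN
  inf_minus_inf  = Inf - Inf,  # undefined, so NaN
  one_over_zero  = 1 / 0       # this one is Inf, not NaN
)

is.nan(results)       # which results are NaN
is.infinite(results)  # which results are infinite
```

Note that dividing a non-zero number by zero gives Inf rather than NaN, so not every division by zero produces a NaN.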
I totally understand the definition of NaNs. It's all in What Every Computer Scientist Should Know About Floating-Point Arithmetic.
But I have NaNs in a dataset that I now need to convert to NAs so I can use fill. In a world of dirty data, there are times when there is no difference between a NaN and an NA.
One option: before applying fill, convert your NaNs into NAs via something like dplyr's mutate (or mutate_at or mutate_if for many columns).
library(tidyverse)
df <- data.frame(Month = 1:5,
Year = c(2000, rep(NA, 4)),
Year2 = c(2000, rep(NaN, 4)))
df %>%
mutate(
Year2 = ifelse(is.nan(Year2), NA, Year2)
)
#> Month Year Year2
#> 1 1 2000 2000
#> 2 2 NA NA
#> 3 3 NA NA
#> 4 4 NA NA
#> 5 5 NA NA
Created on 2019-03-06 by the reprex package (v0.2.1)
Why didn't you use if_else()?
Thanks for your suggestion. I got the following to work. This is a rather ugly mutate applied over about 40 columns:
mutate_if(is.numeric, funs(ifelse(is.nan(.), NA, .))) %>%
fill(-patient) %>%
This works, but sadly gives a "soft" deprecation message about funs. The suggested list replacement did not work for me.
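For what it's worth, one way to avoid funs is the formula (`~`) shorthand, which mutate_if accepts directly. This is only a sketch on made-up data: the columns `hr` and `temp` are hypothetical stand-ins for the roughly 40 real columns, and only `patient` comes from the pipeline above.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the real data.
vitals <- data.frame(
  patient = c("a", "a", "a"),
  hr      = c(80, NaN, NaN),
  temp    = c(NaN, 36.5, NaN)
)

cleaned <- vitals %>%
  # The formula replaces funs(): `.` is the column being processed.
  mutate_if(is.numeric, ~ ifelse(is.nan(.), NA, .)) %>%
  # Fill every column except patient downwards.
  fill(-patient)

cleaned
```

In dplyr 1.0 and later, the same step can also be written as `mutate(across(where(is.numeric), ~ ifelse(is.nan(.x), NA, .x)))`.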
I believe many use NaN and NA somewhat interchangeably. You can't do math with either, but any replace or fill function should work for both NAs and NaNs IMHO.
The fix in this case is to address the problem at the source. I was happy purrr let me read 5000 files with only a few statements. To get NAs, I'll need to add na = "NaN" to the read_delim function used to process the 5000 MIT PhysioNet files that look like this:
If the folks at MIT can use NaNs for missing values, I think anyone else can, too.
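A minimal sketch of that import-time fix, using a tiny temporary file in place of a real PhysioNet file. The pipe delimiter, the `.psv` extension and the column names `HR` and `Temp` are assumptions for illustration, not taken from the actual files.

```r
library(readr)
library(tidyr)

# Write a small stand-in file (assumed layout, not real PhysioNet data).
path <- tempfile(fileext = ".psv")
writeLines(c("HR|Temp", "80|NaN", "NaN|36.5", "82|NaN"), path)

# na = "NaN" tells readr to read the literal string "NaN" as NA.
vitals <- read_delim(
  path,
  delim = "|",
  na = "NaN",
  col_types = cols(.default = col_double())
)

# With real NAs in place, fill() works without any extra mutate step.
filled <- fill(vitals, HR, Temp)
filled
```

The same `na = "NaN"` argument can go into whatever read_delim call is being mapped over the 5000 files.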
If I use if_else in my mutate_if statement I get this error:
Error: `false` must be a logical vector, not a double vector Call `rlang::last_error()` to see a backtrace
if_else is strict about types: the TRUE and FALSE replacements must all be the same type. So when replacing with NA, you need to choose between NA_integer_, NA_real_, or another typed NA. A plain NA is NA_logical_ by default. This strict rule also applies to recode and coalesce.
Maybe this will change in the future with {vctrs}.
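As a small sketch of the typed-NA point, here is if_else with NA_real_ on a double vector. The vector `year2` is just an illustrative copy of the Year2 column.

```r
library(dplyr)

year2 <- c(2000, NaN, NaN, NaN, NaN)

# NA_real_ is a double, so it matches the type of year2
# and if_else's strict type check is satisfied.
year2_clean <- if_else(is.nan(year2), NA_real_, year2)
year2_clean
```

For an integer column the matching choice would be NA_integer_, and for a character column NA_character_.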
If you apply the change column by column, you can control the NA type with any of these functions.
case_when can handle the different NA types.
See the following examples:
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2
#> Warning: package 'stringr' was built under R version 3.5.2
#> Warning: package 'forcats' was built under R version 3.5.2
df <- data.frame(Month = 1:5,
Year = c(2000, rep(NA, 4)),
Year2 = c(2000, rep(NaN, 4)))
df
#> Month Year Year2
#> 1 1 2000 2000
#> 2 2 NA NaN
#> 3 3 NA NaN
#> 4 4 NA NaN
#> 5 5 NA NaN
# column by column specifying NA type
df %>%
mutate(Year2 = coalesce(Year2, NA_real_))
#> Month Year Year2
#> 1 1 2000 2000
#> 2 2 NA NA
#> 3 3 NA NA
#> 4 4 NA NA
#> 5 5 NA NA
# all at once
df %>%
mutate_all( ~ case_when(!is.nan(.x) ~ .x))
#> Month Year Year2
#> 1 1 2000 2000
#> 2 2 NA NA
#> 3 3 NA NA
#> 4 4 NA NA
#> 5 5 NA NA
Created on 2019-03-07 by the reprex package (v0.2.1)
Also, I believe a feature request could be sent to tidyr (the package fill comes from) so that fill also applies to NaN, especially since
is.na(NaN)
#> [1] TRUE
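The relationship only goes one way, which a short base R sketch can show. The vector `x` is illustrative.

```r
x <- c(2000, NaN, NA)

is.na(x)   # FALSE  TRUE  TRUE : NaN counts as missing for is.na()
is.nan(x)  # FALSE  TRUE FALSE : NA is not NaN

# Base R replacement of NaN by NA, leaving everything else untouched.
x[is.nan(x)] <- NA
x
```

So is.na() cannot tell the two apart, while is.nan() picks out only the NaNs.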
Thanks much for the great examples and introduction to the coalesce function -- I need that function very soon in some numerical experiments.
In the spirit of Felienne's talk at RStudioConf in which she talked about vocalizing syntax to teach programming, could you tell me in words how you vocalize your mutate_all statement? I know it works, but the first ~ and the ! are not exactly obvious to me in what they're doing.
I worked with PhysioNet data a few years ago and am fairly confident it's completely appropriate to interpret "NaN" in the source data as NA during import with read_delim().
mutate_all means "take df and apply a function to each column in the data.frame".
~ f(.x) is a shorthand for an anonymous function, equivalent to custom_fun <- function(x) f(x). .x will be replaced by a column of df.
case_when(LHS ~ RHS) means "create a vector from RHS depending on the condition in LHS".
!is.nan(.x) ~ .x means "where a value is not NaN, keep the value". Everything else becomes NA, as no other choice is provided in case_when.
I hope this explains the code a bit more. Sorry for not being precise enough in the beginning.
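To make the equivalence concrete, here is a sketch that writes the formula out as a named function and checks that both give the same result. The name `keep_non_nan` is made up for illustration.

```r
library(dplyr)

df <- data.frame(Month = 1:5,
                 Year = c(2000, rep(NA, 4)),
                 Year2 = c(2000, rep(NaN, 4)))

# The formula ~ case_when(!is.nan(.x) ~ .x) written as an ordinary function:
# keep x where it is not NaN; unmatched elements fall through to NA.
keep_non_nan <- function(x) case_when(!is.nan(x) ~ x)

with_lambda   <- df %>% mutate_all(~ case_when(!is.nan(.x) ~ .x))
with_function <- df %>% mutate_all(keep_non_nan)

# Compare the two results.
identical(with_lambda, with_function)
```

The `!` is just logical negation, so `!is.nan(x)` reads as "x is not NaN".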
Excellent explanation. Thanks.