Filter does not work, is this known behavior or a bug?

Dear R aficionados,

I'm getting unexpected behavior from the following code:


library(tidyr)
library(dplyr)
# data
s <- seq(1, 4, by = 0.1)
df <- crossing(A = s,
               B = s,
               C = s)

# works
df %>% filter(A == 1.5)
df %>% filter(A == 1.5, B == 1)
df %>% filter(A == 1.5, B == 1, C == 3.2)

# Does not work
df %>% filter(A == 1.5, B == 1, C == 3.8)

# Does not work
df[df$A == 1.5 & df$B == 1 & df$C == 3.8,]

# rounding works
df %>% filter(round(A,1) == 1.5, round(B,1) == 1, round(C,1) == 3.8)

# generated sequence shows good behavior
c <- df %>%  pull(C) %>% unique()
s == c

# with no hidden decimals
s %>% as.character()

Is this a known issue or a bug?

Thanks

Jannik

session info:

R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.1 tidyr_1.1.1

loaded via a namespace (and not attached):
 [1] fansi_0.4.1      utf8_1.1.4       assertthat_0.2.1 packrat_0.5.0    crayon_1.3.4    
 [6] R6_2.4.1         lifecycle_0.2.0  magrittr_1.5     pillar_1.4.6     cli_2.0.2       
[11] rlang_0.4.7      rstudioapi_0.11  vctrs_0.3.2      generics_0.0.2   ellipsis_0.3.1  
[16] tools_3.6.3      glue_1.4.1       purrr_0.3.4      compiler_3.6.3   pkgconfig_2.0.3 
[21] tidyselect_1.1.0 tibble_3.0.3 

It's floating-point precision: the 3.8 stored in the data is not exactly equal to the literal 3.8.
Look at:

unique(df$C) - 3.8
# [1] -2.800000e+00 -2.700000e+00 -2.600000e+00 -2.500000e+00 -2.400000e+00 -2.300000e+00 -2.200000e+00 -2.100000e+00
# [9] -2.000000e+00 -1.900000e+00 -1.800000e+00 -1.700000e+00 -1.600000e+00 -1.500000e+00 -1.400000e+00 -1.300000e+00
# [17] -1.200000e+00 -1.100000e+00 -1.000000e+00 -9.000000e-01 -8.000000e-01 -7.000000e-01 -6.000000e-01 -5.000000e-01
# [25] -4.000000e-01 -3.000000e-01 -2.000000e-01 -1.000000e-01  4.440892e-16  1.000000e-01  2.000000e-01
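
To see where that small difference comes from, here is a minimal base-R sketch using the same sequence as in the question. As far as I understand, seq() with a fractional `by` builds each element roughly as from + i * by, and since 0.1 has no exact binary representation, the rounding error can surface in some elements and not in others. Treat that explanation as an assumption about the internals; the comparisons themselves are easy to check.

```r
# Same sequence as in the question
s <- seq(1, 4, by = 0.1)

# The 29th element prints as 3.8 at the default number of digits ...
print(s[29])

# ... but printing 17 significant digits shows it differs from the literal 3.8
sprintf("%.17g", s[29])
sprintf("%.17g", 3.8)

# The gap is the 4.440892e-16 seen in the output above
s[29] - 3.8

# Exact comparison fails; comparison within a tolerance succeeds
s[29] == 3.8                   # FALSE
isTRUE(all.equal(s[29], 3.8))  # TRUE
```

This is also why `s == c` is TRUE everywhere in the question: both sides hold the same slightly-off values, so comparing them to each other cannot reveal the problem. Only a comparison against the literal 3.8 does.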

dplyr provides a near() function, which is vectorised and compares numerics within a given tolerance. When the tolerance is not specified, it defaults to:

.Machine$double.eps^0.5
[1] 1.490116e-08

example:

# replace
df %>% filter(A == 1.5, B == 1, C == 3.8)
# with
df %>% filter(near(A, 1.5), near(B, 1), near(C, 3.8))
# A tibble: 1 x 3
#       A     B     C
#   <dbl> <dbl> <dbl>
# 1   1.5     1   3.8
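
For the base-R subsetting in the question (the `df[...]` version), the same idea can be sketched without dplyr. The helper name `near_sketch` is hypothetical, and it only mirrors the idea behind near() (absolute difference below a tolerance), not necessarily its exact implementation. `expand.grid()` is used here as a base-R stand-in for `crossing()`, so the row order may differ from the tibble above.

```r
# Hypothetical helper mirroring the idea behind dplyr::near():
# TRUE where x and y differ by less than tol
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

s  <- seq(1, 4, by = 0.1)
df <- expand.grid(A = s, B = s, C = s)  # base-R stand-in for tidyr::crossing()

# Exact comparison: finds nothing, as in the question
miss <- df[df$A == 1.5 & df$B == 1 & df$C == 3.8, ]
nrow(miss)

# Tolerant comparison: finds the single matching row
hit <- df[near_sketch(df$A, 1.5) & near_sketch(df$B, 1) & near_sketch(df$C, 3.8), ]
nrow(hit)
```

The default tolerance is far smaller than the 0.1 grid step, so the tolerant comparison cannot accidentally match a neighbouring value such as 3.7 or 3.9.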