complete() sets some values to NA

tidyr

#1

Hi together,

I recently found the tidyr verbs complete(), expand(), and fill() and find them very very useful. However, today I encountered a problem, namely that some of my values were set to NA when using complete().

library(tidyverse)
dat <- tibble(group = c("A", "A", "A", "B", "B", "B"), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.16, 0.158, 0.204, 0.182, 0.164))
dat
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <chr> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164
dat %>% complete(nesting(group), beta = full_seq(beta, 0.001, tol = 1e-03))
#> # A tibble: 6 x 3
#>   group  beta      D
#>   <chr> <dbl>  <dbl>
#> 1 A     0.012  0.174
#> 2 A     0.013 NA    
#> 3 A     0.014  0.158
#> 4 B     0.012  0.204
#> 5 B     0.013 NA    
#> 6 B     0.014  0.164

Created on 2018-10-17 by the reprex package (v0.2.1)

I've tried several ways of nesting and group_by() but the value belonging to 0.13 is always missing. Is someone able to explain to me what is going on here?

Thanks and all the best,
CĂ©dric


#2

CĂ©dric - what do you want the resulting data frame to be?


#3

Hi, oh to clarify: I shortend the example to make it clear becasue the tibble gets too long otherwise to spot the problem. The final code looks like this:

dat %>% 
  complete(nesting(group), beta = full_seq(beta, 0.0001, tol = 1e-04)) %>%
  fill(D)

In the end, I want to have a tibble with all possible beta's (ranging from 0.0120 to 0.0140 by steps of 0.0001) and fill them with the value before. This is currently messed up since 0.013 is set to NA, resulting in the same values for the whole range of 0.0120 to 0.0139 instead of two unique values, one for the range of 0.0120 to 0.0129 and one for the range of 0.0130 to 0.0139. I attach this as a reprex as well.

library(tidyverse)
dat <- tibble(group = c("A", "A", "A", "B", "B", "B"), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.16, 0.158, 0.204, 0.182, 0.164))
dat
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <chr> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164
dat %>% 
  complete(nesting(group), beta = full_seq(beta, 0.0001, tol = 1e-04))
#> # A tibble: 42 x 3
#>    group   beta      D
#>    <chr>  <dbl>  <dbl>
#>  1 A     0.012   0.174
#>  2 A     0.0121 NA    
#>  3 A     0.0122 NA    
#>  4 A     0.0123 NA    
#>  5 A     0.0124 NA    
#>  6 A     0.0125 NA    
#>  7 A     0.0126 NA    
#>  8 A     0.0127 NA    
#>  9 A     0.0128 NA    
#> 10 A     0.0129 NA    
#> # ... with 32 more rows
dat %>% 
  complete(nesting(group), beta = full_seq(beta, 0.0001, tol = 1e-04)) %>% 
  fill(D)
#> # A tibble: 42 x 3
#>    group   beta     D
#>    <chr>  <dbl> <dbl>
#>  1 A     0.012  0.174
#>  2 A     0.0121 0.174
#>  3 A     0.0122 0.174
#>  4 A     0.0123 0.174
#>  5 A     0.0124 0.174
#>  6 A     0.0125 0.174
#>  7 A     0.0126 0.174
#>  8 A     0.0127 0.174
#>  9 A     0.0128 0.174
#> 10 A     0.0129 0.174
#> # ... with 32 more rows

Created on 2018-10-17 by the reprex package (v0.2.1)

I hope this helps!


#4

This is indeed strange, and I haven't quite figured it out. Some investigations, though…

Without a replacement value supplied, complete() introduces NAs. From the example in the tidyr docs:

library(tidyverse)

df <- tibble(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)
df %>% complete(group, nesting(item_id, item_name))
#> # A tibble: 4 x 5
#>   group item_id item_name value1 value2
#>   <dbl>   <dbl> <chr>      <int>  <int>
#> 1     1       1 a              1      4
#> 2     1       2 b              3      6
#> 3     2       1 a             NA     NA
#> 4     2       2 b              2      5

Created on 2018-10-17 by the reprex package (v0.2.1.9000)

The fill argument for complete() is described:

A named list that for each variable supplies a single value to use instead of NA for missing combinations.

The output for the full_seq() alone is just;

tidyr::full_seq(c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 0.001, tol = 1e-03)
#> [1] 0.012 0.013 0.014

So the output would be the same as your initial dat:

library(tidyverse)
dat <- tibble(group = c("A", "A", "A", "B", "B", "B"), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.16, 0.158, 0.204, 0.182, 0.164))

dat %>% complete(nesting(group), beta = c(0.012, 0.013, 0.014))
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <chr> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164

Since complete is just;

…a wrapper around expand() , dplyr::left_join()

You should be able to piece it back together by, essentially, using the original table to lookup the value (using a left join), but that seems like a rather roundabout way of approaching things, and it still loses the 0.130 values…

library(tidyverse)
dat <- tibble(group = as.factor(c("A", "A", "A", "B", "B", "B")), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.160, 0.158, 0.204, 0.182, 0.164))

dat
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <fct> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164

dat %>% complete(nesting(group), beta = full_seq(beta, 0.001, tol = 1e-03))
#> # A tibble: 6 x 3
#>   group  beta      D
#>   <fct> <dbl>  <dbl>
#> 1 A     0.012  0.174
#> 2 A     0.013 NA    
#> 3 A     0.014  0.158
#> 4 B     0.012  0.204
#> 5 B     0.013 NA    
#> 6 B     0.014  0.164

dat2 <- dat %>%
  expand(nesting(group), beta = full_seq(beta, 0.0001, tol = 1e-04))

dat3 <- dat2 %>%
  left_join(dat, by = c("group", "beta"))

head(dat3, n = 12)
#> # A tibble: 12 x 3
#>    group   beta      D
#>    <fct>  <dbl>  <dbl>
#>  1 A     0.012   0.174
#>  2 A     0.0121 NA    
#>  3 A     0.0122 NA    
#>  4 A     0.0123 NA    
#>  5 A     0.0124 NA    
#>  6 A     0.0125 NA    
#>  7 A     0.0126 NA    
#>  8 A     0.0127 NA    
#>  9 A     0.0128 NA    
#> 10 A     0.0129 NA    
#> 11 A     0.013  NA    
#> 12 A     0.0131 NA

Created on 2018-10-17 by the reprex package (v0.2.1.9000)

n.b. if you run that last part, you'll see the 0.120 and 0.140 join works fine, I just didn't want to print that whole sequence


#5

Thank you very much for your investigations, Mara!

The non-wrapped version using expand() and left_join() indeed doesn't work, I've tried it myself.
Should I file an issue? I mean if you don't spot it - and I did only spot it because I plotted it afterward - it introduces serious errors in your data.

I am running this code on a much larger tibble and it seems there is no pattern which values are set to NA. Some values aren't set to NA, others are replaced :thinking:


#6

Yeah — I keep thinking I'm missing something obvious, which is still possible… Either way, if we're both missing it, then we should (at the very least) better document whatever the intended behaviour is.


#7

Not saying I understand this (yet)...but is this what you're looking for? (I shortend the number of rows so we can see the result easier - I tried with your example and it worked as well).

library(tidyverse)
dat <- tibble(group = c("A", "A", "A", "B", "B", "B"), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.16, 0.158, 0.204, 0.182, 0.164))
dat
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <chr> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164

dat %>%  
  complete(nesting(group), beta =full_seq(dat$beta*1e3, .0005*1e3)/1e3) %>% 
fill(D)
#> # A tibble: 10 x 3
#>    group   beta     D
#>    <chr>  <dbl> <dbl>
#>  1 A     0.012  0.174
#>  2 A     0.0125 0.174
#>  3 A     0.013  0.16 
#>  4 A     0.0135 0.16 
#>  5 A     0.014  0.158
#>  6 B     0.012  0.204
#>  7 B     0.0125 0.204
#>  8 B     0.013  0.182
#>  9 B     0.0135 0.182
#> 10 B     0.014  0.164

Created on 2018-10-17 by the reprex package (v0.2.1)

My hunch was that there is some kind of numerical precision issue that doesn't recognize the (A, 0.013) and (B, 0.013) pairs as appearing in the original data frame and hence filling in the NA...but I can't seem to understand why this would happen. Anyway - this was why I tried multiplying by 1e3 before applying full_seq() then dividing again after...this seemed to change the behavior. :man_shrugging:


#8

Hi John,

Thank you very much for your help!, this is exactly what I am looking for. I had the same idea when I went to bed yesterday - happy that this is a working solution.

Somehow this still puzzles me and I think it's a serious source for errors in your data. Any idea if this could be fixed within the function?

Best, CĂ©dric


#9

The behavior is strange and it is not satisfying to not understand what is going on. Might be worth digging into the source code for fill_seq. It's hard for me to believe it is a numerical precision issue but I admit I still don't completely understand how the tol argument is used to check periodicity and how this might interact with complete.

Might result in another thread or the need to file an issue.


#10

I am going to file an issue since I have no time to dig deeper at the moment.
Thank you all for your help!


#11

Just a little more digging on this (cuz now it's bugging me). It looks like the generated 0.013 and the original match for near equality, but are not identical:

library(tidyverse)
dat <- tibble(group = as.factor(c("A", "A", "A", "B", "B", "B")), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.160, 0.158, 0.204, 0.182, 0.164))

dat
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <fct> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164

dat2 <- dat %>%
  expand(nesting(group), beta = full_seq(beta, 0.001, tol = 1e-03))

dat2
#> # A tibble: 6 x 2
#>   group  beta
#>   <fct> <dbl>
#> 1 A     0.012
#> 2 A     0.013
#> 3 A     0.014
#> 4 B     0.012
#> 5 B     0.013
#> 6 B     0.014

dat[[2]][[2]]
#> [1] 0.013

dat2$beta[2]
#> [1] 0.013

dat[[2]][[2]] == dat2$beta[2]
#> [1] FALSE

all.equal.numeric(dat[[2]][[2]], dat2$beta[2])
#> [1] TRUE

identical(dat[[2]][[2]], dat2$beta[2])
#> [1] FALSE

Created on 2018-10-18 by the reprex package (v0.2.1.9000)


#12

So it is indeed about precision of the decimal values. It seems the joining causes the problem (as you already pointed out in your first answer that "it still loses the 0.130") and might be problematic for other cases as well, right? Means, actually the code should use all.equal.numeric() instead of ==? Do you want to comment on the issue I've filed?


#13

Yeah, I'm just testing out a few things still. If it's the join (which would make sense), it might be part of a known (and not particularly solvable) issue :grimacing::