complete() sets some values to NA

Z3tt · October 17, 2018, 12:08pm

Hi together,

I recently found the tidyr verbs complete(), expand(), and fill() and find them very very useful. However, today I encountered a problem, namely that some of my values were set to NA when using complete().

library(tidyverse)
dat <- tibble(group = c("A", "A", "A", "B", "B", "B"), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.16, 0.158, 0.204, 0.182, 0.164))
dat
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <chr> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164
dat %>% complete(nesting(group), beta = full_seq(beta, 0.001, tol = 1e-03))
#> # A tibble: 6 x 3
#>   group  beta      D
#>   <chr> <dbl>  <dbl>
#> 1 A     0.012  0.174
#> 2 A     0.013 NA    
#> 3 A     0.014  0.158
#> 4 B     0.012  0.204
#> 5 B     0.013 NA    
#> 6 B     0.014  0.164

^{Created on 2018-10-17 by the reprex package (v0.2.1)}

I've tried several ways of nesting and group_by() but the value belonging to 0.13 is always missing. Is someone able to explain to me what is going on here?

Thanks and all the best,
Cédric

jrlewi · October 17, 2018, 12:50pm

Cédric - what do you want the resulting data frame to be?

Z3tt · October 17, 2018, 1:00pm

Hi, oh to clarify: I shortend the example to make it clear becasue the tibble gets too long otherwise to spot the problem. The final code looks like this:

dat %>% 
  complete(nesting(group), beta = full_seq(beta, 0.0001, tol = 1e-04)) %>%
  fill(D)

In the end, I want to have a tibble with all possible beta's (ranging from 0.0120 to 0.0140 by steps of 0.0001) and fill them with the value before. This is currently messed up since 0.013 is set to NA, resulting in the same values for the whole range of 0.0120 to 0.0139 instead of two unique values, one for the range of 0.0120 to 0.0129 and one for the range of 0.0130 to 0.0139. I attach this as a reprex as well.

library(tidyverse)
dat <- tibble(group = c("A", "A", "A", "B", "B", "B"), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.16, 0.158, 0.204, 0.182, 0.164))
dat
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <chr> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164
dat %>% 
  complete(nesting(group), beta = full_seq(beta, 0.0001, tol = 1e-04))
#> # A tibble: 42 x 3
#>    group   beta      D
#>    <chr>  <dbl>  <dbl>
#>  1 A     0.012   0.174
#>  2 A     0.0121 NA    
#>  3 A     0.0122 NA    
#>  4 A     0.0123 NA    
#>  5 A     0.0124 NA    
#>  6 A     0.0125 NA    
#>  7 A     0.0126 NA    
#>  8 A     0.0127 NA    
#>  9 A     0.0128 NA    
#> 10 A     0.0129 NA    
#> # ... with 32 more rows
dat %>% 
  complete(nesting(group), beta = full_seq(beta, 0.0001, tol = 1e-04)) %>% 
  fill(D)
#> # A tibble: 42 x 3
#>    group   beta     D
#>    <chr>  <dbl> <dbl>
#>  1 A     0.012  0.174
#>  2 A     0.0121 0.174
#>  3 A     0.0122 0.174
#>  4 A     0.0123 0.174
#>  5 A     0.0124 0.174
#>  6 A     0.0125 0.174
#>  7 A     0.0126 0.174
#>  8 A     0.0127 0.174
#>  9 A     0.0128 0.174
#> 10 A     0.0129 0.174
#> # ... with 32 more rows

^{Created on 2018-10-17 by the reprex package (v0.2.1)}

I hope this helps!

mara · October 17, 2018, 4:58pm

This is indeed strange, and I haven't quite figured it out. Some investigations, though…

Without a replacement value supplied, complete() introduces NAs. From the example in the tidyr docs:

library(tidyverse)

df <- tibble(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6
)
df %>% complete(group, nesting(item_id, item_name))
#> # A tibble: 4 x 5
#>   group item_id item_name value1 value2
#>   <dbl>   <dbl> <chr>      <int>  <int>
#> 1     1       1 a              1      4
#> 2     1       2 b              3      6
#> 3     2       1 a             NA     NA
#> 4     2       2 b              2      5

^{Created on 2018-10-17 by the reprex package (v0.2.1.9000)}

The fill argument for complete() is described:

A named list that for each variable supplies a single value to use instead of NA for missing combinations.

The output for the full_seq() alone is just;

tidyr::full_seq(c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 0.001, tol = 1e-03)
#> [1] 0.012 0.013 0.014

So the output would be the same as your initial dat:

library(tidyverse)
dat <- tibble(group = c("A", "A", "A", "B", "B", "B"), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.16, 0.158, 0.204, 0.182, 0.164))

dat %>% complete(nesting(group), beta = c(0.012, 0.013, 0.014))
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <chr> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164

Since complete is just;

…a wrapper around expand() , dplyr::left_join()

You should be able to piece it back together by, essentially, using the original table to lookup the value (using a left join), but that seems like a rather roundabout way of approaching things, and it still loses the 0.130 values…

library(tidyverse)
dat <- tibble(group = as.factor(c("A", "A", "A", "B", "B", "B")), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.160, 0.158, 0.204, 0.182, 0.164))

dat
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <fct> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164

dat %>% complete(nesting(group), beta = full_seq(beta, 0.001, tol = 1e-03))
#> # A tibble: 6 x 3
#>   group  beta      D
#>   <fct> <dbl>  <dbl>
#> 1 A     0.012  0.174
#> 2 A     0.013 NA    
#> 3 A     0.014  0.158
#> 4 B     0.012  0.204
#> 5 B     0.013 NA    
#> 6 B     0.014  0.164

dat2 <- dat %>%
  expand(nesting(group), beta = full_seq(beta, 0.0001, tol = 1e-04))

dat3 <- dat2 %>%
  left_join(dat, by = c("group", "beta"))

head(dat3, n = 12)
#> # A tibble: 12 x 3
#>    group   beta      D
#>    <fct>  <dbl>  <dbl>
#>  1 A     0.012   0.174
#>  2 A     0.0121 NA    
#>  3 A     0.0122 NA    
#>  4 A     0.0123 NA    
#>  5 A     0.0124 NA    
#>  6 A     0.0125 NA    
#>  7 A     0.0126 NA    
#>  8 A     0.0127 NA    
#>  9 A     0.0128 NA    
#> 10 A     0.0129 NA    
#> 11 A     0.013  NA    
#> 12 A     0.0131 NA

^{Created on 2018-10-17 by the reprex package (v0.2.1.9000)}

n.b. if you run that last part, you'll see the 0.120 and 0.140 join works fine, I just didn't want to print that whole sequence

Z3tt · October 17, 2018, 8:29pm

Thank you very much for your investigations, Mara!

The non-wrapped version using expand() and left_join() indeed doesn't work, I've tried it myself.
Should I file an issue? I mean if you don't spot it - and I did only spot it because I plotted it afterward - it introduces serious errors in your data.

I am running this code on a much larger tibble and it seems there is no pattern which values are set to NA. Some values aren't set to NA, others are replaced

mara · October 17, 2018, 9:07pm

Yeah — I keep thinking I'm missing something obvious, which is still possible… Either way, if we're both missing it, then we should (at the very least) better document whatever the intended behaviour is.

jrlewi · October 18, 2018, 2:15am

Not saying I understand this (yet)...but is this what you're looking for? (I shortend the number of rows so we can see the result easier - I tried with your example and it worked as well).

library(tidyverse)
dat <- tibble(group = c("A", "A", "A", "B", "B", "B"), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.16, 0.158, 0.204, 0.182, 0.164))
dat
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <chr> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164

dat %>%  
  complete(nesting(group), beta =full_seq(dat$beta*1e3, .0005*1e3)/1e3) %>% 
fill(D)
#> # A tibble: 10 x 3
#>    group   beta     D
#>    <chr>  <dbl> <dbl>
#>  1 A     0.012  0.174
#>  2 A     0.0125 0.174
#>  3 A     0.013  0.16 
#>  4 A     0.0135 0.16 
#>  5 A     0.014  0.158
#>  6 B     0.012  0.204
#>  7 B     0.0125 0.204
#>  8 B     0.013  0.182
#>  9 B     0.0135 0.182
#> 10 B     0.014  0.164

^{Created on 2018-10-17 by the reprex package (v0.2.1)}

My hunch was that there is some kind of numerical precision issue that doesn't recognize the (A, 0.013) and (B, 0.013) pairs as appearing in the original data frame and hence filling in the NA...but I can't seem to understand why this would happen. Anyway - this was why I tried multiplying by 1e3 before applying full_seq() then dividing again after...this seemed to change the behavior.

Z3tt · October 18, 2018, 5:28am

Hi John,

Thank you very much for your help!, this is exactly what I am looking for. I had the same idea when I went to bed yesterday - happy that this is a working solution.

Somehow this still puzzles me and I think it's a serious source for errors in your data. Any idea if this could be fixed within the function?

Best, Cédric

jrlewi · October 18, 2018, 12:27pm

The behavior is strange and it is not satisfying to not understand what is going on. Might be worth digging into the source code for fill_seq. It's hard for me to believe it is a numerical precision issue but I admit I still don't completely understand how the tol argument is used to check periodicity and how this might interact with complete.

Might result in another thread or the need to file an issue.

Z3tt · October 18, 2018, 1:14pm

I am going to file an issue since I have no time to dig deeper at the moment.
Thank you all for your help!

mara · October 18, 2018, 2:02pm

Just a little more digging on this (cuz now it's bugging me). It looks like the generated 0.013 and the original match for near equality, but are not identical:

library(tidyverse)
dat <- tibble(group = as.factor(c("A", "A", "A", "B", "B", "B")), 
              beta = c(0.012, 0.013, 0.014, 0.012, 0.013, 0.014), 
              D = c(0.174, 0.160, 0.158, 0.204, 0.182, 0.164))

dat
#> # A tibble: 6 x 3
#>   group  beta     D
#>   <fct> <dbl> <dbl>
#> 1 A     0.012 0.174
#> 2 A     0.013 0.16 
#> 3 A     0.014 0.158
#> 4 B     0.012 0.204
#> 5 B     0.013 0.182
#> 6 B     0.014 0.164

dat2 <- dat %>%
  expand(nesting(group), beta = full_seq(beta, 0.001, tol = 1e-03))

dat2
#> # A tibble: 6 x 2
#>   group  beta
#>   <fct> <dbl>
#> 1 A     0.012
#> 2 A     0.013
#> 3 A     0.014
#> 4 B     0.012
#> 5 B     0.013
#> 6 B     0.014

dat[[2]][[2]]
#> [1] 0.013

dat2$beta[2]
#> [1] 0.013

dat[[2]][[2]] == dat2$beta[2]
#> [1] FALSE

all.equal.numeric(dat[[2]][[2]], dat2$beta[2])
#> [1] TRUE

identical(dat[[2]][[2]], dat2$beta[2])
#> [1] FALSE

^{Created on 2018-10-18 by the reprex package (v0.2.1.9000)}

Z3tt · October 18, 2018, 2:34pm

So it is indeed about precision of the decimal values. It seems the joining causes the problem (as you already pointed out in your first answer that "it still loses the 0.130") and might be problematic for other cases as well, right? Means, actually the code should use all.equal.numeric() instead of ==? Do you want to comment on the issue I've filed?

mara · October 18, 2018, 2:53pm

Yeah, I'm just testing out a few things still. If it's the join (which would make sense), it might be part of a known (and not particularly solvable) issue :

github.com/tidyverse/dplyr

Error with joins when using numeric and integer ID columns with column attributes

opened 01:56PM - 08 Oct 16 UTC

closed 08:51PM - 16 Feb 17 UTC

dannyparsons

When I try to join on an integer column with a numeric column, if the joining co…lumns have attributes added to them it gives an **incompatible types (integer / numeric)** error. Example: ``` require(dplyr) df1 <- data.frame(V1 = c(1, 2, 3), V2 = letters[1:3]) df2 <- data.frame(V1 = 1:3, V3 = LETTERS[1:3]) full_join(df1, df2) # (Runs correctly) attr(df1$V1, "foo") <- "bar" full_join(df1, df2) # Joining, by = "V1" # Error: Can't join on 'V1' x 'V1' because of incompatible types (integer / numeric) ``` Also note that plyr::join does not give an error in the same situation: ``` require(plyr) df1 <- data.frame(V1 = c(1, 2, 3), V2 = letters[1:3]) df2 <- data.frame(V1 = 1:3, V3 = LETTERS[1:3]) plyr::join(df1, df2) # (Runs correctly) attr(df1$V1, "foo") <- "bar" plyr::join(df1, df2) # (Runs correctly) ``` I use column attributes quite a lot so it would be good if this didn't give an error. I think this might be related to #228.