object.size of tibble inflated by na.action attribute

I have a list of tibbles (about 200 tibbles, each 50 x 3) with a huge memory footprint (over 2 GB).

I did some detective work and found that these tibbles carry a lot of information in their na.action attribute: a vector of about 700,000 elements, while the tibble itself is only 50 rows.

In other words, the data in the tibble is about 1 KB but the na.action attribute is about 45 MB.

Is there a way to clear this out? How did this happen?

Here is the str() result:

tibble [50 x 3] (S3: tbl_df/tbl/data.frame)
 $ ts       : POSIXct[1:50], format: "2020-10-12 16:06:00" "2020-10-12 16:08:00" "2020-10-12 16:10:00" "2020-10-12 16:12:00" ...
 $ fs_flow  : num [1:50] -0.0273 -0.0265 -0.0257 -0.0249 -0.0241 ...
 $ viscosity: num [1:50] 466 465 464 463 462 ...
 - attr(*, "na.action")= 'omit' Named int [1:696797] 26303 26304 26305 26306 26307 26308 26309 26310 26311 26312 ...
  ..- attr(*, "names")= chr [1:696797] "26303" "26304" "26305" "26306" ...
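
Sizing the attribute directly shows where the memory goes. A rough sketch, with tbl as a hypothetical stand-in for one of the tibbles in the list:

# tbl is a hypothetical stand-in for one of the 50-row tibbles
print(object.size(attr(tbl, "na.action")), units = "MB")  # the na.action vector alone
print(object.size(tbl), units = "MB")                     # nearly all of it is the attribute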

It happens when na.omit() is used:

(DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA)))
(nadf <- na.omit(DF))
str(nadf)

To remove the attribute, a simple option might be to just cast back to data.frame:

str(data.frame(nadf))

But if that throws away other attributes you want to keep, you can target the na.action attribute specifically for deletion, like this:

attr(nadf,"na.action") <- NULL
str(nadf)
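
Since the original question involves a whole list of tibbles, the same fix can be mapped over the list. A minimal sketch, where tbl_list is a hypothetical stand-in for the list of ~200 tibbles:

# strip the na.action attribute from every tibble in the list
tbl_list <- lapply(tbl_list, function(tbl) {
  attr(tbl, "na.action") <- NULL
  tbl
})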

Thank you!

This honestly seems like a problem. In this case, I have 200 50-row tibbles that are all filtered subsets of a master tibble with a million rows. I guess the na.action attribute of the 1 million-row tibble remains in all the 50-row tibbles?

Not at all an edge case. I wonder if this problem is an unrecognized gremlin in a lot of R code.

You could make your own convenience function that applies na.omit() and immediately throws away the attribute, assuming you often wish to do this.
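
A minimal sketch of such a wrapper (the name na_omit_clean is made up for illustration):

# drop incomplete rows, then discard the bulky na.action attribute
na_omit_clean <- function(x) {
  x <- na.omit(x)
  attr(x, "na.action") <- NULL
  x
}

str(na_omit_clean(DF))  # no na.action attribute in the result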

Thanks. I know how to deal with it now. I think I'll start using filter(complete.cases(.)) instead.
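
For reference, here is a quick check of that approach on the toy data frame from earlier (assuming dplyr is loaded):

library(dplyr)

# filter() keeps the complete rows without recording an na.action attribute
str(DF %>% filter(complete.cases(.)))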

However, I'm worried about others who aren't aware of the problem. It's a problem waiting to happen!

Created a reprex to demonstrate how na.omit() combined with nest() will explode object size.

library(tidyverse)

# define data frame with missing values
df <- 
  rep(
    list(iris %>% mutate(Sepal.Width = if_else(Sepal.Width < 3, NA_real_, Sepal.Width))), 
    1000
  ) %>%
  bind_rows() %>%
  as_tibble()

# function to print object size in MB
print_size <- function(x) x %>% object.size() %>% print(units = "MB")

# small increase in size due to na.omit()
df %>% print_size()
#> 5.2 Mb
df %>% na.omit() %>% print_size()
#> 6.9 Mb

# nesting explodes the size
df %>% nest(data = -Sepal.Length) %>% print_size()
#> 4.1 Mb
df %>% na.omit() %>% nest(data = -Sepal.Length) %>% print_size()
#> 120.8 Mb

# better option
df %>% filter(complete.cases(.)) %>% nest(data = -Sepal.Length) %>% print_size()
#> 2.5 Mb

Created on 2022-03-02 by the reprex package (v2.0.1)
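
One way to see where the extra bytes live is to inspect a nested sub-tibble directly. A hedged check, continuing from the reprex above: if the na.action attribute survives nesting, each group carries its own copy of the full index vector.

# look for the na.action attribute on the first nested sub-tibble
nested <- df %>% na.omit() %>% nest(data = -Sepal.Length)
str(attr(nested$data[[1]], "na.action"))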


This might be a bug with na.omit(), since the behavior is present with tibbles and data frames alike, but if you use the tidyverse equivalent tidyr::drop_na() the problem is no longer present.

library(tidyverse)

# define data frame with missing values
df <- 
    rep(
        list(iris %>% mutate(Sepal.Width = if_else(Sepal.Width < 3, NA_real_, Sepal.Width))), 
        1000
    ) %>%
    bind_rows() %>%
    as_tibble()

# function to print object size in MB
print_size <- function(x) x %>% object.size() %>% print(units = "MB")


df %>% na.omit() %>% nest(data = -Sepal.Length) %>% print_size()
#> 120.8 Mb

df %>% drop_na() %>% nest(data = -Sepal.Length) %>% print_size()
#> 2.5 Mb

Created on 2022-03-02 by the reprex package (v2.0.1)

I think it would be worthwhile to formally report this issue. If the bug is in the stats::na.omit() function, you would need to follow these instructions: R: Bug Reporting in R.

