# Filtering NA from a list

Hi! I have a question regarding this dataset that I have.

``````
library(tidyverse)

x = list(a = c(3,4,5),
         b = list(b1 = 10:13, b2 = 11:14, b3 = 12:15),
         c = list(c1 = c(1,2,NA,4),
                  c2 = c(1,2,NA,5),
                  c3 = c(1,2,NA,6)))
df <- x %>%
  as_tibble()
``````

The goal is to find the NA values in column C and show which value in column B each of them corresponds to. This may be a silly question, but in my real dataset the equivalent of column C in this reprex is an integer list with 1800 values in it, and there are almost 250 columns like B that follow the same logic.
Any comments would be highly appreciated.
Regards

PS. I know that there are NA values in column C because of this:

``````
df %>%
  mutate(d = map_int(c, ~sum(is.na(.))))
``````

However, I do not only need to know how many NA values there are; I also need to know which value in the other column each one is associated with. The final result should look like this:

``````
df_expected <- tibble(
  a = c(3,4,5),
  b = c(12,13,14),
  c = c(NA, NA, NA)
)
``````

Have you checked whether the function unlist() helps you achieve your goal?
Try unlist(), convert the result into a tibble, and then transform it into the format you need.

~Arnab

Hi! Thanks for the answer. I have thought about using unlist or unnest; however, there is a lot of data involved, which makes me think twice about using them.

Can you quantify what is meant by 'a lot of data'?
Also, is the data all numeric, or of mixed types?

The second approach assumes a lot of regularity in your data, namely that b and c have matching lengths in their sublists, that there are NAs to find, and that there is no more than one NA to find in each (as per your example).
In return, the second solution is a good 10 times faster.

``````
library(tidyverse)
library(microbenchmark)

x = list(a = c(3,4,5),
         b = list(b1 = 10:13, b2 = 11:14, b3 = 12:15),
         c = list(c1 = c(1,2,NA,4),
                  c2 = c(1,2,NA,5),
                  c3 = c(1,2,NA,6)))
df <- x %>%
  as_tibble()

# Repeat the 3-row example 10^5 times to get a larger test table
big_df <- rep(list(df), 10^5) %>% bind_rows()

# Approach 1: unnest b and c in parallel, then keep the rows where c is NA
microbenchmark(unnest_filter = big_df_new1 <- big_df %>% unnest(cols = c(b, c)) %>% filter(is.na(c)), times = 5L)
# Unit: seconds
# expr      min       lq     mean   median       uq      max neval
# unnest_filter 11.08461 11.83448 12.68876 12.23195 13.84765 14.44511     5

# Approach 2: find the position of the NA in each c, then take that element of b
dobypurrr <- function(x){
  # Position of the (single) NA in each element of c
  index_of_na_c <- purrr::map_int(x$c, ~which(is.na(.)))
  new <- x
  # Value of b at that position
  new$b <- map2_int(x$b, index_of_na_c, ~.x[[.y]])
  new$c <- NA_integer_
  new
}

microbenchmark(purrring = big_df_new2 <- dobypurrr(big_df), times = 5L)
# Unit: milliseconds
# expr      min       lq     mean   median       uq      max neval
# purrring 820.3489 822.4799 834.3503 822.9257 833.3612 872.6359     5
``````

Thanks a lot for your answer.
What I mean by "a lot of data" is that both lists in columns B and C of the real data have 1800 values in them. That made me think twice about unnesting them to obtain the result that I am looking for.
This figure can give you a better idea of what I mean

I suggest you try the unnest approach; it has the benefit of being simple. Let us know if there are any issues.

Hi! Thanks for the answer. Actually, I have a problem with the proposed solution. When I use "which", it returns integer(0) if it does not find any NA, which is correct. The problem comes when I pass those indices to the next part of the algorithm: an error occurs when I try to subset a list with a zero-length index.
I have tried to filter out the integer(0) values with purrr::discard, without success. Can you help me with that, please?
Regards
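
One possible way around the integer(0) problem, shown here only as a sketch: have the lookup return NA_integer_ when an element of c contains no NA, and drop those rows before indexing into b. The table `df2` and the helper column `na_pos` are made up for illustration, and the sketch assumes that only the first NA in each element matters.

``````r
library(tidyverse)

# Hypothetical data: the second row has no NA in c
df2 <- tibble(
  a = c(3, 4, 5),
  b = list(10:13, 11:14, 12:15),
  c = list(c(1, 2, NA, 4), c(1, 2, 3, 5), c(1, 2, NA, 6))
)

# Position of the first NA in each element of c,
# or NA_integer_ (instead of integer(0)) when there is none
first_na <- map_int(df2$c, function(v) {
  idx <- which(is.na(v))
  if (length(idx) == 0L) NA_integer_ else idx[[1]]
})

result <- df2 %>%
  mutate(na_pos = first_na) %>%
  filter(!is.na(na_pos)) %>%                     # drop rows without any NA in c
  mutate(b = map2_int(b, na_pos, ~ .x[[.y]]),    # value of b at the NA position
         c = NA_real_) %>%
  select(-na_pos)
``````

With this example data, `result` keeps only the rows where a is 3 and 5, because the row with a equal to 4 has no NA in c.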

The unnest solution doesn't use which... I showed two approaches, but we shouldn't put too much effort into optimising what is already good enough, namely the first solution.
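
To illustrate why the unnest approach is not affected by integer(0), here is a small sketch on made-up data (`df3` is hypothetical) in which one row has no NA in c and another has two. It assumes, as in the original example, that b and c have the same length within each row.

``````r
library(tidyverse)

# Hypothetical data: row 2 has no NA in c, row 3 has two
df3 <- tibble(
  a = c(3, 4, 5),
  b = list(10:13, 11:14, 12:15),
  c = list(c(1, 2, NA, 4), c(1, 2, 3, 5), c(NA, 2, NA, 6))
)

# Unnest b and c in parallel (one row per element), then keep the NA rows
long <- df3 %>%
  unnest(cols = c(b, c)) %>%
  filter(is.na(c))
``````

Rows without an NA simply drop out of the result, and rows with several NAs contribute one output row per NA, so no special handling is needed.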
