Number of unique rows including unknown

Is there a way to count the number of unique rows in a data.frame while accounting for unknown values? I need uniqueness under the assumption that a missing value could be anything. Here are a few examples of what I mean:

df1 <- data.frame(
  V1 = c("A","B","C","D"),
  V2 = c("X","Y","Z","W")
)

> df1
  V1 V2
1  A  X
2  B  Y
3  C  Z
4  D  W

This would return 4, as there are 4 unique rows; this is the same as nrow(unique(df1)). However, the following:

df2 <- data.frame(
  V1 = c("A","B","C","C"),
  V2 = c("X","Y","Z",NA)
)

> df2
  V1   V2
1  A    X
2  B    Y
3  C    Z
4  C <NA>

Would return 3 as the bottom row could be identical to the 3rd row.
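
For reference, a quick check of what base R gives on this example. unique() treats NA as a value of its own, so the last row is counted separately; dropping incomplete rows first happens to give 3 here, but only because the NA row has a possible match.

```r
df2 <- data.frame(
  V1 = c("A", "B", "C", "C"),
  V2 = c("X", "Y", "Z", NA)
)

n_plain    <- nrow(unique(df2))           # NA row counted as its own row
n_complete <- nrow(unique(na.omit(df2)))  # incomplete rows dropped first
c(n_plain, n_complete)                    # 4 and 3
```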

df3 <- data.frame(
  V1 = c("A","B","C","C","B","B","A",NA),
  V2 = c("X","Y","Z",NA,"W","W",NA,"X")
)

> df3
    V1   V2
1    A    X
2    B    Y
3    C    Z
4    C <NA>
5    B    W
6    B    W
7    A <NA>
8 <NA>    X

This would return 4: rows 1, 2 & 3 are unique, and rows 5 & 6 are identical, so they count once. Row 4 could be the same as row 3 and so does not increase the count, and rows 7 & 8 could each be the same as row 1 and so also do not affect the count.
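
To make "could be the same" concrete, here is a minimal sketch of the comparison I have in mind. The name rows_compatible is just something made up for illustration: two rows are compatible when, in every column, the values agree or at least one of them is NA.

```r
# Hypothetical helper: TRUE when every position either agrees
# or has an NA on at least one side.
rows_compatible <- function(a, b) all(is.na(a) | is.na(b) | a == b)

rows_compatible(c("C", "Z"), c("C", NA))   # TRUE  (rows 3 and 4 of df3)
rows_compatible(c("A", "X"), c("B", "Y"))  # FALSE (rows 1 and 2 of df3)
```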

This is probably not very scalable, because I generate every NA variation that can be made from the fully specified rows, but it might serve to get you started (or might be sufficient for your purpose).
Note that I added extra rows to test it: a final row that cannot match anything, and rows with NA in both positions, which need to be removed because they would match any other row.

library(tidyverse)
(df3 <- data.frame(
  V1 = c("A","B","C","C","B","B","A",NA,NA,NA,NA),
  V2 = c("X","Y","Z",NA,"W","W",NA,"X",NA,NA,"D")
))

# fully specified rows, de-duplicated
(pure_uniques <- df3 %>% na.omit() %>% distinct())
# rows containing an NA, excluding all-NA rows (those match anything)
(questionable_uniques <- setdiff(df3, pure_uniques) %>%
  filter(!(is.na(V1) & is.na(V2))))

# every variation of a pure row with one column set to NA
(gen_pos <- map_dfr(names(df3), ~ {
  pure_uniques %>% mutate(!!.x := NA,
                          delete_flag = TRUE)
}) %>% distinct())

# keep only the questionable rows that match none of those variations
(remaining_questionables <- left_join(questionable_uniques,
                                      gen_pos,
                                      by = names(df3)) %>%
  filter(is.na(delete_flag)) %>%
  select(-delete_flag))

(final_count <- n_distinct(pure_uniques) +
                n_distinct(remaining_questionables))
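
If it helps, the same steps can be wrapped in a function and checked against the three examples from the question. This is only a sketch: count_unique_rows is a name I made up, it needs only dplyr, and it keeps the same limitations as the code above (it blanks one column at a time, and it does not compare incomplete rows with each other).

```r
library(dplyr)

# Hypothetical helper (illustration only): the steps above for any column names.
count_unique_rows <- function(df) {
  cols <- names(df)
  n_missing <- rowSums(is.na(df))
  # Complete rows, de-duplicated.
  pure <- distinct(df[n_missing == 0, , drop = FALSE])
  # Rows with some, but not all, values missing.
  partial <- distinct(df[n_missing > 0 & n_missing < ncol(df), , drop = FALSE])
  # Each complete row with one column blanked out (column type is kept).
  masked <- distinct(bind_rows(lapply(cols, function(col) {
    m <- pure
    m[[col]][] <- NA
    m
  })))
  # Incomplete rows that no blanked-out complete row accounts for.
  unexplained <- anti_join(partial, masked, by = cols)
  nrow(pure) + nrow(unexplained)
}

df1 <- data.frame(V1 = c("A","B","C","D"), V2 = c("X","Y","Z","W"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = c("A","B","C","C"), V2 = c("X","Y","Z",NA),
                  stringsAsFactors = FALSE)
df3_question <- data.frame(
  V1 = c("A","B","C","C","B","B","A",NA),
  V2 = c("X","Y","Z",NA,"W","W",NA,"X"),
  stringsAsFactors = FALSE
)

counts <- c(count_unique_rows(df1),
            count_unique_rows(df2),
            count_unique_rows(df3_question))
counts  # 4 3 4, matching the expected answers in the question
```

The anti_join relies on dplyr's default of treating NA as equal to NA when matching join keys, which is the same behaviour the left_join above depends on.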
