 # Number of unique rows including unknown

Is there a way to count the number of unique rows in a data.frame while accounting for unknown values? I need uniqueness under the assumption that a missing value could be anything. Here are a few examples of what I mean:

``````
df1 <- data.frame(
  V1 = c("A","B","C","D"),
  V2 = c("X","Y","Z","W")
)

> df1
  V1 V2
1  A  X
2  B  Y
3  C  Z
4  D  W
``````

This would return `4`, as there are 4 unique rows; this is the same as `nrow(unique(df1))`. However, the following:

``````
df2 <- data.frame(
  V1 = c("A","B","C","C"),
  V2 = c("X","Y","Z",NA)
)

> df2
  V1   V2
1  A    X
2  B    Y
3  C    Z
4  C <NA>
``````

This would return `3`, as the bottom row could be identical to the 3rd row.

``````
df3 <- data.frame(
  V1 = c("A","B","C","C","B","B","A",NA),
  V2 = c("X","Y","Z",NA,"W","W",NA,"X")
)

> df3
    V1   V2
1    A    X
2    B    Y
3    C    Z
4    C <NA>
5    B    W
6    B    W
7    A <NA>
8 <NA>    X
``````

This would return `4`, since we would count rows 1, 2 & 3 as unique, plus rows 5 & 6, which are the same as each other. Row 4 could be the same as row 3 and so does not increase the count, and rows 7 & 8 could be the same as row 1 and so also do not affect the count.
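To make the matching rule concrete, here is a minimal sketch of the "NA could be anything" comparison for a single pair of rows. `rows_compatible` is a hypothetical helper name used only for illustration (it is not a base R or tidyverse function), and the sketch assumes each argument is a one-row data frame with the same columns.

```r
# Hypothetical helper: TRUE when two one-row data frames could be the same row,
# treating NA as a wildcard that matches any value.
rows_compatible <- function(a, b) {
  a <- vapply(a, as.character, character(1))
  b <- vapply(b, as.character, character(1))
  # A column agrees if either side is unknown or both values are equal.
  all(is.na(a) | is.na(b) | a == b)
}

df3 <- data.frame(
  V1 = c("A","B","C","C","B","B","A",NA),
  V2 = c("X","Y","Z",NA,"W","W",NA,"X")
)

rows_compatible(df3[3, ], df3[4, ])  # TRUE:  (C, Z) vs (C, NA)
rows_compatible(df3[1, ], df3[8, ])  # TRUE:  (A, X) vs (NA, X)
rows_compatible(df3[2, ], df3[5, ])  # FALSE: (B, Y) vs (B, W)
```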

This is probably not very scalable, because I generate the NA variations that could be made from the fully specified rows, but it might serve to get you started (or might be sufficient for your purpose).
Note that I added extra rows to test it out: a final row which cannot match any other, and rows with NA in both positions, which need to be removed because they would match any other row.

``````
library(tidyverse)

(df3 <- data.frame(
  V1 = c("A","B","C","C","B","B","A",NA,NA,NA,NA),
  V2 = c("X","Y","Z",NA,"W","W",NA,"X",NA,NA,"D")
))

# Fully specified rows, de-duplicated
(pure_uniques <- df3 %>% na.omit() %>% distinct())

# Rows with at least one NA, dropping rows that are NA in every column
(questionable_uniques <- setdiff(df3, pure_uniques) %>%
  filter(!(is.na(V1) & is.na(V2))))

# Every NA variation of a fully specified row: blank out one column at a time
(gen_pos <- map_dfr(names(df3), ~ {
  pure_uniques %>% mutate(!!.x := NA, delete_flag = TRUE)
}) %>% distinct())

# Keep only the questionable rows that match none of those variations
(remaining_questionables <- left_join(questionable_uniques, gen_pos,
                                      by = c("V1", "V2")) %>%
  filter(is.na(delete_flag)) %>%
  select(-delete_flag))

(final_count <- n_distinct(pure_uniques) +
  n_distinct(remaining_questionables))
``````
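For checking the result against the examples in the question, the same idea can be wrapped in a function. This is only a sketch in base R and uses a different technique from the code above: instead of generating NA variations, it compares each incomplete row directly with the complete rows, so it also handles more than two columns. `count_unique_na` is a hypothetical name, and, like the code above, it assumes that leftover incomplete rows are counted as distinct from each other.

```r
# Hypothetical helper (not from the thread): count rows that must be distinct
# when NA is treated as a wildcard.
count_unique_na <- function(df) {
  df[] <- lapply(df, as.character)
  is_complete <- complete.cases(df)
  complete   <- unique(df[is_complete, , drop = FALSE])
  incomplete <- unique(df[!is_complete, , drop = FALSE])
  # Rows that are NA in every column could match anything, so drop them.
  incomplete <- incomplete[rowSums(!is.na(incomplete)) > 0, , drop = FALSE]

  cm <- as.matrix(complete)
  im <- as.matrix(incomplete)

  # An incomplete row is absorbed if some complete row agrees with it
  # on every column where it is not NA.
  absorbed <- vapply(seq_len(nrow(im)), function(i) {
    row  <- im[i, ]
    hits <- rep(TRUE, nrow(cm))
    for (j in which(!is.na(row))) hits <- hits & cm[, j] == row[j]
    any(hits)
  }, logical(1))

  nrow(complete) + sum(!absorbed)
}

df1 <- data.frame(V1 = c("A","B","C","D"), V2 = c("X","Y","Z","W"))
df2 <- data.frame(V1 = c("A","B","C","C"), V2 = c("X","Y","Z",NA))
df3 <- data.frame(
  V1 = c("A","B","C","C","B","B","A",NA),
  V2 = c("X","Y","Z",NA,"W","W",NA,"X")
)

count_unique_na(df1)  # 4
count_unique_na(df2)  # 3
count_unique_na(df3)  # 4
```

Each incomplete row is compared against every complete row, so the cost grows with the product of the two counts; for large data a join-based approach like the one above may be preferable.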
