Find values in dataframe that are duplicated in every column

Hello I have a data frame, and would like to return values that are duplicated in every column. For instance, I have this data frame:

df <- data.frame(
  A = c(1,0,9,6,0),
  B = c(0,1,0,0,0),
  C = c(8,0,1,4,0),
  D = c(3,0,4,1,0),
  E = c(0,0,0,0,1)
)
df

  A B C D E
1 1 0 8 3 0
2 0 1 0 0 0
3 9 0 1 4 0
4 6 0 4 1 0
5 0 0 0 0 1

The computer would then return 0 and 1 because these are the only values that are in every column.

Thanks!

To follow up: How could I tweak the solution to make it more permissive, to report values that are duplicates not in all columns, but in n-1, n-2 columns etc..

Like in the above matrix, have it report "4" if I input a minimum duplication of two columns?

Just a question of vocabulary: here you mean values that are present in all columns, right? As "duplicated" usually is used for "repeated within the same set".

In this case, we can start by finding the values in each column, removing duplicates:

unique(df$A)
#> [1] 1 0 9 6
unique(df$B)
#> [1] 0 1
...

We can automate this and get a list of the values in each column (where each element of the list corresponds to a column):

vals_per_col <- lapply(df, unique)

vals_per_col
#> $A
#> [1] 1 0 9 6
#> 
#> $B
#> [1] 0 1
#> 
#> $C
#> [1] 8 0 1 4
#> 
#> $D
#> [1] 3 0 4 1
#> 
#> $E
#> [1] 0 1

So now if your goal is to find in how many columns a given value appears, the easiest is just to find how many times it appears in the list (since we used unique() to ensure a value appears only once per column):

table(unlist(vals_per_col))
#> 0 1 3 4 6 8 9 
#> 5 5 1 2 1 1 1 

So you directly see that 0 and 1 appear in all 5 columns, 4 appears in two, and the others appear only in a single column. This can then be automated with some filtering:

tab <- table(unlist(vals_per_col))

names(tab)[tab >= 5]
#> [1] "0" "1"

Wow thank you so much, this is exactly what I want and very elegant. Much appreciated

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.