Flag and remove nearly identical rows in a data frame

Is there a function or workaround that could remove, or at least flag, cases like the one below, where all the values in two rows are identical except for one?

library(dplyr)
library(tibble)

df <- tribble(~x,  ~y,  ~z,
              0.1, 0.2, 0.3,
              0.1, 0.4, 0.3,
              0.1, 0.2, 0.3,
              0.6, 0.9, 0.8)

distinct(df)

What I would like is to keep only rows 1 and 4. It would also be helpful if I could filter out the nearly identical rows so that I can recheck them manually.

Thanks for the help.

distinct() can keep unique rows based on one or more columns of the data frame. To specify which columns to use, pass their names as additional arguments, for example distinct(df, x, z).
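As a minimal sketch of that suggestion on the example data (assuming dplyr and tibble are installed; the names by_xz and by_xz_full are only illustrative), restricting the comparison to x and z collapses the first three rows into one:

```r
library(dplyr)
library(tibble)

df <- tribble(
  ~x,  ~y,  ~z,
  0.1, 0.2, 0.3,
  0.1, 0.4, 0.3,
  0.1, 0.2, 0.3,
  0.6, 0.9, 0.8
)

# One row per unique (x, z) combination; only x and z are returned
by_xz <- distinct(df, x, z)

# .keep_all = TRUE keeps the remaining columns,
# taking their values from the first row of each (x, z) group
by_xz_full <- distinct(df, x, z, .keep_all = TRUE)

by_xz_full
```

Note that this only helps when you already know which column may differ (here y), because that column has to be left out of the comparison.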

Here's something that classifies rows into clusters that would work for a case like the example.

d <- structure(c(
  0.1, 0.1, 0.1, 0.6, 0.2, 0.4, 0.2, 0.9, 0.3, 0.3,
  0.3, 0.8
), dim = 4:3, dimnames = list(NULL, c("x", "y", "z")))


(d <- cbind(d,e1071::cmeans(d,2)$cluster))
#>        x   y   z  
#> [1,] 0.1 0.2 0.3 1
#> [2,] 0.1 0.4 0.3 1
#> [3,] 0.1 0.2 0.3 1
#> [4,] 0.6 0.9 0.8 2

Created on 2023-02-24 with reprex v2.0.2

Thanks for the suggestion. This won't work for me, since I cannot predict in which column(s) the discrepancy will occur.

It is really a smart way of flagging those nearly identical rows. But if there is no such case, it will still return two clusters (1 and 2), which is still useful but requires manual checking to see whether the rows are really nearly identical or just the result of forced clustering. Do you agree with me?
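One possible alternative that avoids both concerns (not knowing the column in advance, and forced clusters) is to count, for every pair of rows, how many columns differ, and flag a row when it differs from some earlier row in at most one column. The sketch below is only an illustration in base R: the function name flag_near_duplicates and its parameters max_diff and tol are hypothetical, and it assumes all columns are numeric. It compares every pair of rows, so it may be slow on large data.

```r
df <- data.frame(
  x = c(0.1, 0.1, 0.1, 0.6),
  y = c(0.2, 0.4, 0.2, 0.9),
  z = c(0.3, 0.3, 0.3, 0.8)
)

# Hypothetical helper: TRUE for each row that differs from at least one
# EARLIER row in at most `max_diff` columns (exact duplicates included).
# `tol` is an assumed tolerance for comparing floating-point values.
flag_near_duplicates <- function(data, max_diff = 1, tol = 1e-8) {
  m <- as.matrix(data)
  n <- nrow(m)
  flagged <- logical(n)
  for (i in seq_len(n)[-1]) {
    for (j in seq_len(i - 1)) {
      n_diff <- sum(abs(m[i, ] - m[j, ]) > tol)  # columns that differ
      if (n_diff <= max_diff) {
        flagged[i] <- TRUE
        break
      }
    }
  }
  flagged
}

df$near_dup <- flag_near_duplicates(df[, c("x", "y", "z")])

df[df$near_dup, ]   # rows to recheck manually
df[!df$near_dup, ]  # rows kept
```

On the example data this flags rows 2 and 3 and keeps rows 1 and 4. If no rows are close to each other, nothing is flagged, so there is no forced grouping to second-guess.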

Thinking about it more, "nearly identical" becomes elusive.

{n \choose k} for n = 9 and k = 3 is 84

There are 84 combinations of 9 elements taken three at a time without regard to order. Only a few rows fail to overlap (share two common elements) with at least two other rows that do not overlap with each other. In other words, "nearly identical" is not transitive, so the rows cannot always be split cleanly into groups of near-duplicates.
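A quick way to check the count and to see the overlap problem concretely is sketched below. It uses the integers 1 to 9 as stand-ins for 0.1 to 0.9 to sidestep floating-point comparison, and it treats two rows as overlapping when they match in exactly two column positions; the names idx and near are only illustrative.

```r
choose(9, 3)  # number of 3-element combinations of 9 elements

# Same layout as the matrix of combinations, with 1..9 standing in for 0.1..0.9
idx <- t(combn(9, 3))
n <- nrow(idx)

# near[i, j] is TRUE when rows i and j agree in exactly two column positions
near <- matrix(FALSE, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    near[i, j] <- sum(idx[i, ] == idx[j, ]) == 2
  }
}

idx[c(1, 2, 8), ]  # (1,2,3), (1,2,4), (1,3,4)
near[1, 2]         # rows 1 and 2 overlap
near[2, 8]         # rows 2 and 8 overlap
near[1, 8]         # rows 1 and 8 do not
```

Rows 1 and 8 are each nearly identical to row 2 but not to each other, so any rule that keeps only one row per group of near-duplicates has to decide which of them to drop.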

(p <- t(combn(seq(0.1,0.9,0.1),3)))
#>       [,1] [,2] [,3]
#>  [1,]  0.1  0.2  0.3
#>  [2,]  0.1  0.2  0.4
#>  [3,]  0.1  0.2  0.5
#>  [4,]  0.1  0.2  0.6
#>  [5,]  0.1  0.2  0.7
#>  [6,]  0.1  0.2  0.8
#>  [7,]  0.1  0.2  0.9
#>  [8,]  0.1  0.3  0.4
#>  [9,]  0.1  0.3  0.5
#> [10,]  0.1  0.3  0.6
#> [11,]  0.1  0.3  0.7
#> [12,]  0.1  0.3  0.8
#> [13,]  0.1  0.3  0.9
#> [14,]  0.1  0.4  0.5
#> [15,]  0.1  0.4  0.6
#> [16,]  0.1  0.4  0.7
#> [17,]  0.1  0.4  0.8
#> [18,]  0.1  0.4  0.9
#> [19,]  0.1  0.5  0.6
#> [20,]  0.1  0.5  0.7
#> [21,]  0.1  0.5  0.8
#> [22,]  0.1  0.5  0.9
#> [23,]  0.1  0.6  0.7
#> [24,]  0.1  0.6  0.8
#> [25,]  0.1  0.6  0.9
#> [26,]  0.1  0.7  0.8
#> [27,]  0.1  0.7  0.9
#> [28,]  0.1  0.8  0.9
#> [29,]  0.2  0.3  0.4
#> [30,]  0.2  0.3  0.5
#> [31,]  0.2  0.3  0.6
#> [32,]  0.2  0.3  0.7
#> [33,]  0.2  0.3  0.8
#> [34,]  0.2  0.3  0.9
#> [35,]  0.2  0.4  0.5
#> [36,]  0.2  0.4  0.6
#> [37,]  0.2  0.4  0.7
#> [38,]  0.2  0.4  0.8
#> [39,]  0.2  0.4  0.9
#> [40,]  0.2  0.5  0.6
#> [41,]  0.2  0.5  0.7
#> [42,]  0.2  0.5  0.8
#> [43,]  0.2  0.5  0.9
#> [44,]  0.2  0.6  0.7
#> [45,]  0.2  0.6  0.8
#> [46,]  0.2  0.6  0.9
#> [47,]  0.2  0.7  0.8
#> [48,]  0.2  0.7  0.9
#> [49,]  0.2  0.8  0.9
#> [50,]  0.3  0.4  0.5
#> [51,]  0.3  0.4  0.6
#> [52,]  0.3  0.4  0.7
#> [53,]  0.3  0.4  0.8
#> [54,]  0.3  0.4  0.9
#> [55,]  0.3  0.5  0.6
#> [56,]  0.3  0.5  0.7
#> [57,]  0.3  0.5  0.8
#> [58,]  0.3  0.5  0.9
#> [59,]  0.3  0.6  0.7
#> [60,]  0.3  0.6  0.8
#> [61,]  0.3  0.6  0.9
#> [62,]  0.3  0.7  0.8
#> [63,]  0.3  0.7  0.9
#> [64,]  0.3  0.8  0.9
#> [65,]  0.4  0.5  0.6
#> [66,]  0.4  0.5  0.7
#> [67,]  0.4  0.5  0.8
#> [68,]  0.4  0.5  0.9
#> [69,]  0.4  0.6  0.7
#> [70,]  0.4  0.6  0.8
#> [71,]  0.4  0.6  0.9
#> [72,]  0.4  0.7  0.8
#> [73,]  0.4  0.7  0.9
#> [74,]  0.4  0.8  0.9
#> [75,]  0.5  0.6  0.7
#> [76,]  0.5  0.6  0.8
#> [77,]  0.5  0.6  0.9
#> [78,]  0.5  0.7  0.8
#> [79,]  0.5  0.7  0.9
#> [80,]  0.5  0.8  0.9
#> [81,]  0.6  0.7  0.8
#> [82,]  0.6  0.7  0.9
#> [83,]  0.6  0.8  0.9
#> [84,]  0.7  0.8  0.9

# Split p into consecutive blocks of rows.
# (The names p13 and p14 were previously assigned twice, overwriting rows 50:58.)
p1 <- p[1:7,]
p2 <- p[8:13,]
p3 <- p[14:18,]
p4 <- p[19:22,]
p5 <- p[23:25,]
p6 <- p[26:28,]
p7 <- p[29:34,]
p8 <- p[35:39,]
p9 <- p[40:43,]
p10 <- p[44:46,]
p11 <- p[47:48,]
p12 <- p[49,]
p13 <- p[50:54,]
p14 <- p[55:58,]
p15 <- p[59:61,]
p16 <- p[62:63,]
p17 <- p[64,]
p18 <- p[65:68,]
p19 <- p[69:71,]
p20 <- p[72:73,]
p21 <- p[74,]
p22 <- p[75:77,]
p23 <- p[78:79,]
p24 <- p[80,]
p25 <- p[81:82,]
p26 <- p[83:84,]

(t(combn(seq(0.1,0.9,0.2),3)))
#>       [,1] [,2] [,3]
#>  [1,]  0.1  0.3  0.5
#>  [2,]  0.1  0.3  0.7
#>  [3,]  0.1  0.3  0.9
#>  [4,]  0.1  0.5  0.7
#>  [5,]  0.1  0.5  0.9
#>  [6,]  0.1  0.7  0.9
#>  [7,]  0.3  0.5  0.7
#>  [8,]  0.3  0.5  0.9
#>  [9,]  0.3  0.7  0.9
#> [10,]  0.5  0.7  0.9