Identify the duplicates

Hi!

I've a large dataset, where there are lots of duplicates. For the analysis, I need to know:

  1. which records are duplicates
  2. of which record they are duplicates of

Let me provide something similar to what I have, what I want and what I've done till now:

# setting seed for reproducibility
set.seed(seed = 1701)

# generating random data
x <- y <- z <- rep(x = 1:4, each = 4)
d <- cbind(sample(x = x),
           sample(x = y),
           sample(x = z))

# checking duplicates
(t <- duplicated(d))
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#> [12] FALSE FALSE  TRUE FALSE FALSE

# desired result
s <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 7, 10, 11, 12, 10, 13, 14)
(e <- data.frame(d, t = t, s = s))
#>    X1 X2 X3     t  s
#> 1   1  3  4 FALSE  1
#> 2   3  4  3 FALSE  2
#> 3   1  2  4 FALSE  3
#> 4   2  4  4 FALSE  4
#> 5   4  2  1 FALSE  5
#> 6   1  3  3 FALSE  6
#> 7   2  1  2 FALSE  7
#> 8   2  4  3 FALSE  8
#> 9   4  3  2 FALSE  9
#> 10  2  1  2  TRUE  7
#> 11  3  2  1 FALSE 10
#> 12  4  3  4 FALSE 11
#> 13  4  1  2 FALSE 12
#> 14  3  2  1  TRUE 10
#> 15  1  4  3 FALSE 13
#> 16  3  1  1 FALSE 14

# attempt
start <- 1
index <- integer(length = nrow(x = d))
for (i in 1:nrow(x = d))
{
  if (t[i] == FALSE)
  {
    index[i] <- start
    start <- (start + 1)
  } else
  {
    for (j in 1:(i - 1))
    {
      if (identical(x = d[i,], y = d[j,]))
      {
        index[i] <- index[j]
        break
      }
    }
  }
}

# result
(f <- data.frame(e, r = index))
#>    X1 X2 X3     t  s  r
#> 1   1  3  4 FALSE  1  1
#> 2   3  4  3 FALSE  2  2
#> 3   1  2  4 FALSE  3  3
#> 4   2  4  4 FALSE  4  4
#> 5   4  2  1 FALSE  5  5
#> 6   1  3  3 FALSE  6  6
#> 7   2  1  2 FALSE  7  7
#> 8   2  4  3 FALSE  8  8
#> 9   4  3  2 FALSE  9  9
#> 10  2  1  2  TRUE  7  7
#> 11  3  2  1 FALSE 10 10
#> 12  4  3  4 FALSE 11 11
#> 13  4  1  2 FALSE 12 12
#> 14  3  2  1  TRUE 10 10
#> 15  1  4  3 FALSE 13 13
#> 16  3  1  1 FALSE 14 14

Created on 2019-02-09 by the reprex package (v0.2.1)

So, basically my problem is solved, but I suppose there are better ways to do it.

Any suggestions?

Thanks.

I see you specifically tagged base-r here, but a recent thread provided some tidyverse options for looking at duplicate rows: Return the unique records and the last duplicated row

Also this tweet thread has some base R tips: https://twitter.com/WeAreRLadies/status/1093541404985352192

1 Like

Perhaps hutils::duplicated_rows(as.data.table(d))?

 library(data.table)
 library(hutils)
  DT <- data.table(x = rep(1:4, 3),
                   y = rep(1:2, 6),
                   z = rep(1:3, 4))

  # No duplicates
  duplicated_rows(DT)
#> Empty data.table (0 rows) of 3 cols: x,y,z

  # x and y have duplicates
  duplicated_rows(DT, by = c("x", "y"), order = FALSE)
#>     x y z
#>  1: 1 1 1
#>  2: 2 2 2
#>  3: 3 1 3
#>  4: 4 2 1
#>  5: 1 1 2
#>  6: 2 2 3
#>  7: 3 1 1
#>  8: 4 2 2
#>  9: 1 1 3
#> 10: 2 2 1
#> 11: 3 1 2
#> 12: 4 2 3

  # By default, the duplicate rows are presented adjacent to each other.
  duplicated_rows(DT, by = c("x", "y"))
#>     x y z
#>  1: 1 1 1
#>  2: 1 1 2
#>  3: 1 1 3
#>  4: 2 2 2
#>  5: 2 2 3
#>  6: 2 2 1
#>  7: 3 1 3
#>  8: 3 1 1
#>  9: 3 1 2
#> 10: 4 2 1
#> 11: 4 2 2
#> 12: 4 2 3

Created on 2019-02-10 by the reprex package (v0.2.1)

1 Like

I really like this solution.

Is it possible to get an index column (as in my example) while I use order = FALSE?

Could you add an index column to DT itself?

DT[, I := .I]

then duplicated_rows(DT, ...)

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.