How to get the duplicate rows removed by distinct() from a data frame

adil · April 2, 2021, 2:29pm

Hey all,
I have used the distinct() function to delete duplicate rows, but later I want to use those deleted rows again (the deleted rows must be saved in a separate data frame). Please guide me on what to do.

df <- tibble(
  x = sample(10, 10, rep = TRUE),
  y = sample(10, 10, rep = TRUE)
)
#> Error in tibble(x = sample(10, 10, rep = TRUE), y = sample(10, 10, rep = TRUE)): could not find function "tibble"
df
#> function (x, df1, df2, ncp, log = FALSE) 
#> {
#>     if (missing(ncp)) 
#>         .Call(C_df, x, df1, df2, log)
#>     else .Call(C_dnf, x, df1, df2, ncp, log)
#> }
#> <bytecode: 0x00000000137c4c58>
#> <environment: namespace:stats>
nrow(df)
#> NULL
nrow(distinct(df))
#> Error in distinct(df): could not find function "distinct"
nrow(distinct(df, x, y))
#> Error in distinct(df, x, y): could not find function "distinct"

distinct(df, x)
#> Error in distinct(df, x): could not find function "distinct"
distinct(df, y)
#> Error in distinct(df, y): could not find function "distinct"

Created on 2021-04-02 by the reprex package (v0.3.0)

mara · April 2, 2021, 3:05pm

FYI, your reprex didn't work because you need to include the library call.

Here's a working reprex, where I've "found" the repeated rows using count().

library(tidyverse)
df <- tibble(
  x = sample(10, 10, rep = TRUE),
  y = sample(10, 10, rep = TRUE)
)
df
#> # A tibble: 10 x 2
#>        x     y
#>    <int> <int>
#>  1     3     3
#>  2     8     3
#>  3     9    10
#>  4     4     8
#>  5     4     6
#>  6     4     7
#>  7    10     3
#>  8    10     3
#>  9     9     9
#> 10     4     1
nrow(df)
#> [1] 10
nrow(distinct(df))
#> [1] 9

df_distinct <- distinct(df)

df %>% 
  count(x, y) %>%
  filter(n > 1)
#> # A tibble: 1 x 3
#>       x     y     n
#>   <int> <int> <int>
#> 1    10     3     2

Created on 2021-04-02 by the reprex package (v1.0.0)

adil · April 2, 2021, 5:23pm

I think you didn't get my point.
I need to store the deleted rows in a data frame, that is, the duplicated rows that were omitted by distinct().

Yarnabrina · April 3, 2021, 9:29am

If I understand your requirements correctly, I don't know whether this is possible directly via distinct(). But here are two workarounds:

# setup

set.seed(seed = 100732)

df <- data.frame(
    x = sample.int(n = 4, size = 10, replace = TRUE),
    y = sample.int(n = 4, size = 10, replace = TRUE)
)
df
#>    x y
#> 1  3 3
#> 2  1 3
#> 3  1 4
#> 4  2 2
#> 5  2 2
#> 6  3 2
#> 7  2 4
#> 8  1 4
#> 9  2 1
#> 10 3 4

# base

z <- duplicated(x = df)
z
#>  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

distinct_rows <- df[!z,]
distinct_rows
#>    x y
#> 1  3 3
#> 2  1 3
#> 3  1 4
#> 4  2 2
#> 6  3 2
#> 7  2 4
#> 9  2 1
#> 10 3 4

duplicate_rows <- df[z,]
duplicate_rows
#>   x y
#> 5 2 2
#> 8 1 4

# dplyr

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

counts <- df %>%
    count(x, y,
          name = "n")
counts
#>   x y n
#> 1 1 3 1
#> 2 1 4 2
#> 3 2 1 1
#> 4 2 2 2
#> 5 2 4 1
#> 6 3 2 1
#> 7 3 3 1
#> 8 3 4 1

distinct_rows <- counts %>%
    select(x, y)
distinct_rows
#>   x y
#> 1 1 3
#> 2 1 4
#> 3 2 1
#> 4 2 2
#> 5 2 4
#> 6 3 2
#> 7 3 3
#> 8 3 4

duplicate_rows <- counts %>%
    filter(n > 1) %>%
    select(x, y)
duplicate_rows
#>   x y
#> 1 1 4
#> 2 2 2

Created on 2021-04-03 by the reprex package (v2.0.0)
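
One caveat about the count() approach: it returns one row per duplicated combination, whereas duplicated() returns every row after the first occurrence. The two agree in the example above only because no combination appears more than twice. If the goal is exactly the rows that distinct() drops, a dplyr-only sketch could look like the following. The data here is hypothetical, chosen so that one row appears three times, and across(everything()) assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical example data: the (2, 2) row appears three times
df <- data.frame(
    x = c(1, 2, 2, 2, 3),
    y = c(1, 2, 2, 2, 3)
)

# Rows that distinct() keeps: the first occurrence of each combination
distinct_rows <- distinct(df)

# Rows that distinct() drops: every occurrence after the first.
# Grouping by all columns makes row_number() count within each combination.
duplicate_rows <- df %>%
    group_by(across(everything())) %>%
    filter(row_number() > 1) %>%
    ungroup()

# Sanity check: kept rows plus dropped rows account for every original row
nrow(distinct_rows) + nrow(duplicate_rows) == nrow(df)
#> [1] TRUE
```

Here duplicate_rows has two rows, both (2, 2), while the count() approach would report that combination only once. The base expression df[duplicated(df), ] gives the same rows without any grouping.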
