Opposite of distinct and selecting many variables

Hello, my question is two part:

  1. Is there a dpylr/tidyverse function that is the equivalent of the opposite of distinct()? ie shows are rows that are not unique.
  2. How do you pass many variables(column names) to distinct or add_count without inputting each one? I'm think you probably need some combination of !!, enquo() and vars() but I could not figure it out.
    Example:
df <- dplyr::tribble(~t, ~u, ~v, ~w, ~x, ~y, ~z, 
                     1, "a", 'a', 'b', 4, 'c', NA, 
                     1, "a", 'a', 'b', 4, 'c', 5, 
                     2, "a", 'a', 'b', 4, 'c', 5, 
                     3, "b", 'b', 'b', 8, 'c', 9, 
                     3, "a", 'b', 'b', 10, 'c', 25, 
                     4, "c", 'a', 'b', 4, 'c', 5, 
                     4, "c", 'a', 'b', 4, 'c', NA)

## want a tibble with duplicates removed, but don't use last column in 
## identifying duplicates
df %>% distinct(t, u, v, w, x, y, .keep_all = T)
# A tibble: 5 x 7
#       t u     v     w         x y         z
#   <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
# 1     1 a     a     b         4 c        NA
# 2     2 a     a     b         4 c         5
# 3     3 b     b     b         8 c         9
# 4     3 a     b     b        10 c        25
# 5     4 c     a     b         4 c         5

## want to look at all duplicates, again excluding last column from finding 
## duplicates
df %>% add_count(t, u, v, w, x, y) %>% filter(n > 1)
# A tibble: 4 x 8
#      t u     v     w         x y         z     n
# <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <int>
# 1     1 a     a     b         4 c        NA     2
# 2     1 a     a     b         4 c         5     2
# 3     4 c     a     b         4 c         5     2
# 4     4 c     a     b         4 c        NA     2

This is my first post. I have tried to follow all the guidelines but am happy for feedback/correction.

1 Like

Hello and welcome! I have to do this a lot too- I think you can do what you need to do to with the dplyr scoped variants, but you'll need to do this with group_by_at and your own mutate rather than using add_count (which is a shortcut for that sequence because it is so common).

You are going about the problem in the same way I do to find uniques and duplicates- there are no special functions for this other than the janitor::get_dupes function (janitor package here: https://github.com/sfirke/janitor) I show below, but you can get the same output by counting rows with dplyr:

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2

df <- tibble::tribble(~t, ~u, ~v, ~w, ~x, ~y, ~z, 
                     1, "a", 'a', 'b', 4, 'c', NA, 
                     1, "a", 'a', 'b', 4, 'c', 5, 
                     2, "a", 'a', 'b', 4, 'c', 5, 
                     3, "b", 'b', 'b', 8, 'c', 9, 
                     3, "a", 'b', 'b', 10, 'c', 25, 
                     4, "c", 'a', 'b', 4, 'c', 5, 
                     4, "c", 'a', 'b', 4, 'c', NA)

## want a tibble with duplicates removed, but don't use last column in 
## identifying duplicates
df %>% 
  add_count(t, u, v, w, x, y) %>% 
  filter(n == 1)
#> # A tibble: 3 x 8
#>       t u     v     w         x y         z     n
#>   <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <int>
#> 1     2 a     a     b         4 c         5     1
#> 2     3 b     b     b         8 c         9     1
#> 3     3 a     b     b        10 c        25     1

# same as...
df %>% 
  group_by_at(vars(-z)) %>%  
  mutate(num_rows = sum(n())) %>% 
  filter(num_rows == 1)
#> # A tibble: 3 x 8
#> # Groups:   t, u, v, w, x, y [3]
#>       t u     v     w         x y         z num_rows
#>   <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>    <int>
#> 1     2 a     a     b         4 c         5        1
#> 2     3 b     b     b         8 c         9        1
#> 3     3 a     b     b        10 c        25        1

## want to look at all duplicates, again excluding last column from finding 
## duplicates
df %>% 
  add_count(t, u, v, w, x, y) %>% 
  filter(n > 1)
#> # A tibble: 4 x 8
#>       t u     v     w         x y         z     n
#>   <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <int>
#> 1     1 a     a     b         4 c        NA     2
#> 2     1 a     a     b         4 c         5     2
#> 3     4 c     a     b         4 c         5     2
#> 4     4 c     a     b         4 c        NA     2

# same as...
df %>% 
  group_by_at(vars(-z)) %>%  
  mutate(num_rows = sum(n())) %>% 
  filter(num_rows > 1)
#> # A tibble: 4 x 8
#> # Groups:   t, u, v, w, x, y [2]
#>       t u     v     w         x y         z num_rows
#>   <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>    <int>
#> 1     1 a     a     b         4 c        NA        2
#> 2     1 a     a     b         4 c         5        2
#> 3     4 c     a     b         4 c         5        2
#> 4     4 c     a     b         4 c        NA        2

## use janitor package to look at duplicates
library(janitor)
df %>% 
  get_dupes(t, u, v, w, x, y)
#> # A tibble: 4 x 8
#>       t u     v     w         x y     dupe_count     z
#>   <dbl> <chr> <chr> <chr> <dbl> <chr>      <int> <dbl>
#> 1     1 a     a     b         4 c              2    NA
#> 2     1 a     a     b         4 c              2     5
#> 3     4 c     a     b         4 c              2     5
#> 4     4 c     a     b         4 c              2    NA

Created on 2019-02-01 by the reprex package (v0.2.1)

8 Likes

Thank you so much, this is just what I needed!

1 Like

I usually do something along the lines of this:

set.seed(199037)
tibble(x = sample(LETTERS, 50, replace = TRUE)) %>% count(x) %>% filter(n > 1)
# A tibble: 17 x 2
   x         n
   <chr> <int>
 1 A         2
 2 B         2
 3 C         4
 4 F         2
 5 J         2
 6 K         2
 7 M         2
 8 N         2
 9 O         2
10 P         2
11 Q         4
12 R         4
13 T         2
14 U         3
15 W         3
16 Y         2
17 Z         3

So basically, group by x and count the number of observations in each group, then filter away any groups with just one observation

Hope it helps :slightly_smiling_face:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.