Opposite of distinct and selecting many variables

carolineL · February 1, 2019, 6:16pm

Hello, my question is two part:

Is there a dpylr/tidyverse function that is the equivalent of the opposite of distinct()? ie shows are rows that are not unique.
How do you pass many variables(column names) to distinct or add_count without inputting each one? I'm think you probably need some combination of !!, enquo() and vars() but I could not figure it out.
Example:

df <- dplyr::tribble(~t, ~u, ~v, ~w, ~x, ~y, ~z, 
                     1, "a", 'a', 'b', 4, 'c', NA, 
                     1, "a", 'a', 'b', 4, 'c', 5, 
                     2, "a", 'a', 'b', 4, 'c', 5, 
                     3, "b", 'b', 'b', 8, 'c', 9, 
                     3, "a", 'b', 'b', 10, 'c', 25, 
                     4, "c", 'a', 'b', 4, 'c', 5, 
                     4, "c", 'a', 'b', 4, 'c', NA)

## want a tibble with duplicates removed, but don't use last column in 
## identifying duplicates
df %>% distinct(t, u, v, w, x, y, .keep_all = T)
# A tibble: 5 x 7
#       t u     v     w         x y         z
#   <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
# 1     1 a     a     b         4 c        NA
# 2     2 a     a     b         4 c         5
# 3     3 b     b     b         8 c         9
# 4     3 a     b     b        10 c        25
# 5     4 c     a     b         4 c         5

## want to look at all duplicates, again excluding last column from finding 
## duplicates
df %>% add_count(t, u, v, w, x, y) %>% filter(n > 1)
# A tibble: 4 x 8
#      t u     v     w         x y         z     n
# <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <int>
# 1     1 a     a     b         4 c        NA     2
# 2     1 a     a     b         4 c         5     2
# 3     4 c     a     b         4 c         5     2
# 4     4 c     a     b         4 c        NA     2

This is my first post. I have tried to follow all the guidelines but am happy for feedback/correction.

apreshill · February 1, 2019, 7:43pm

Hello and welcome! I have to do this a lot too- I think you can do what you need to do to with the dplyr scoped variants, but you'll need to do this with group_by_at and your own mutate rather than using add_count (which is a shortcut for that sequence because it is so common).

You are going about the problem in the same way I do to find uniques and duplicates- there are no special functions for this other than the janitor::get_dupes function (janitor package here: https://github.com/sfirke/janitor) I show below, but you can get the same output by counting rows with dplyr:

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.2

df <- tibble::tribble(~t, ~u, ~v, ~w, ~x, ~y, ~z, 
                     1, "a", 'a', 'b', 4, 'c', NA, 
                     1, "a", 'a', 'b', 4, 'c', 5, 
                     2, "a", 'a', 'b', 4, 'c', 5, 
                     3, "b", 'b', 'b', 8, 'c', 9, 
                     3, "a", 'b', 'b', 10, 'c', 25, 
                     4, "c", 'a', 'b', 4, 'c', 5, 
                     4, "c", 'a', 'b', 4, 'c', NA)

## want a tibble with duplicates removed, but don't use last column in 
## identifying duplicates
df %>% 
  add_count(t, u, v, w, x, y) %>% 
  filter(n == 1)
#> # A tibble: 3 x 8
#>       t u     v     w         x y         z     n
#>   <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <int>
#> 1     2 a     a     b         4 c         5     1
#> 2     3 b     b     b         8 c         9     1
#> 3     3 a     b     b        10 c        25     1

# same as...
df %>% 
  group_by_at(vars(-z)) %>%  
  mutate(num_rows = sum(n())) %>% 
  filter(num_rows == 1)
#> # A tibble: 3 x 8
#> # Groups:   t, u, v, w, x, y [3]
#>       t u     v     w         x y         z num_rows
#>   <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>    <int>
#> 1     2 a     a     b         4 c         5        1
#> 2     3 b     b     b         8 c         9        1
#> 3     3 a     b     b        10 c        25        1

## want to look at all duplicates, again excluding last column from finding 
## duplicates
df %>% 
  add_count(t, u, v, w, x, y) %>% 
  filter(n > 1)
#> # A tibble: 4 x 8
#>       t u     v     w         x y         z     n
#>   <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <int>
#> 1     1 a     a     b         4 c        NA     2
#> 2     1 a     a     b         4 c         5     2
#> 3     4 c     a     b         4 c         5     2
#> 4     4 c     a     b         4 c        NA     2

# same as...
df %>% 
  group_by_at(vars(-z)) %>%  
  mutate(num_rows = sum(n())) %>% 
  filter(num_rows > 1)
#> # A tibble: 4 x 8
#> # Groups:   t, u, v, w, x, y [2]
#>       t u     v     w         x y         z num_rows
#>   <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>    <int>
#> 1     1 a     a     b         4 c        NA        2
#> 2     1 a     a     b         4 c         5        2
#> 3     4 c     a     b         4 c         5        2
#> 4     4 c     a     b         4 c        NA        2

## use janitor package to look at duplicates
library(janitor)
df %>% 
  get_dupes(t, u, v, w, x, y)
#> # A tibble: 4 x 8
#>       t u     v     w         x y     dupe_count     z
#>   <dbl> <chr> <chr> <chr> <dbl> <chr>      <int> <dbl>
#> 1     1 a     a     b         4 c              2    NA
#> 2     1 a     a     b         4 c              2     5
#> 3     4 c     a     b         4 c              2     5
#> 4     4 c     a     b         4 c              2    NA

^{Created on 2019-02-01 by the reprex package (v0.2.1)}

carolineL · February 1, 2019, 8:25pm

Thank you so much, this is just what I needed!

Leon · February 2, 2019, 8:33am

I usually do something along the lines of this:

set.seed(199037)
tibble(x = sample(LETTERS, 50, replace = TRUE)) %>% count(x) %>% filter(n > 1)
# A tibble: 17 x 2
   x         n
   <chr> <int>
 1 A         2
 2 B         2
 3 C         4
 4 F         2
 5 J         2
 6 K         2
 7 M         2
 8 N         2
 9 O         2
10 P         2
11 Q         4
12 R         4
13 T         2
14 U         3
15 W         3
16 Y         2
17 Z         3

So basically, group by x and count the number of observations in each group, then filter away any groups with just one observation

Hope it helps

system · February 9, 2019, 8:33am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.