Keeping duplicated rows across/between Ids

Hi, I want to create two different datasets: one keeping only the rows that are duplicated within an ID, and one keeping only the rows that are duplicated between IDs. Could anyone shed some light on this?

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

I have a dataset df with four columns (ID, Visit, SYSBP and DIABP), as follows:

A 1 120 80
A 2 130 80
B 1 130 75
B 2 120 80
B 3 130 75
C 1 130 80
C 2 130 80

Now I want to create df_within (rows whose SYSBP and DIABP values repeat within the same ID) as

B 1 130 75
B 3 130 75
C 1 130 80
C 2 130 80

and df_between (pairs of rows from different IDs that share the same SYSBP and DIABP values) as

A 1 120 80
B 2 120 80
A 2 130 80
C 1 130 80
A 2 130 80
C 2 130 80

How do I do this?

For the first output you can do something like the code below. Unfortunately, I do not understand the logic for the second output; maybe with this reprex someone else can help you with that.

df <- data.frame(stringsAsFactors=FALSE,
                 ID = c("A", "A", "B", "B", "B", "C", "C"),
                 Visit = c(1, 2, 1, 2, 3, 1, 2),
                 SYSBP = c(120, 130, 130, 120, 130, 130, 130),
                 DIABP = c(80, 80, 75, 80, 75, 80, 80))
library(dplyr)

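# Count the rows in each ID/SYSBP/DIABP combination, then keep only
# the combinations that occur at least twice
df %>% 
    add_count(ID, SYSBP, DIABP) %>% 
    filter(n >= 2) %>% 
    select(-n)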
#> # A tibble: 4 x 4
#>   ID    Visit SYSBP DIABP
#>   <chr> <dbl> <dbl> <dbl>
#> 1 B         1   130    75
#> 2 B         3   130    75
#> 3 C         1   130    80
#> 4 C         2   130    80

Created on 2019-05-27 by the reprex package (v0.3.0)
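
For anyone who would rather avoid the dplyr dependency, here is a minimal base R sketch of the same within-ID filter. It is not the code from the answer above; it swaps in `duplicated()`, and the names `key` and `is_dup` are just illustrative.

```r
df <- data.frame(stringsAsFactors = FALSE,
                 ID = c("A", "A", "B", "B", "B", "C", "C"),
                 Visit = c(1, 2, 1, 2, 3, 1, 2),
                 SYSBP = c(120, 130, 130, 120, 130, 130, 130),
                 DIABP = c(80, 80, 75, 80, 75, 80, 80))

# Columns that define a "duplicate" within an ID (illustrative name)
key <- df[c("ID", "SYSBP", "DIABP")]

# duplicated() flags only the second and later occurrences, so scan from
# both ends to flag every member of a duplicated set
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

df_within <- df[is_dup, ]
rownames(df_within) <- NULL
df_within
```

On the example data this should keep the same four rows (B visits 1 and 3, C visits 1 and 2) as the `add_count()` version.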

Here is one (quite ugly) way of getting df_between. I hope it can be made much simpler:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- data.frame(stringsAsFactors = FALSE,
                 ID = c("A", "A", "B", "B", "B", "C", "C"),
                 Visit = c(1, 2, 1, 2, 3, 1, 2),
                 SYSBP = c(120, 130, 130, 120, 130, 130, 130),
                 DIABP = c(80, 80, 75, 80, 75, 80, 80))

# df_within: keep ID/SYSBP/DIABP combinations that occur more than once
df %>%
  group_by(ID, SYSBP, DIABP) %>%
  filter(n() > 1) %>%
  ungroup()
#> # A tibble: 4 x 4
#>   ID    Visit SYSBP DIABP
#>   <chr> <dbl> <dbl> <dbl>
#> 1 B         1   130    75
#> 2 B         3   130    75
#> 3 C         1   130    80
#> 4 C         2   130    80

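# df_between: list every pair of rows from different IDs that share
# the same SYSBP and DIABP
df %>%
  group_by(SYSBP, DIABP) %>%
  # keep only the value combinations seen in more than one ID
  filter(n_distinct(ID) > 1) %>%
  group_split() %>%
  # for each combination, build every possible two-row pairing
  lapply(FUN = function(t) {
    apply(X = combn(x = nrow(x = t),
                    m = 2),
          MARGIN = 2,
          FUN = function(z) t[z, ])
  }) %>%
  # when there are several pairings, empty out those where both rows
  # come from the same ID
  lapply(FUN = function(t) {
    if (length(x = t) > 1) {
      lapply(X = t,
             FUN = function(z) filter(.data = z,
                                      n_distinct(ID) > 1))
    } else {
      t
    }
  }) %>%
  # stack the remaining pairs back into a single data frame
  lapply(FUN = function(t) do.call(what = rbind,
                                   args = t)) %>%
  bind_rows()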
#> # A tibble: 6 x 4
#>   ID    Visit SYSBP DIABP
#>   <chr> <dbl> <dbl> <dbl>
#> 1 A         1   120    80
#> 2 B         2   120    80
#> 3 A         2   130    80
#> 4 C         1   130    80
#> 5 A         2   130    80
#> 6 C         2   130    80

Created on 2019-05-28 by the reprex package (v0.3.0)
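
As one possible simplification, the pairing can also be written as a self-join. This is only a sketch in base R, not the approach used above: it merges df with itself on SYSBP and DIABP and keeps the pairs whose IDs differ. The `pair` column and the `pairs`, `left` and `right` objects are hypothetical helpers introduced here for illustration.

```r
df <- data.frame(stringsAsFactors = FALSE,
                 ID = c("A", "A", "B", "B", "B", "C", "C"),
                 Visit = c(1, 2, 1, 2, 3, 1, 2),
                 SYSBP = c(120, 130, 130, 120, 130, 130, 130),
                 DIABP = c(80, 80, 75, 80, 75, 80, 80))

# Self-join on the measurements: every row is paired with every row
# (including itself) that has the same SYSBP and DIABP
pairs <- merge(df, df, by = c("SYSBP", "DIABP"), suffixes = c(".x", ".y"))

# ID.x < ID.y drops same-ID pairs and keeps each cross-ID pair only once
pairs <- pairs[pairs$ID.x < pairs$ID.y, ]
pairs$pair <- seq_len(nrow(pairs))  # hypothetical pair identifier

# Stack the two halves of each pair back into the original column layout
left <- data.frame(pair = pairs$pair, ID = pairs$ID.x, Visit = pairs$Visit.x,
                   SYSBP = pairs$SYSBP, DIABP = pairs$DIABP,
                   stringsAsFactors = FALSE)
right <- data.frame(pair = pairs$pair, ID = pairs$ID.y, Visit = pairs$Visit.y,
                    SYSBP = pairs$SYSBP, DIABP = pairs$DIABP,
                    stringsAsFactors = FALSE)
df_between <- rbind(left, right)

# Put the two rows of each pair next to each other
df_between <- df_between[order(df_between$pair, df_between$ID), ]
rownames(df_between) <- NULL
df_between
```

On the example data this should give six rows forming three cross-ID pairs, with A visit 2 appearing twice because it matches both C visits. The extra `pair` column can be dropped with `df_between$pair <- NULL` if it is not needed. Note that the self-join grows quadratically with the number of rows sharing the same values, so it may be slow on large data.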


Many thanks. I was able to do this in a shorter way by applying the lead and lag functions to the IDs. Thanks anyway. Cheers!

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:
