Unite based on duplicated value?

nviau · December 13, 2017, 6:37pm

I want to unite two columns but only when there is a duplicate value in one of the columns. A simple example of the current data:

library(tidyverse)
tibble(x1 = c("joe", "ann", "ann", "sam", "sam", "ken"),
       x2 = c("boston", "new york", "hartford", "montreal", "boise", "tulsa"))
#> # A tibble: 6 x 2
#>      x1       x2
#>   <chr>    <chr>
#> 1   joe   boston
#> 2   ann new york
#> 3   ann hartford
#> 4   sam montreal
#> 5   sam    boise
#> 6   ken    tulsa

And what I want it to look like:

#> # A tibble: 6 x 2
#>               x1       x2
#>            <chr>    <chr>
#> 1            joe   boston
#> 2 ann - new york new york
#> 3 ann - hartford hartford
#> 4 sam - montreal montreal
#> 5    sam - boise    boise
#> 6            ken    tulsa

I have an mildly onerous workaround but I know there must be a more elegant way.

(This is my first post; sorry for any errors in etiquette )

nick · December 13, 2017, 7:04pm

I'm not sure I'd describe this as "elegant", and I assume one of the R/tidyverse masters will have a better way, but it works...

suppressPackageStartupMessages(library(tidyverse))
dat <- tibble(x1 = c("joe", "ann", "ann", "sam", "sam", "ken"),
              x2 = c("boston", "new york", "hartford", "montreal", "boise", "tulsa"))

dat %>%
  group_by(x1) %>%
  mutate(x1_temp = if(n() == 1) x1 else paste0(x1, " - ", x2)) %>%
  ungroup() %>%
  select(x1 = x1_temp, x2)

mfherman · December 13, 2017, 7:05pm

I'm not sure how elegant this is either, but it seems to get the job done. The first duplicated() call identifies the second (or further) instance of the duplicated row and then I combine it with the fromLast = TRUE option to identify the first instance of the duplicate.

library(dplyr)

df <- tibble(
  x1 = c("joe", "ann", "ann", "sam", "sam", "ken"),
  x2 = c("boston", "new york", "hartford", "montreal", "boise", "tulsa")
)

df %>%
  mutate(x1 = if_else(
    duplicated(x1) | duplicated(x1, fromLast = TRUE),
    paste0(x1, " - ", x2),
    x1
  ))
#> # A tibble: 6 x 2
#>               x1       x2
#>            <chr>    <chr>
#> 1            joe   boston
#> 2 ann - new york new york
#> 3 ann - hartford hartford
#> 4 sam - montreal montreal
#> 5    sam - boise    boise
#> 6            ken    tulsa

Adding: and here's another way of doing it. Elegance not guaranteed!

df %>%
  mutate(x1 = if_else(
    x1 %in% x1[duplicated(x1)],
    paste0(x1, " - ", x2),
    x1
  ))
#> # A tibble: 6 x 2
#>               x1       x2
#>            <chr>    <chr>
#> 1            joe   boston
#> 2 ann - new york new york
#> 3 ann - hartford hartford
#> 4 sam - montreal montreal
#> 5    sam - boise    boise
#> 6            ken    tulsa

nick · December 13, 2017, 7:20pm

That bit is probably better written as unique(x1) (edit: actually, no), but that's otherwise the alteration I was thinking of suggesting to your original code. I personally think it's cleaner than what I had.

mfherman · December 13, 2017, 7:25pm

But won't unique(x1) here identify all the unique elements of the vector, rather than just those that are duplicated?

library(dplyr)

df <- tibble(
  x1 = c("joe", "ann", "ann", "sam", "sam", "ken"),
  x2 = c("boston", "new york", "hartford", "montreal", "boise", "tulsa")
)


df %>%
  mutate(x1 = if_else(
    x1 %in% unique(x1),
    paste0(x1, " - ", x2),
    x1
  ))
#> # A tibble: 6 x 2
#>               x1       x2
#>            <chr>    <chr>
#> 1   joe - boston   boston
#> 2 ann - new york new york
#> 3 ann - hartford hartford
#> 4 sam - montreal montreal
#> 5    sam - boise    boise
#> 6    ken - tulsa    tulsa

nick · December 13, 2017, 7:36pm

You are definitely right; fingers got ahead of my brain there.