handling duplicates

Hi. Need help with coding.

I have a dataset with two variables (Name and Response).
I want to add two more variables, Obs_No and Freq.
Obs_No is the observation number. If a name is duplicated, all of its rows should share the same Obs_No (e.g. Josh and Ana). Freq is the number of times the name appears (see table).

Name   Response  Obs_No  Freq
Ben    1         1       1
Marie  0         2       1
Josh   1         3       2
Josh   0         3       2
Ana    0         4       3
Ana    0         4       3
Ana    1         4       3

Thanks for the help!

yoyong

If you use data.table, there is a function called unique() that can help, and you can pass multiple arguments to it. There is also a function named duplicated() that can be used.
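To make the difference between the two functions concrete, here is a minimal sketch on the names from the question. It uses the base R versions on a plain character vector rather than a data.table, so treat it as an illustration of the idea, not of the data.table interface; the vector `name` is just an assumption for the example.

```r
# Names from the table in the question (hypothetical vector for illustration)
name <- c("Ben", "Marie", "Josh", "Josh", "Ana", "Ana", "Ana")

# unique() keeps the first occurrence of each value
unique_names <- unique(name)

# duplicated() flags every repeat after the first occurrence
is_repeat <- duplicated(name)

print(unique_names)
print(is_repeat)
```

Neither function numbers the names or counts them on its own, but both can serve as building blocks for that.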

Thanks. I'm not really familiar with R, though. Anyway, I will do some reading. Much appreciated.

The best answer depends on your final goal. unique() will give you the unique names in "Name". If you need the number of unique names, you can use length(unique()). If you need a column that identifies unique names, I would start by sorting on Name so that all rows for a name are together. Then use a for loop with an if test to go through the rows and find where Name changes: if the name differs from the previous entry, increment Obs_No; otherwise leave it as is. In dplyr, the sorting is done with arrange().
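The sort-then-loop idea could be sketched roughly like this in base R. The data frame `df` and the helper variables `sorted`, `obs_no` and `current` are assumptions made up for the example, and order() stands in for dplyr's arrange().

```r
# Example data from the question (only the two original columns)
df <- data.frame(
  Name = c("Ben", "Marie", "Josh", "Josh", "Ana", "Ana", "Ana"),
  Response = c(1, 0, 1, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Number of distinct names
n_names <- length(unique(df$Name))

# Sort so that all rows for a name sit together
sorted <- df[order(df$Name), ]

# Walk through the rows and bump the counter whenever Name changes
obs_no <- integer(nrow(sorted))
current <- 0L
for (i in seq_len(nrow(sorted))) {
  if (i == 1 || sorted$Name[i] != sorted$Name[i - 1]) {
    current <- current + 1L
  }
  obs_no[i] <- current
}
sorted$Obs_No <- obs_no

print(sorted)
```

Note that sorting numbers the names alphabetically (Ana gets 1, Ben 2, and so on), which is not the same as the order of first appearance used in the table in the question.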

I really like the tidyverse's dplyr distinct function.

library(dplyr)
df <- tribble(~Name, ~Response, ~Obs_No, ~Freq,
'Ben', 1, 1, 1,
'Marie', 0, 2, 1,
'Josh', 1, 3, 2,
'Josh', 0, 3, 2,
'Ana', 0, 4, 3,
'Ana', 0, 4, 3,
'Ana', 1, 4, 3)

df %>% 
  distinct()
#> # A tibble: 6 x 4
#>   Name  Response Obs_No  Freq
#>   <chr>    <dbl>  <dbl> <dbl>
#> 1 Ben          1      1     1
#> 2 Marie        0      2     1
#> 3 Josh         1      3     2
#> 4 Josh         0      3     2
#> 5 Ana          0      4     3
#> 6 Ana          1      4     3

And just to show other options with distinct():

library(dplyr)


df %>% 
  distinct(Name, Response, .keep_all = TRUE)
#> # A tibble: 6 x 4
#>   Name  Response Obs_No  Freq
#>   <chr>    <dbl>  <dbl> <dbl>
#> 1 Ben          1      1     1
#> 2 Marie        0      2     1
#> 3 Josh         1      3     2
#> 4 Josh         0      3     2
#> 5 Ana          0      4     3
#> 6 Ana          1      4     3

df %>% 
  distinct(Name, Obs_No, .keep_all = TRUE)
#> # A tibble: 4 x 4
#>   Name  Response Obs_No  Freq
#>   <chr>    <dbl>  <dbl> <dbl>
#> 1 Ben          1      1     1
#> 2 Marie        0      2     1
#> 3 Josh         1      3     2
#> 4 Ana          0      4     3

Created on 2021-04-27 by the reprex package (v2.0.0)
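The distinct() calls above start from a table that already contains Obs_No and Freq. For the original question, where only Name and Response exist, one possible way to build the two columns with dplyr is sketched below. The object names `raw` and `result` are made up for the example, and numbering names by first appearance with match() is an assumption about what the question intends.

```r
library(dplyr)

# Only the two columns the question starts with
raw <- tribble(~Name, ~Response,
  'Ben', 1,
  'Marie', 0,
  'Josh', 1,
  'Josh', 0,
  'Ana', 0,
  'Ana', 0,
  'Ana', 1)

result <- raw %>%
  # Position of each name among the unique names, in order of first appearance
  mutate(Obs_No = match(Name, unique(Name))) %>%
  # Number of rows sharing the same name, stored in a column called Freq
  add_count(Name, name = "Freq")

print(result)
```

Because match() works on order of first appearance, no sorting is needed and the original row order is kept.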
