handling duplicates

yoyong · April 20, 2021, 9:03am

Hi. Need help with coding.

I have a dataset with two variables (Name and Response).
I want to add two more variables, Obs_No and Freq.
Obs_No is the observation number: if a name is duplicated, all of its rows should get the same Obs_No (e.g. Josh and Ana). Freq is the number of times the name appears (see table).

Name   Response  Obs_No  Freq
Ben    1         1       1
Marie  0         2       1
Josh   1         3       2
Josh   0         3       2
Ana    0         4       3
Ana    0         4       3
Ana    1         4       3

Thanks for the help!

yoyong

Anantadinath · April 20, 2021, 9:34am

If you use data.table, there is a function called unique() that can help, and you can pass multiple arguments to it. There is also a function named duplicated() that can be used.
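For readers who want to see what that might look like, here is a minimal sketch, assuming the data.table package is installed. The table name `dt` and the helper names `first_rows` and `is_repeat` are made up for illustration. Note that unique() and duplicated() only find or flag repeats; the grouped `.GRP` and `.N` assignments at the end are an extra step, not something the reply above spelled out.

```r
library(data.table)

# Hypothetical copy of the question's data (Name and Response only)
dt <- data.table(
  Name     = c("Ben", "Marie", "Josh", "Josh", "Ana", "Ana", "Ana"),
  Response = c(1, 0, 1, 0, 0, 0, 1)
)

# unique() with `by` keeps the first row for each Name
first_rows <- unique(dt, by = "Name")

# duplicated() with `by` flags every repeat of a Name after its first row
is_repeat <- duplicated(dt, by = "Name")

# Grouped assignment: .GRP is the group counter (in order of first
# appearance when using `by`), .N is the number of rows in the group
dt[, Obs_No := .GRP, by = Name]
dt[, Freq := .N, by = Name]

print(dt)
```

Using `by` rather than `keyby` matters here: `keyby` would sort the groups, so the numbering would no longer follow the order in which names first appear.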

yoyong · April 20, 2021, 9:47am

Thanks, but I'm not really familiar with R. Anyway, I'll do some reading. Much appreciated.

Bugs · April 27, 2021, 9:10pm

The best answer depends on your final goal. unique() will give you the unique names in Name. If you need the number of unique names, you can use length(unique()). If you need a column that identifies unique names, then I would start by sorting the names to make sure that all cases of a name are together. Then use a for loop and an if test to go through and find where Name changes: if the name has changed from the previous entry, increment Obs_No; otherwise leave it unchanged. In dplyr, the sorting is done using arrange().
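The sort-and-loop idea could be sketched in base R roughly as follows. The names `df` and `sorted` are just for illustration. One caveat: sorting numbers the names alphabetically (Ana becomes 1), which differs from the table in the question, where names are numbered in order of first appearance. The last two lines show a vectorised alternative using match() and ave() that avoids sorting; that part is a substitute technique, not what the reply describes.

```r
# Hypothetical copy of the question's data
df <- data.frame(
  Name     = c("Ben", "Marie", "Josh", "Josh", "Ana", "Ana", "Ana"),
  Response = c(1, 0, 1, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Number of distinct names
n_names <- length(unique(df$Name))

# Sort so that all rows for a name are adjacent
sorted <- df[order(df$Name), ]

# Walk down the rows; bump the counter whenever Name changes
sorted$Obs_No <- 1L
for (i in seq_len(nrow(sorted))[-1]) {
  if (sorted$Name[i] != sorted$Name[i - 1]) {
    sorted$Obs_No[i] <- sorted$Obs_No[i - 1] + 1L
  } else {
    sorted$Obs_No[i] <- sorted$Obs_No[i - 1]
  }
}

# Alternative without sorting: number names by first appearance,
# and count rows per name
df$Obs_No <- match(df$Name, unique(df$Name))
df$Freq   <- ave(seq_along(df$Name), df$Name, FUN = length)

print(df)
```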

EconomiCurtis · April 27, 2021, 10:23pm

I really like the distinct() function from the tidyverse's dplyr package, which drops duplicate rows.

library(dplyr)
df <- tribble(~Name, ~Response, ~Obs_No, ~Freq,
'Ben', 1, 1, 1,
'Marie', 0, 2, 1,
'Josh', 1, 3, 2,
'Josh', 0, 3, 2,
'Ana', 0, 4, 3,
'Ana', 0, 4, 3,
'Ana', 1, 4, 3)

df %>% 
  distinct()
#> # A tibble: 6 x 4
#>   Name  Response Obs_No  Freq
#>   <chr>    <dbl>  <dbl> <dbl>
#> 1 Ben          1      1     1
#> 2 Marie        0      2     1
#> 3 Josh         1      3     2
#> 4 Josh         0      3     2
#> 5 Ana          0      4     3
#> 6 Ana          1      4     3

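And just to show other options for distinct(): passing column names removes duplicates based on those columns only, and .keep_all = TRUE keeps the remaining columns from the first matching row.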

library(dplyr)


df %>% 
  distinct(Name, Response, .keep_all = TRUE)
#> # A tibble: 6 x 4
#>   Name  Response Obs_No  Freq
#>   <chr>    <dbl>  <dbl> <dbl>
#> 1 Ben          1      1     1
#> 2 Marie        0      2     1
#> 3 Josh         1      3     2
#> 4 Josh         0      3     2
#> 5 Ana          0      4     3
#> 6 Ana          1      4     3

df %>% 
  distinct(Name, Obs_No, .keep_all = TRUE)
#> # A tibble: 4 x 4
#>   Name  Response Obs_No  Freq
#>   <chr>    <dbl>  <dbl> <dbl>
#> 1 Ben          1      1     1
#> 2 Marie        0      2     1
#> 3 Josh         1      3     2
#> 4 Ana          0      4     3

Created on 2021-04-27 by the reprex package (v2.0.0)
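The examples above start from a table that already has Obs_No and Freq. If the starting point is only Name and Response, as in the original question, one possible dplyr sketch for building those two columns is below. The names `raw` and `result` are hypothetical, and match() against unique() is just one way to number names in order of first appearance.

```r
library(dplyr)

# Hypothetical starting data: only the two original columns
raw <- tribble(
  ~Name,   ~Response,
  "Ben",   1,
  "Marie", 0,
  "Josh",  1,
  "Josh",  0,
  "Ana",   0,
  "Ana",   0,
  "Ana",   1
)

result <- raw %>%
  # Position of each name among the unique names, in order of appearance
  mutate(Obs_No = match(Name, unique(Name))) %>%
  # Count rows per name
  group_by(Name) %>%
  mutate(Freq = n()) %>%
  ungroup()

print(result)

# One summary row per name, if that is the end goal
result %>%
  distinct(Name, Obs_No, Freq)
```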
