Duplicate Value Check

Thandohlove · September 12, 2020, 8:13pm

Hello,
I am new to R and I am trying to pick up duplicate lake water level values in the data below, but running duplicated() returns all FALSE. This is not correct, because I can see duplicates in the data frame.

         Date A01 A02 A03 A032 A01_CD A02_CD A03_CD A032_CD
1  1966/05/07 4.9  NA  NA   NA    4.9     NA     NA      NA
2  1966/05/08 4.9  NA  NA   NA     NA     NA     NA      NA
3  1966/05/09 4.9  NA  NA   NA    4.9     NA     NA      NA
4  1966/05/10 4.9  NA  NA   NA     NA     NA     NA      NA
5  1966/05/11 4.8  NA  NA   NA     NA     NA     NA      NA
6  1966/05/12 4.9  NA  NA   NA     NA     NA     NA      NA
7  1966/05/13 4.8  NA  NA   NA     NA     NA     NA      NA
8  1966/05/15 4.8  NA  NA   NA     NA     NA     NA      NA
9  1966/05/16 4.8  NA  NA   NA    4.8     NA     NA      NA
10 1966/05/17 4.8  NA  NA   NA     NA     NA     NA      NA
11 1966/05/19 4.8  NA  NA   NA     NA     NA     NA      NA
12 1966/05/20 4.8  NA  NA   NA     NA     NA     NA      NA
13 1966/05/21 4.8  NA  NA   NA     NA     NA     NA      NA
14 1966/05/23 4.8  NA  NA   NA    4.8     NA     NA      NA

Next I tried to remove the duplicates using: No_Duplicates = distinct(dataset9, A01, A02, A03, A032, A01_CD, A02_CD, A03_CD, A032_CD, .keep_all = TRUE), but this removes entire rows, and I just want to remove the duplicate values within the columns. Please help me.

elmstedt · September 12, 2020, 8:52pm

I'm unclear what you mean when you say you just want to remove the duplicate values in the columns.

After you remove a duplicate value, what do you want to be there? NA? 0? What?

You typically cannot just remove a value, as that would leave columns of different lengths. The data would no longer be rectangular, so it could not be a valid data.frame object.
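To illustrate the point about rectangular data, here is a minimal base R sketch. The column names and values are made up for illustration only. It shows that data.frame() refuses columns whose lengths cannot be reconciled, while overwriting a repeated value with NA keeps every column the same length.

```r
# Columns of length 3 and 2 cannot be combined into a data.frame;
# capture the error message instead of stopping the script.
res <- tryCatch(
  data.frame(A = c(1, 2, 4), B = c(2, 5)),
  error = function(e) conditionMessage(e)
)
print(res)

# Overwriting a repeat with NA leaves the number of rows unchanged.
df <- data.frame(A = c(1, 2, 4, 2), C = c(3, 4, 6, 4))
df$C[duplicated(df$C)] <- NA   # the second 4 in C becomes NA
print(df)
```

Note that data.frame() does recycle shorter columns when the lengths divide evenly, which is why the sketch uses lengths 3 and 2.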

So, if you had a data frame that looked like,

What do you imagine the output will be after the process you want to enact?

Thandohlove · September 12, 2020, 9:17pm

  A  B C
1 1  2 3
2 2 NA 4
3 4  5 6
4 2 NA 4

I would want to replace one of the duplicates with NA, especially in the CD columns.

Thandohlove · September 12, 2020, 9:19pm

1 1966/05/07 4.9 NA NA NA NA NA NA NA
2 1966/05/08 4.9 NA NA NA NA NA NA NA
3 1966/05/09 4.9 NA NA NA NA NA NA NA
4 1966/05/10 4.9 NA NA NA NA NA NA NA
5 1966/05/11 4.8 NA NA NA NA NA NA NA
6 1966/05/12 4.9 NA NA NA NA NA NA NA
7 1966/05/13 4.8 NA NA NA NA NA NA NA
8 1966/05/15 4.8 NA NA NA NA NA NA NA
9 1966/05/16 4.8 NA NA NA NA NA NA NA
10 1966/05/17 4.8 NA NA NA NA NA NA NA
11 1966/05/19 4.8 NA NA NA NA NA NA NA
12 1966/05/20 4.8 NA NA NA NA NA NA NA
13 1966/05/21 4.8 NA NA NA NA NA NA NA
14 1966/05/23 4.8 NA NA NA NA NA NA NA
This is the outcome I would want from the code.

nirgrahamuk · September 12, 2020, 10:46pm

It seems like you want to treat each column independently and keep only the first occurrence of each value.

library(tidyverse)
set.seed(42)

(d<- data.frame(a=sample.int(5,10,replace=TRUE),
           b=sample.int(5,10,replace=TRUE),
           c=sample.int(5,10,replace=TRUE),
           d=sample.int(5,10,replace=TRUE)))
#    a b c d
# 1  1 1 5 3
# 2  5 5 5 2
# 3  1 4 5 4
# 4  1 2 4 4
# 5  2 2 2 2
# 6  4 3 4 5
# 7  2 1 3 4
# 8  2 1 2 5
# 9  1 3 1 4
# 10 4 4 2 2

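# Keep the first occurrence of each value in vec; replace later repeats with NA
cleanvecofdups <- function(vec){
 df <- enframe(vec,name=NULL,value="v") %>% 
   group_by(v) %>%               # one group per distinct value
   mutate(rn = row_number())     # rn == 1 marks the first occurrence

 ifelse(df$rn==1,df$v,NA)        # return the cleaned vector explicitly
}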


altered_cd <- purrr::map_dfc(d %>% select(c,d),
           cleanvecofdups)

(d2 <- bind_cols(d %>% select(a,b), altered_cd)) 
#    a b  c  d
# 1  1 1  5  3
# 2  5 5 NA  2
# 3  1 4 NA  4
# 4  1 2  4 NA
# 5  2 2  2 NA
# 6  4 3 NA  5
# 7  2 1  3 NA
# 8  2 1 NA NA
# 9  1 3  1 NA
# 10 4 4 NA NA
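
For comparison, a base R sketch of the same idea, which may also explain the all-FALSE result in the original question. It assumes a small excerpt of the posted data (the first five rows, with only the Date, A01 and A01_CD columns typed in by hand). The names lake and cd_cols are placeholders, not from the thread. As far as I know, duplicated() on a data frame compares whole rows, so rows with distinct dates are never flagged; applied to one column at a time, it does find the repeats.

```r
# Excerpt of the posted data (first five rows, three columns only)
lake <- data.frame(
  Date   = c("1966/05/07", "1966/05/08", "1966/05/09", "1966/05/10", "1966/05/11"),
  A01    = c(4.9, 4.9, 4.9, 4.9, 4.8),
  A01_CD = c(4.9, NA, 4.9, NA, NA)
)

# On the whole data frame, rows are compared as a unit.
# Every Date differs, so no row duplicates an earlier one.
whole_rows <- duplicated(lake)
print(whole_rows)

# Column by column: keep the first occurrence, set later repeats to NA.
cd_cols <- "A01_CD"   # placeholder; list all columns to clean here
lake[cd_cols] <- lapply(lake[cd_cols], function(x) replace(x, duplicated(x), NA))
print(lake)
```

Using replace() with duplicated() avoids the grouping step and keeps each column's type, at the cost of not relying on the tidyverse pipeline shown above.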
