Remove duplicated rows without affecting other variables

latsyrhccm · April 24, 2020, 9:58pm

Hi,

I was wondering if anyone could help me remove duplicated values of one variable without affecting or removing the values of other variables. I've tried distinct(), but to no avail: it drops whole rows, so the dataframe gets shorter and every variable is affected.

The code and my ideal output are below. Thanks for any help.

library(tidyverse)
library(lubridate)

#Variables
patientid <- c("-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646")

date <- c("2018-08-06", "2018-08-07", "2018-08-15", "2018-08-20", "2018-08-27", "2018-09-03", "2018-09-10",
          "2018-09-17", "2018-09-24", "2018-10-01", "2018-10-08", "2018-10-15", "2018-10-22", "2018-10-29",
          "2018-11-05", "2018-11-12")

week <- week(date)

adherence <- c(4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)  

#Sample dataframe
test.df <- data.frame(patientid, date, week, adherence)

#Ideal output
patientid date week adherence count
n         dmy  32   4         4
n         dmy  32   4         NA
n         dmy  33   3         3
n         dmy  33   3         NA

joels · April 25, 2020, 12:14am

This will keep only the first instance of each unique adherence value:

test.df %>% 
  group_by(patientid) %>% 
  mutate(count = ifelse(duplicated(adherence), NA_real_, adherence))

The next block of code keeps the first row of each run, that is, every row where the value of adherence differs from the row before it. For example, if adherence later returns to a value of 4, the start of the second run of 4s would also be kept. You can test this by changing adherence to: adherence <- c(4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4)

test.df %>% 
  group_by(patientid) %>% 
  mutate(count = ifelse(c(TRUE, diff(adherence) != 0), adherence, NA_real_))
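To see how the two approaches differ, here is a small sketch that runs both on the alternative adherence vector suggested above. The column names first_only and run_start are made up for this illustration, and the data frame is cut down to the two columns that matter.

```r
library(dplyr)

# Alternative adherence vector from the post: the value returns to 4 at the end
df <- data.frame(
  patientid = "-2147483646",
  adherence = c(4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4)
)

res <- df %>%
  group_by(patientid) %>%
  mutate(
    # keeps a value only the first time it appears for the patient
    first_only = ifelse(duplicated(adherence), NA_real_, adherence),
    # keeps the first row of every run of equal consecutive values
    run_start  = ifelse(c(TRUE, diff(adherence) != 0), adherence, NA_real_)
  ) %>%
  ungroup()

# Row positions that keep a value under each approach
which(!is.na(res$first_only))  # 1 3
which(!is.na(res$run_start))   # 1 3 14
```

The second run of 4s (starting at row 14) survives only in run_start, because duplicated() has already seen a 4 in row 1.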

latsyrhccm · April 25, 2020, 8:21pm

Worked perfectly! Thanks for the help, much appreciated.

latsyrhccm · April 25, 2020, 8:33pm

Update: it has worked the way I wanted for the most part.

I've found that the code removes values that share the same week number even when one is from 2018 and the other from 2019 (I want to keep both in that case).

To fix this I was thinking of adding a new variable called year and grouping by it together with week. Either that, or finding a way to assign custom week numbers that run from 1 to n regardless of the year the week falls in.
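One possible way to get week numbers that run from 1 to n across years is to count whole weeks since each patient's first date. This is only a sketch: the column name study_week and the sample dates are assumptions, and it counts 7-day blocks from the first observation rather than calendar weeks.

```r
library(dplyr)

# Hypothetical dates that cross a year boundary
df <- data.frame(
  patientid = "-2147483646",
  date = as.Date(c("2018-12-24", "2018-12-31", "2019-01-07", "2019-12-30"))
)

df <- df %>%
  group_by(patientid) %>%
  mutate(
    # days since the patient's first date, converted to a week index starting at 1
    study_week = as.numeric(date - min(date), units = "days") %/% 7 + 1
  ) %>%
  ungroup()

df$study_week  # 1 2 3 54
```

Because the index never resets at New Year, a week in 2018 and a week in 2019 cannot end up with the same number.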

joels · April 25, 2020, 9:50pm

It sounds like grouping by year and patientid would work. You can create a year column with mutate(year = year(ymd(date))). year and ymd are functions from lubridate. ymd is necessary only if date is not already of class Date in your real data.
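As a rough sketch of that suggestion, the example below adds year and week columns and groups by both along with patientid. The real grouping used in the original data is not shown in this thread, so grouping by week here is an assumption, and the four sample rows are invented to show the same week number occurring in two different years.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows: the same calendar week in 2018 and in 2019
df <- data.frame(
  patientid = "-2147483646",
  date      = c("2018-08-06", "2018-08-07", "2019-08-06", "2019-08-07"),
  adherence = c(4, 4, 4, 4)
)

res <- df %>%
  mutate(
    year = year(ymd(date)),  # ymd() parses the character dates first
    week = week(ymd(date))
  ) %>%
  # year is part of the grouping, so week 32 of 2018 and week 32 of 2019 stay separate
  group_by(patientid, year, week) %>%
  mutate(count = ifelse(duplicated(adherence), NA_real_, adherence)) %>%
  ungroup()

res$count  # 4 NA 4 NA
```

Without year in group_by(), the 2019 rows would fall into the same group as the 2018 rows and both would become NA.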

latsyrhccm · April 26, 2020, 2:59pm

Worked as expected, thanks again for the help!
