Remove duplicated rows without affecting other variables

Hi,

I'm trying to remove duplicated values of one variable without removing or otherwise affecting the values of the other variables. I've tried distinct(), but it drops whole rows, so the dataframe gets shorter and every variable is affected.

My code and ideal output are below. Thanks for any help.

library(tidyverse)
library(lubridate)

#Variables
patientid <- c("-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646")

date <- c("2018-08-06", "2018-08-07", "2018-08-15", "2018-08-20", "2018-08-27", "2018-09-03", "2018-09-10",
          "2018-09-17", "2018-09-24", "2018-10-01", "2018-10-08", "2018-10-15", "2018-10-22", "2018-10-29",
          "2018-11-05", "2018-11-12")

week <- week(date)

adherence <- c(4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)  

#Sample dataframe
test.df <- data.frame(patientid, date, week, adherence)

#Ideal output
patientid date week adherence count
n         dmy  32   4         4
n         dmy  32   4         NA
n         dmy  33   3         3
n         dmy  33   3         NA

This keeps only the first instance of each unique adherence value per patient and sets the repeats to NA in a new count column:

test.df %>% 
  group_by(patientid) %>% 
  mutate(count = ifelse(duplicated(adherence), NA_real_, adherence))

The next block of code keeps the first value of each run, that is, the first row after every change in adherence. For example, if adherence later returns to 4, the first row of that second run of 4s is also kept. You can test this by changing adherence to:

adherence <- c(4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4)

test.df %>% 
  group_by(patientid) %>% 
  mutate(count = ifelse(c(TRUE, diff(adherence) != 0), adherence, NA_real_))
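
To see how the two approaches differ, here is a small base R sketch that applies both to the modified adherence vector suggested above. The names first_unique and first_of_run are just illustrative, not part of your data.

```r
# Modified vector: adherence returns to 4 at the end
adherence <- c(4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4)

# Approach 1: keep the first occurrence of each distinct value
first_unique <- ifelse(duplicated(adherence), NA_real_, adherence)

# Approach 2: keep the first value of each run (every time the value changes)
first_of_run <- ifelse(c(TRUE, diff(adherence) != 0), adherence, NA_real_)

# Positions that are kept (not NA) under each approach
which(!is.na(first_unique))  # 1 3
which(!is.na(first_of_run))  # 1 3 14
```

The second approach also keeps position 14, where the second run of 4s starts.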

Worked perfectly! Thanks for the help, much appreciated.

Update: it has mostly worked the way I wanted.

However, some values are being removed because they share the same week number, even though one is in 2018 and the other in 2019. I want to keep both in that case.

For this I was thinking of adding a new variable called year and grouping by it together with week to stop that from happening. Either that, or find a way to assign custom week numbers that run from 1 to n regardless of the year the week falls in.

It sounds like grouping by year and patientid would work. You can create a year column with mutate(year = year(ymd(date))). year and ymd are functions from lubridate. ymd is necessary only if date is not already of class Date in your real data.
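
As a rough sketch of that idea, assuming your real data look like the sample but span two years (the four rows below are made up for illustration), something along these lines should keep the first value within each patient and year:

```r
library(dplyr)
library(lubridate)

# Made-up rows spanning 2018 and 2019 for a single patient
df <- data.frame(
  patientid = "-2147483646",
  date = c("2018-08-06", "2018-08-07", "2019-08-06", "2019-08-07"),
  adherence = c(4, 4, 4, 3)
)

result <- df %>%
  mutate(year = year(ymd(date))) %>%   # ymd() parses the character dates
  group_by(patientid, year) %>%        # duplicates are now checked within each year
  mutate(count = ifelse(duplicated(adherence), NA_real_, adherence)) %>%
  ungroup()

result$count  # 4 NA 4 3
```

The 4 in 2019 is kept because duplicated() is evaluated separately for each patientid and year combination. If the week number matters for your real data, you could add week to the group_by() call as well.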


Worked as expected, thanks again for the help!
