Deduplication with some conditions

Hi,
I have a challenge with deduplication.
I have a file with four respondents aaa, bbb, ccc and ddd.

  1. Respondent aaa has all variables the same apart from two text fields (SERVICE_DESCRIPTION, BOOKING_NOTES).
  2. Respondent bbb has almost all variables the same apart from numerical values
  3. Respondent ccc has two identical records
  4. Respondent ddd has only one response

This is the file:

data.source <- data.frame(
     stringsAsFactors = FALSE,
                  URN = c("aaa", "aaa", "bbb", "bbb", "ccc", "ccc", "ddd"),
     RegistrationDate = c("2016-07-20",
                          "2016-07-20","2014-12-31","2014-12-31","2017-12-28",
                          "2017-12-28","2018-12-28"),
        InterviewDate = c("2019-01-03 08:31:00",
                          "2019-01-03 08:31:00","2019-01-05 20:04:00",
                          "2019-01-05 20:04:00","2019-01-11 19:59:00",
                          "2019-01-11 19:59:00","2019-01-11 19:59:00"),
  SERVICE_DESCRIPTION = c("Carry out  something",
                          "Check for uneven wear","bla bla","bla bla","clock",
                          "clock","xxx"),
        BOOKING_NOTES = c("detail 1",
                          "wear at rear","parts","parts","issue","issue","yyy"),
     Duration_Minutes = c(0, 0, 132, 132, 0, 0, 0),
       JOB_START_TIME = c("1900-01-01 00:00:00",
                          "1900-01-01 00:00:00","2018-11-22 11:00:00",
                          "2018-12-17 11:00:00","1900-01-01 00:00:00",
                          "1900-01-01 00:00:00","2018-01-01 00:00:00"),
         JOB_END_TIME = c("1900-01-01 00:00:00",
                          "1900-01-01 00:00:00","2018-11-22 15:30:00",
                          "2018-12-17 15:30:00","1900-01-01 00:00:00",
                          "1900-01-01 00:00:00","2018-01-01 00:00:00"),
                Model = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "CCC"),
              Mileage = c(35722, 35722, 15399, 15399, 8037, 8037, 8037),
                 YEAR = c(2016, 2016, 2014, 2014, 2017, 2017, 2017),
              ROTotal = c(503.35, 503.35, 164.91, 0, 189.88, 189.88, 189.88),
           LabourSale = c(203.5, 203.5, 116.6, 0, 114.3, 114.3, 114.3)
)

Now, I would like to remove all duplicates with the following conditions:

  1. If just numerical variables are different (bbb), use averages of these values
  2. If comment variables are different (SERVICE_DESCRIPTION, BOOKING_NOTES) and numerical values are different (aaa) , merge texts from comment variables and calculate average score from numerical variables
  3. If everything is the same (ccc) , keep one record
  4. Keep other records unchanged (ddd)

I've done this so far:

library(tidyverse)
data.dedup <- data.source %>% 
  group_by(URN,	RegistrationDate,	InterviewDate,	SERVICE_DESCRIPTION,	BOOKING_NOTES,	JOB_START_TIME,	JOB_END_TIME,	Model) %>% 
  summarise(Duration_Minutes = mean(Duration_Minutes), Mileage = mean(Mileage), YEAR = mean(YEAR), ROTotal = mean(ROTotal), LabourSale = mean(LabourSale))


library(tidyverse)
data.dedup2 <- data.source %>% 
  group_by(URN,	RegistrationDate,	InterviewDate,	JOB_START_TIME,	JOB_END_TIME,	Model) %>% 
  summarise(Duration_Minutes = mean(Duration_Minutes), Mileage = mean(Mileage), YEAR = mean(YEAR), ROTotal = mean(ROTotal), LabourSale = mean(LabourSale))

but I need a complex solution.
As a result, I need four rows of data.

Can you help?

I think your problem becomes simpler if you only group by the URN, then apply a function to each sub-data.frame (for each URN value). Something along these lines:

collapse_entries <- function(subdf){
  subdf %>%
    summarize(across(where(is.numeric),
                     mean),
              across(4:5, ~paste(unique(.x), collapse="/")))
}

data.source %>%
  group_by(URN) %>%
  nest() %>%
  mutate(collapsed_data = map(data, collapse_entries)) %>%
  select(-data) %>%
  unnest(collapsed_data)

Does that help?

Alright, then you just need to remove the columns that have been deduplicated, since they haven't been deduplicated in the data.source, the merge would keep all rows.

source.no.dup <- select(data.source,
             -c(where(is.numeric),"SERVICE_DESCRIPTION","BOOKING_NOTES","RegistrationDate","InterviewDate", "JOB_START_TIME","JOB_END_TIME")) %>%
  distinct()

merge(dedup.data, source.no.dup,
      by="URN", all=FALSE)

Note that I also had to remove JOB_START_TIME and JOB_END_TIME since they differ between the two bbbrows.

Thank you very much master!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.

Oh, no that was a typo from me: I first tried the summarize on a filtered df (so, with the URN column), then copy-pasted in the function; but of course the subdf in the function doesn't have the URN column. One clear reason not to use column numbers to select columns!

collapse_entries <- function(subdf){
  subdf %>%
    summarize(across(where(is.numeric),
                     mean),
              across(3:4, ~paste(unique(.x), collapse="/")))
}

That's one possibility, another is to keep them in the summary(), I just wasn't sure I understood exactly what you where trying to do. So this should keep RegistrationDate without modification:

collapse_entries <- function(subdf){
  subdf %>%
    summarize(across(where(is.numeric),
                     mean),
              across(3:4, ~paste(unique(.x), collapse="/")),
              RegistrationDate)
}

Of course, you can use across("RegistrationDate", unique) or something like that to collapse the identical dates.

1 Like

Wow! Excellent, although I need to keep remaining variables as all of them will be used later. Perhaps I could merge the result of your manipulation with the original file?
Also, how can I repeat merging BOOKING_NOTES (beautifully done) to the SERVICE_DESCRIPTION (not done yet)? Should across be related to the 4th and 5th variable? Only the 4th was merged...

I don't understand what you're trying to do with the merge? Isn't dedup.data the result you are aiming for?

Thank you very much.
I can see that using this code in my large data frame takes ages:

collapse_entries <- function(subdf){
  subdf %>%
    summarize(across(where(is.numeric),
                     mean),
              across(3:4, ~paste(unique(.x), collapse="/")),
              across("RegistrationDate", unique),
              across("InterviewDate", unique),
    )
}
dedup.data <- data.source %>%
  group_by(URN) %>%
  nest() %>%
  mutate(collapsed_data = map(data, collapse_entries)) %>%
  select(-data) %>%
  unnest(collapsed_data)

I think that adding missing variables by merging should be much quicker. Nevertheless, whichever type of merging I'm using I always have all 7 records rather than just 4. I have tried all options but the result is always the same: 7 rows :frowning:


collapse_entries <- function(subdf){
  subdf %>%
    summarize(across(where(is.numeric),
                     mean),
              across(3:4, ~paste(unique(.x), collapse="/"))
                            )
}
dedup.data <- data.source %>%
  group_by(URN) %>%
  nest() %>%
  mutate(collapsed_data = map(data, collapse_entries)) %>%
  select(-data) %>%
  unnest(collapsed_data)

merged.data <- merge(data.source, dedup.data, by="URN",all = FALSE,
                     sort = TRUE, no.dups = TRUE)
merged.data

What am I doing wrong?

Yes, I need dedup.data but including all variables.
I think there are two ways but merging dedup.data with the original data could add missing variables easily...