Means for duplicates with specific exclusions

Hi, I have this simple data frame with some duplicates:

data.source <- data.frame(
  stringsAsFactors = FALSE,
             RegNo = c("ccc","ccc","ddd","ddd",
                       "aaa","aaa","bbb","bbb","ddd","ddd"),
           RegDate = c("2013-10-26 00:00:00",
                       "2013-10-26 00:00:00","2013-04-30 00:00:00",
                       "2013-04-30 00:00:00","2007-05-21 00:00:00","2007-05-21 00:00:00",
                       "2015-03-30 00:00:00","2015-03-30 00:00:00",
                       "2013-04-30 00:00:00","2013-04-30 00:00:00"),
           Mileage = c(37230,37230,38163,38163,
                       68050,68050,32276,32276,38163,38165),
                CWI = c("C", "C", "C", "C", "W", "C", "C", "W", "W", "C"),
           RoTotal = c(725.75,725.75,178.71,178.71,
                       387.66,482.9,239.63,21.43,68.79,322.64),
        LabourCost = c(243.19,205.19,110.77,101.37,
                       157.14,245.74,140.6,21.43,57.14,50),
         PartsCost = c(401.34,306.01,19.77,33.34,
                       230.52,160.16,42.77,0,11.65,272.64),
  LastWorkshopDate = c("2018-10-19","2018-10-19",
                       "2018-04-26","2018-04-26","2018-06-04","2018-06-04",
                       "2018-03-22","2018-03-22","2018-04-26","2018-06-25"),
        SubletCost = c(20, 0, 0, 0, 0, 10, 25, 0, 0, 0)
)
data.source

All variables apart from LabourCost, PartsCost and SubletCost are identical for first two and the second two rows.

I know how to keep unique records only but ...
I need to keep unique records if RegNo RegDate, Mileage, LastWorkshopDate, CWI and RoTotal are the same BUT to have average for three numerical variables which are different. In other words, I want to replace values for duplicated rows 1-2 and 3-4 by means of LabourCost, PartsCost and SubletCost.

I also want the same but with max value for the above (instead of the average).

How can I do that?

I think you just need to group by the variables that define the repeated rows.

library(dplyr)
data.source |> group_by(RegNo, RegDate, Mileage, LastWorkshopDate, 
                        CWI, RoTotal) |>
  summarize(LabourCost = mean(LabourCost),
            PartsCost = mean(PartsCost),
            SubletCost = mean(SubletCost))

grpnames <- c(
  "RegNo", "RegDate", "Mileage",
  "LastWorkshopDate", "CWI", "RoTotal"
)

group_by_at(data.source, grpnames) |>
  mutate(across(
    .fns = list(mean = mean, max = max)
  ))

but I might not have understood your requirements

library(dplyr)
data.source |> group_by(RegNo, RegDate, Mileage, LastWorkshopDate, 
                        CWI, RoTotal) |>
  summarize(LabourCost = mean(LabourCost),
            PartsCost = mean(PartsCost),
            SubletCost = mean(SubletCost))

Excellent!
Is it possible to select somehow all variables apart from LabourCost, PartsCost and SubletCost in the group_by?

Thinking about it again, I can see that average of all numerical variables can be also calculated as mean of two identical values is the same as each of them. It means we could group by all non-numerical values and calculate means for all numerical ones. Can we do that?

Very interesting! Sorry if I was not clear. I wanted to remove two duplicated records keeping means scores so in the end I should have 8 rows of data

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.