De-duplicating but keeping some names and averaging numerical

Hi, I have this sample file with errors in Model name.

source <- data.frame(
   stringsAsFactors = FALSE,
         DealerName = c("aaa","aaa","aaa","aaa",
                        "aaa","aaa","aaa","aaa","aaa","aaa"),
             VSCode = c(8007,8007,8007,8007,8007,
                        8007,8007,8007,8007,8007),
      AccountNumber = c(109, 109, 109, 109, 109, 109, 109, 109, 109, 109),
              Model = c("aaa","aaa2","aaa3","fff",
                        "fff2","bbb","bbb","ccc","ccc","ddd"),
              RegNo = c("x32PF","x32PF","x32PF",
                        "z3PFA","z3PFA","xxx","xxx","yyy","yyy","zzz"),
            RegDate = c("2021-03-05","2021-03-05",
                        "2021-03-05","2021-06-02","2021-06-02","2015-11-20",
                        "2015-11-20","2017-03-03","2017-03-03","2003-03-28"),
            Mileage = c(3645,3645,3645,11164,
                        15907,15571,15571,20376,20376,35574),
  PreviousVisitDate = c(NA,NA,NA,NA,"2022-05-11",
                        NA,NA,"2021-03-04","2021-03-04",NA),
   LastWorkshopDate = c("2021-11-05","2021-11-05",
                        "2021-11-05","2022-05-11","2022-10-24","2018-01-24",
                        "2018-01-24","2022-03-03","2022-03-04","2021-05-28"),
                CWI = c("W", "W", "W", "C", "C", "C", "C", "C", "C", "W"),
            RoTotal = c(293.15,293.15,293.15,162.6,
                        171.49,435.68,435.68,210.25,33.29,82.44),
         LabourCost = c(11,30.24,30.24,94.5,51.28,
                        294.99,294.99,137.29,0,37.42),
          PartsCost = c(50,100,0,15.6,120.21,
                        140.69,140.69,20.46,0,45.02),
         SubletCost = c(100, 0, 0, 0, 0, 0, 0, 0, 24.99, 0),
            MOTFlag = c(FALSE,FALSE,FALSE,FALSE,
                        FALSE,FALSE,FALSE,TRUE,FALSE,FALSE)
)

Is it possible to remove duplicates this way:

  1. Group by RegNo
  2. If all non-numerical values apart from Model are the same
    a) keep the first model name
    b) keep all other non-numerical ones - because they are the same
    c) calculate means for numerical values
    In this example aaa2 and aaa3 should be removed and aaa details with averages for numerical values should be kept, one bbb should be removed and averages for numerical values should be kept

Is it easy to do without specifying to many variables in the code?

I believe the following returns the desired output.

library(tidyverse)

# vector of character columns (minus Model)
chr_cols = source %>% select_if(is.character) %>% names()
chr_cols = chr_cols[chr_cols != 'Model']

out = source %>%
  # get averages by RegNo
  group_by(RegNo) %>%
  mutate(across(is.numeric, mean)) %>%
  ungroup() %>%
  # arrange by Model and keep the first
  arrange(Model) %>%
  distinct_at(c(chr_cols), .keep_all = T)
  
out
#> # A tibble: 7 × 15
#>   DealerName VSCode AccountN…¹ Model RegNo RegDate Mileage Previ…² LastW…³ CWI  
#>   <chr>       <dbl>      <dbl> <chr> <chr> <chr>     <dbl> <chr>   <chr>   <chr>
#> 1 aaa          8007        109 aaa   x32PF 2021-0…   3645  <NA>    2021-1… W    
#> 2 aaa          8007        109 bbb   xxx   2015-1…  15571  <NA>    2018-0… C    
#> 3 aaa          8007        109 ccc   yyy   2017-0…  20376  2021-0… 2022-0… C    
#> 4 aaa          8007        109 ccc   yyy   2017-0…  20376  2021-0… 2022-0… C    
#> 5 aaa          8007        109 ddd   zzz   2003-0…  35574  <NA>    2021-0… W    
#> 6 aaa          8007        109 fff   z3PFA 2021-0…  13536. <NA>    2022-0… C    
#> 7 aaa          8007        109 fff2  z3PFA 2021-0…  13536. 2022-0… 2022-1… C    
#> # … with 5 more variables: RoTotal <dbl>, LabourCost <dbl>, PartsCost <dbl>,
#> #   SubletCost <dbl>, MOTFlag <lgl>, and abbreviated variable names
#> #   ¹​AccountNumber, ²​PreviousVisitDate, ³​LastWorkshopDate

Created on 2023-01-14 with reprex v2.0.2.9000

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.