Replace values of a variable by the most frequent value

Mos_Taf · July 17, 2018, 1:55pm

ID <- c("A", "A", "A", "B", "B", "c")
Value <- c("blue", "blue", "green", "red", "orange", NA)
  
df <- tibble(ID, Value)

I have to group by the ID, keep the value of Value which is the most repeated
If the values are equal (ID == "B"), I choose the first value

The value variable should look like :

Value_output <- c("blue", "blue", "blue", "red", "red", NA)

Thanks for help

cderv · July 17, 2018, 8:22pm

It is surely not the shortest solution but it is one approach that works: Create a table with the value you want for each ID then join to the first table

ID <- c("A", "A", "A", "B", "B", "c")
Value <- c("blue", "blue", "green", "red", "orange", NA)

library(tidyverse)
df <- tibble(ID, Value)
# create a replacement table
replacement <- df %>%
  # create ordered factor as we want first after
  mutate(Value = fct_inorder(Value)) %>%
  # group_by all columns
  group_by_all() %>%
  # count the occurrence in last column and sort
  tally(sort = TRUE) %>%
  # keep only the first row in each remaining group
  # (the max because sorted)
  slice(1) %>%
  ungroup() %>%
  select(ID, Value_output = Value)
#> Warning: le package 'bindrcpp' a été compilé avec la version R 3.4.4

df %>% 
  inner_join(replacement)
#> Joining, by = "ID"
#> # A tibble: 6 x 3
#>   ID    Value  Value_output
#>   <chr> <chr>  <fct>       
#> 1 A     blue   blue        
#> 2 A     blue   blue        
#> 3 A     green  blue        
#> 4 B     red    red         
#> 5 B     orange red         
#> 6 c     <NA>   <NA>

Created on 2018-07-17 by the reprex package (v0.2.0).

Sata · September 17, 2018, 3:23pm

This was very helpful for me as I had a very similar question... Could you maybe also give me a hint on how to do it if I have to group by ID and year? Thank you for your help.

mara · September 17, 2018, 3:27pm

Are you asking about if you have two grouping variables, ID and year? Or are you asking if you have the same scenario as above, except the two variables are ID and year (where you'd want the most frequent year to be the replacement value)?

Sata · September 17, 2018, 3:30pm

I am referring to two grouping variables!

mara · September 17, 2018, 3:52pm

It's the same deal, just with another grouping variable:

ID <- c("A", "A", "A", "B", "B", "c")
year <- c("1996", "1996", "1995", "1995", "1995", "1996")
Value <- c("blue", "blue", "green", "red", "orange", "blue")

library(tidyverse)
df <- tibble(ID, year, Value)
# create a replacement table
replacement <- df %>%
  mutate(Value = fct_inorder(Value)) %>%
  group_by_all() %>%
  tally(sort = TRUE) %>%
  slice(1) %>%
  ungroup() %>%
  select(ID, year, Value_output = Value)

replacement
#> # A tibble: 4 x 3
#>   ID    year  Value_output
#>   <chr> <chr> <fct>       
#> 1 A     1995  green       
#> 2 A     1996  blue        
#> 3 B     1995  red         
#> 4 c     1996  blue

^{Created on 2018-09-17 by the reprex package (v0.2.1.9000)}

Mos_Taf · September 24, 2018, 8:50am

Thanks for reply,

another way to do it :

library(dplyr)
df %>%
   group_by(ID) %>%
   arrange(ID, is.na(Value)) %>% # in case to keep non- NA elements for a tie
   mutate(Value_output = Mode(Value))

where

 Mode <- function(x) {
   ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
 }