Replace values of a variable by the most frequent value

ID <- c("A", "A", "A", "B", "B", "c")
Value <- c("blue", "blue", "green", "red", "orange", NA)
  
df <- tibble(ID, Value)

I have to group by the ID, keep the value of Value which is the most repeated
If the values are equal (ID == "B"), I choose the first value

The value variable should look like :

Value_output <- c("blue", "blue", "blue", "red", "red", NA)

Thanks for help

It is surely not the shortest solution but it is one approach that works: Create a table with the value you want for each ID then join to the first table

ID <- c("A", "A", "A", "B", "B", "c")
Value <- c("blue", "blue", "green", "red", "orange", NA)

library(tidyverse)
df <- tibble(ID, Value)
# create a replacement table
replacement <- df %>%
  # create ordered factor as we want first after
  mutate(Value = fct_inorder(Value)) %>%
  # group_by all columns
  group_by_all() %>%
  # count the occurrence in last column and sort
  tally(sort = TRUE) %>%
  # keep only the first row in each remaining group
  # (the max because sorted)
  slice(1) %>%
  ungroup() %>%
  select(ID, Value_output = Value)
#> Warning: le package 'bindrcpp' a été compilé avec la version R 3.4.4

df %>% 
  inner_join(replacement)
#> Joining, by = "ID"
#> # A tibble: 6 x 3
#>   ID    Value  Value_output
#>   <chr> <chr>  <fct>       
#> 1 A     blue   blue        
#> 2 A     blue   blue        
#> 3 A     green  blue        
#> 4 B     red    red         
#> 5 B     orange red         
#> 6 c     <NA>   <NA>

Created on 2018-07-17 by the reprex package (v0.2.0).

2 Likes

This was very helpful for me as I had a very similar question... Could you maybe also give me a hint on how to do it if I have to group by ID and year? Thank you for your help.

Are you asking about if you have two grouping variables, ID and year? Or are you asking if you have the same scenario as above, except the two variables are ID and year (where you'd want the most frequent year to be the replacement value)?

I am referring to two grouping variables!

It's the same deal, just with another grouping variable:

ID <- c("A", "A", "A", "B", "B", "c")
year <- c("1996", "1996", "1995", "1995", "1995", "1996")
Value <- c("blue", "blue", "green", "red", "orange", "blue")

library(tidyverse)
df <- tibble(ID, year, Value)
# create a replacement table
replacement <- df %>%
  mutate(Value = fct_inorder(Value)) %>%
  group_by_all() %>%
  tally(sort = TRUE) %>%
  slice(1) %>%
  ungroup() %>%
  select(ID, year, Value_output = Value)

replacement
#> # A tibble: 4 x 3
#>   ID    year  Value_output
#>   <chr> <chr> <fct>       
#> 1 A     1995  green       
#> 2 A     1996  blue        
#> 3 B     1995  red         
#> 4 c     1996  blue

Created on 2018-09-17 by the reprex package (v0.2.1.9000)

3 Likes

Thanks for reply,

another way to do it :

library(dplyr)
df %>%
   group_by(ID) %>%
   arrange(ID, is.na(Value)) %>% # in case to keep non- NA elements for a tie
   mutate(Value_output = Mode(Value))

where

 Mode <- function(x) {
   ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
 }
1 Like