Mode (most common value) function ignoring NA and returning largest value in the event of a tie

I am still a bit surprised there is no general mode function in R.
But in fact my requirements are more specific. I want:

  1. Most Common Value
  2. Work any data type: but specifically numbers, character, date and posix date
  3. Ignore NA
  4. Return largest value in the event of a tie

A few questions based on the code if I may:

  1. Is there an off the shelf function I can use
  2. I have supplied my current working function MostCommon. But it is quite slow. Can any suggest improvements
  3. I have shown an example using the textbook mode function which uses tabulate. tabulate says it only works for a numeric vector. But seems to work for characters and dates. Is there room to use it or something like it?
library("plyr")
library("dplyr")

# test for NA or empty string
is.empty = function (x) {
  classx = class(x)
  
  if(length(classx) > 1) {
    is.na(x)
  } else if (classx == "character") {
    if_else (is.na(x) , TRUE,  nchar(x) == 0)
  }
  else {
    is.na(x)
  }
}

# MostCommon: get mode.
# in event of tie, return largest value with the highest frequency. ignore NA

MostCommon <- function(VectorIn) {
  
  # if vector is only length 1 then return it  
  if (length(VectorIn) == 1) {
    returnval = VectorIn[1]
  } else {
    uniquevals = unique(VectorIn)
    if (length(uniquevals) == 1) {
      returnval <- uniquevals[1]
    } else {
      counts <- sapply(uniquevals, function(x) {sum(VectorIn == x, na.rm = TRUE)})
      freqtable <- data.frame(value = uniquevals, counts, stringsAsFactors = FALSE, row.names = NULL) %>% arrange(desc(counts), desc(value))
      freqtableminusempty <- freqtable %>% filter(!is.empty(value)) # take away empties
      if (nrow(freqtableminusempty) >= 1) {   # return highest non-empty
        returnval <- freqtableminusempty[1,] %>% pull(value)
      } else {
        returnval <- freqtable[1,] %>% pull(value) #return highest empty
      }
    }
  }
  returnval
}

# just like !%in% but easier to read
`%not in%` <- function (x, table) is.na(match(x, table, nomatch=NA_integer_))

# standard hardcoded specification of column name
speciescounts <- starwars %>% 
  group_by(species ) %>% 
  summarise(count = n()) %>%
  arrange(desc(count))
print(speciescounts)

cat("MostCommon All: ", MostCommon(starwars$species),"\n")

starwarsSpeciesCount2OrBelow <- starwars %>%
  filter(species %not in% c("Human", "Droid", "Gungan"))

cat("MostCommon excluding freq > 2: ", MostCommon(starwarsSpeciesCount2OrBelow$species),"\n")

# use the usual contributed solution for mode - adding na.rm option
modefunction <- function(x, na.rm = FALSE) {
  ux <- unique(x)
  if (na.rm) {
    ux <- ux[which(!is.na(ux))]
  }
  tab <- tabulate(match(x, ux))
  ux[which.max(tab)]
}

cat("modefunction: ", modefunction(starwarsSpeciesCount2OrBelow$species,na.rm = TRUE),"\n")

DateVector <- as.Date(c("2020/12/1", "2020/12/1", "2020/12/2","2020/12/2"))
print(modefunction(DateVector,na.rm = TRUE))
cat("modefunction Date: ", as.character( as.Date(modefunction(DateVector,na.rm = TRUE))),"\n")

DateVectorPosix <- as.POSIXct.Date(DateVector)
print(modefunction(DateVectorPosix,na.rm = TRUE))

Here is a version that uses table() and max(), so it only works for types where max() is valid (but that includes characters and Dates). It does seem to be 30x faster than MostCommon().

MostCommon2 <- function(x){
  tab <- table(x)
  candidates <- names(tab)[tab == max(tab)]
  max(candidates)
}

bench::mark(
  MostCommon = MostCommon(starwarsSpeciesCount2OrBelow$species),
  MostCommon2= MostCommon2(starwarsSpeciesCount2OrBelow$species)
)
# A tibble: 2 x 13
#   expression       min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr>  <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 MostCommon    4.53ms   4.9ms      199.    24.2KB     4.15    96     2      482ms
# 2 MostCommon2  150.3us 162.2us     5750.    8.77KB     6.37  2707     3      471ms
#  ... with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
1 Like

Thank you so much.

With other stuff going on I had never got around to improving what I knew was a "ragged" bit of code.
I did hit issues with your version: integers were returned as character (because of the use of names)
But I took your idea and used the underlying code to table (tabulate) and developed on that.
It think it might go yet faster based on a quick test. My whole code is sped up by factor 2/

Happy to add some notes on this if it helps anyone, or fix bugs
It yields the same results as my old version where it is used a lot within summarise()

MostCommon <- function(x) {
  ux <- unique(x)
  uxnotna <- ux[which(!is.na(ux))]
  if(length(uxnotna) > 0) {
    tab <- tabulate(match(x, uxnotna))
    candidates = uxnotna[tab == max(tab)]
    if (class(x)[1]  == "logical") {
      any(candidates) # return TRUE if any true. max returns an integer
    } else {
      max(candidates) # return highest (ie max) value
    }
  } else {
    ux   # this returns the NA with the right class. ie that of x
  }
}
1 Like

i believe it is generally preferred to use is.logical(x) instead of the above, as it doesnt rely on the class listing order for potentially multiclass vectors (as unlikely as it may be), its also easier to read/type

1 Like

That is much more robust. I agree.
I had not come across the idea of this being preferred.

This was only my second time using this forum and I have had very helpful responses indeed.

Code is simpler now too:

MostCommonPrecendenceLow <- function(x) {
  ux <- unique(x)
  uxnotna <- ux[which(!is.na(ux))]
  if(length(uxnotna) > 0) {
    tab <- tabulate(match(x, uxnotna))
    candidates = uxnotna[tab == max(tab)]
    if (is.logical(x)) {
      all(candidates) # only return TRUE if all true. max returns an integer
    } else {
      min(candidates) # return highest (ie max) value
    }
  } else {
    ux   # this returns the NA with the right class. ie that of x
  }
}

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.