Identifying and removing outliers from study data

Longshot408 · January 14, 2020, 8:25pm

Hello all,

I'm trying to figure out a way to remove outliers from my study based on completion time. Specifically, those who rushed through it too quickly, or who sat there forever and weren't putting in effort. I think the code for the Double MAD on this page is perfect for this, since completion times are not normally distributed and seriously skewed right. For example, the median completion time in my study is about 432 seconds, but some people have over 10,000 seconds (166 minutes) as their total time. This study shouldn't take any more than 12-15.

However, the code provided by the author (pasted below) simply prints the outliers identified instead of removing them. Given that I have 35 outliers in my data, I don't want to have to sort through it and remove them one by one. Does anyone know how to tweak this to get it to remove those identified by the function my my data set?

DoubleMAD <- function(x, zero.mad.action="warn"){
   # The zero.mad.action determines the action in the event of an MAD of zero.
   # Possible values: "stop", "warn", "na" and "warn and na".
   x         <- x[!is.na(x)]
   m         <- median(x)
   abs.dev   <- abs(x - m)
   left.mad  <- median(abs.dev[x<=m])
   right.mad <- median(abs.dev[x>=m])
   if (left.mad == 0 || right.mad == 0){
      if (zero.mad.action == "stop") stop("MAD is 0")
      if (zero.mad.action %in% c("warn", "warn and na")) warning("MAD is 0")
      if (zero.mad.action %in% c(  "na", "warn and na")){
         if (left.mad  == 0) left.mad  <- NA
         if (right.mad == 0) right.mad <- NA
      }
   }
   return(c(left.mad, right.mad))
}

DoubleMADsFromMedian <- function(x, zero.mad.action="warn"){
   # The zero.mad.action determines the action in the event of an MAD of zero.
   # Possible values: "stop", "warn", "na" and "warn and na".
   two.sided.mad <- DoubleMAD(x, zero.mad.action)
   m <- median(x, na.rm=TRUE)
   x.mad <- rep(two.sided.mad[1], length(x))
   x.mad[x > m] <- two.sided.mad[2]
   mad.distance <- abs(x - m) / x.mad
   mad.distance[x==m] <- 0
   return(mad.distance)
}

#EXAMPLE
x <- c(1, 4, 4, 4, 5, 5, 5, 5, 7, 7, 8, 10, 16, 30)
print(x[DoubleMADsFromMedian(x) > 3])

FJCC · January 14, 2020, 9:25pm

You just need to use the complementary logical condition of the one used in the last print statement.

DoubleMAD <- function(x, zero.mad.action="warn"){
  # The zero.mad.action determines the action in the event of an MAD of zero.
  # Possible values: "stop", "warn", "na" and "warn and na".
  x         <- x[!is.na(x)]
  m         <- median(x)
  abs.dev   <- abs(x - m)
  left.mad  <- median(abs.dev[x<=m])
  right.mad <- median(abs.dev[x>=m])
  if (left.mad == 0 || right.mad == 0){
    if (zero.mad.action == "stop") stop("MAD is 0")
    if (zero.mad.action %in% c("warn", "warn and na")) warning("MAD is 0")
    if (zero.mad.action %in% c(  "na", "warn and na")){
      if (left.mad  == 0) left.mad  <- NA
      if (right.mad == 0) right.mad <- NA
    }
  }
  return(c(left.mad, right.mad))
}

DoubleMADsFromMedian <- function(x, zero.mad.action="warn"){
  # The zero.mad.action determines the action in the event of an MAD of zero.
  # Possible values: "stop", "warn", "na" and "warn and na".
  two.sided.mad <- DoubleMAD(x, zero.mad.action)
  m <- median(x, na.rm=TRUE)
  x.mad <- rep(two.sided.mad[1], length(x))
  x.mad[x > m] <- two.sided.mad[2]
  mad.distance <- abs(x - m) / x.mad
  mad.distance[x==m] <- 0
  return(mad.distance)
}

#EXAMPLE
x <- c(1, 4, 4, 4, 5, 5, 5, 5, 7, 7, 8, 10, 16, 30)
print(x[DoubleMADsFromMedian(x) > 3])
#> [1]  1 16 30
x_cln <- x[DoubleMADsFromMedian(x) <= 3]
x_cln
#>  [1]  4  4  4  5  5  5  5  7  7  8 10

^{Created on 2020-01-14 by the reprex package (v0.3.0)}

Longshot408 · January 14, 2020, 11:26pm

This almost works. The problem is that x_cln and x are isolated vectors outside of my data set; both are just hanging out in the "Values" portion of my Environment. And since this newly cleaned variable has less rows than the main data set, I can't just leftjoin it back in.

I need a way to use this function to identify outliers, and then remove the entire row (i.e. person/response) in my data set associated with the given numbers.

Longshot408 · January 14, 2020, 11:56pm

I think I found a work-around solution, but could still use help. I tried finding the range of the cleaned vector to see where the cut points are (i.e. what values were retained after cleaning), and then using a statement to cut above that. There were none below it.

range(sona_cln)
sonadata<-sonadata[-(sonadata$Duration<=1100),]

For some reason though this code only removes a single outlier at a time; If I run it once the count of people drops from 430 to 429, and then again to 428 if I run it again, etc. Is there any reason why it would be doing this instead of removing all instances above 1100?

FJCC · January 15, 2020, 12:09am

If x has as many elements as sonadata has rows, you can use

sonadata <- sonadata[DoubleMADsFromMedian(x) <= 3, ]

That works because DoubleMADsFromMedian(x) <= 3 returns a vector of TRUE and FALSE values. Wherever it is FALSE, the entire row will be removed from sonadata

Longshot408 · January 15, 2020, 12:13am

that worked!! Thanks!!!!

system · January 22, 2020, 12:13am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.