Hello all,
I'm trying to figure out a way to remove outliers from my study based on completion time. Specifically, those who rushed through it too quickly, or who sat there forever and weren't putting in effort. I think the code for the Double MAD on this page is perfect for this, since completion times are not normally distributed and seriously skewed right. For example, the median completion time in my study is about 432 seconds, but some people have over 10,000 seconds (166 minutes) as their total time. This study shouldn't take any more than 12-15.
However, the code provided by the author (pasted below) simply prints the outliers identified instead of removing them. Given that I have 35 outliers in my data, I don't want to have to sort through it and remove them one by one. Does anyone know how to tweak this to get it to remove those identified by the function my my data set?
DoubleMAD <- function(x, zero.mad.action="warn"){
# The zero.mad.action determines the action in the event of an MAD of zero.
# Possible values: "stop", "warn", "na" and "warn and na".
x <- x[!is.na(x)]
m <- median(x)
abs.dev <- abs(x - m)
left.mad <- median(abs.dev[x<=m])
right.mad <- median(abs.dev[x>=m])
if (left.mad == 0 || right.mad == 0){
if (zero.mad.action == "stop") stop("MAD is 0")
if (zero.mad.action %in% c("warn", "warn and na")) warning("MAD is 0")
if (zero.mad.action %in% c( "na", "warn and na")){
if (left.mad == 0) left.mad <- NA
if (right.mad == 0) right.mad <- NA
}
}
return(c(left.mad, right.mad))
}
DoubleMADsFromMedian <- function(x, zero.mad.action="warn"){
# The zero.mad.action determines the action in the event of an MAD of zero.
# Possible values: "stop", "warn", "na" and "warn and na".
two.sided.mad <- DoubleMAD(x, zero.mad.action)
m <- median(x, na.rm=TRUE)
x.mad <- rep(two.sided.mad[1], length(x))
x.mad[x > m] <- two.sided.mad[2]
mad.distance <- abs(x - m) / x.mad
mad.distance[x==m] <- 0
return(mad.distance)
}
#EXAMPLE
x <- c(1, 4, 4, 4, 5, 5, 5, 5, 7, 7, 8, 10, 16, 30)
print(x[DoubleMADsFromMedian(x) > 3])