Is there a function to replace outliers?

I have a big dataset and need to replace outliers with the mean of the variable. Is there a function to do that? Let's take an example with the small dataset below:
data <- airquality
View(data)

library(outliers)
outlier(data)
The following outliers are found:
  Ozone Solar.R    Wind    Temp   Month     Day
  168.0     7.0    20.7    56.0     9.0    31.0

How can I replace them with the column mean? Thank you!

The first step must be to calculate the column means. Would you calculate them from the whole dataset or only from the non-outliers?

Try looking at this small example:

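set.seed(850692)
# 90 standard-normal values plus 10 with a much larger spread
s <- c(rnorm(90), rnorm(10, sd = 10))
# b$out holds the values that boxplot() flags as outliers
b <- boxplot(s, plot = FALSE)
s1 <- s
# replace the outliers with the mean of the whole vector
s1[which(s %in% b$out)] <- mean(s)
# compare before and after, side by side
par(mfrow = c(1, 2))
boxplot(s)
boxplot(s1)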

Hope it helps :)
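
The example above replaces the outliers with the mean of the whole vector. If the mean of the non-outliers or the median is preferred, a minimal sketch could look like the one below. It assumes that the boxplot rule is the outlier definition you want, and that the median is taken over the whole vector; the names `s_mean` and `s_median` are just illustrative.

```r
set.seed(850692)
s <- c(rnorm(90), rnorm(10, sd = 10))

# boxplot.stats() applies the same rule as boxplot(): points lying more than
# coef (default 1.5) times the box length (roughly the IQR) beyond the hinges
# are returned in $out.
out <- boxplot.stats(s)$out
is_out <- s %in% out

# Option 1: replace the outliers with the mean of the non-outliers only
s_mean <- s
s_mean[is_out] <- mean(s[!is_out])

# Option 2: replace the outliers with the median of the whole vector
# (an assumption; the median is already fairly robust to outliers)
s_median <- s
s_median[is_out] <- median(s)

# Compare the three versions
summary(cbind(original = s, mean_repl = s_mean, median_repl = s_median))
```

Note that after the replacement the quartiles change, so a new boxplot of the result may flag a few further points as outliers.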


The mean of the non-outliers would be best. If that is not easy to calculate, the mean of the entire column is also OK, or the outliers could be replaced with the median. Thank you!

Is it possible to apply this to a whole dataset (multiple columns)? Thank you!

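# Replace the boxplot outliers of a numeric vector with the mean of the non-outliers
do_col <- function(x) {
  b <- boxplot(x, plot = FALSE)  # b$out holds the values beyond the whiskers
  is_out <- x %in% b$out
  x[is_out] <- mean(x[!is_out], na.rm = TRUE)
  x
}

# (testvec <- c(rep(1, 9), 100))
# do_col(testvec)
library(tidyverse)
columns_to_do <- names(select_if(iris, is.numeric))

# Apply do_col() to every numeric column and bind the results into one data frame
purrr::map_dfc(columns_to_do,
               ~ do_col(iris[[.]])) %>% set_names(columns_to_do)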
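
For the `airquality` data from the original question, which contains `NA` values, the same idea can be sketched in base R without extra packages. The helper name `replace_outliers` and its `fun` argument are made up for this illustration and are not part of any package. As above, the sketch assumes the boxplot rule is an acceptable definition of an outlier.

```r
# Hypothetical helper: replace the boxplot outliers of a numeric vector.
# `fun` is the summary applied to the non-outliers (mean by default).
replace_outliers <- function(x, fun = mean) {
  out <- boxplot.stats(x)$out   # NAs are ignored when finding outliers
  is_out <- x %in% out          # FALSE for NA, so missing values stay NA
  x[is_out] <- fun(x[!is_out], na.rm = TRUE)
  x
}

data <- airquality
num_cols <- vapply(data, is.numeric, logical(1))

# Overwrite only the numeric columns; any other columns are left untouched
cleaned <- data
cleaned[num_cols] <- lapply(data[num_cols], replace_outliers)

# Same idea with the median of the non-outliers instead of the mean
cleaned_median <- data
cleaned_median[num_cols] <- lapply(data[num_cols], replace_outliers, fun = median)

summary(cleaned)
```

Keeping the replacement statistic as an argument makes it easy to switch between mean and median without rewriting the function.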

Thank you!

