How do I remove outliers?

Let's say I have a data set called D with n rows and m columns. The data set contains some categorical variables, and I need to analyze some variables with respect to those categorical variables. How do I remove the outliers from the entire data set? I tried rm.outlier() from the outliers package, but it doesn't do what I want: it returns a new array instead of removing the entire row where the outlier is.

Any ideas on how to solve this problem?

I'll preface what I really want to say with an initial comment: an outlier is a somewhat subjective and context-dependent notion.
Having gotten that out of the way, I'd like to ask how you came to this idea of removing columns that contain outlying values. It seems like a recipe for removing all your columns...

Removing outliers is not about removing columns. In some cases, you may choose to remove rows that have an outlier in one or more columns.

I confused columns with rows in my head. Sorry, I'm not a native English speaker.

Yes, my bad, sorry. I got confused.

And how do I do that?

Ok, that's fine.
Do you have any particular definition of outlier that makes sense for your context (which you haven't shared with us yet) that you wish to apply?

Yes.

A value below the first quartile minus 1.5 times the IQR, or above the third quartile plus 1.5 times the IQR.
These are the dots drawn by boxplots, as I understand it.
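
To make that rule concrete, here is a minimal base R sketch of it on a single numeric vector. The helper name is_iqr_outlier() and its argument k are hypothetical, introduced only for illustration, and the quartiles come from quantile() with its default settings.

```r
# Hypothetical helper: TRUE for values outside the 1.5 * IQR fences
is_iqr_outlier <- function(x, k = 1.5) {
  # first and third quartiles (default quantile() method)
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
outliers <- x[is_iqr_outlier(x)]
outliers
```

One caveat: boxplot() takes its box edges from the hinges computed by fivenum(), which can differ slightly from quantile(), so the dots on a boxplot may not always match this rule exactly.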

Here is an example using the mtcars dataset.

library(tidyverse)

df_of_interest <- mtcars

# the data.frame may contain non-numeric columns,
# so determine the names of the numeric columns
(to_do <- df_of_interest %>%
  select(where(is.numeric)) %>%
  names())

# calculate the first and third quartiles of each numeric column
calc_quants <- function(x) {
  map(
    to_do,
    ~ {
      enframe(quantile(x[[.]], probs = c(.25, .75)),
        value = .
      )
    }
  ) %>% reduce(left_join)
}

(inner_quartile_df <- calc_quants(df_of_interest))


# reorganise the quartile info: one data frame per quantile (25% and 75%)
(iq_df2 <- inner_quartile_df %>%
  rename(quantile = name) %>%
  pivot_longer(cols = -"quantile") %>%
  group_by(quantile) %>%
  group_split())

(iqr_df <- left_join(iq_df2[[1]],
  iq_df2[[2]],
  by = "name"
) %>%
  select(name, lower = value.x, upper = value.y) %>%
  mutate(
    iqr = upper - lower,
    low_crit = lower - iqr * 1.5,
    hi_crit = upper + iqr * 1.5
  ))

# for each column to process, determine the rows it would omit,
# then collate these into one sorted vector of unique row numbers
(rows_to_omit <- map(
  to_do,
  ~ {
    ovec <- pull(
      df_of_interest,
      .x
    )
    criteria <- filter(iqr_df, name == .x)
    which(!between(
      x = ovec, left = criteria$low_crit,
      right = criteria$hi_crit
    ))
  }
) %>% unlist() %>% sort() %>% unique())

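# finish: drop the flagged rows
# (guard the empty case: slice(-integer(0)) would return zero rows)
new_data <- if (length(rows_to_omit) > 0) {
  df_of_interest %>% slice(-rows_to_omit)
} else {
  df_of_interest
}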
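
For what it's worth, the same row-wise removal can also be sketched more compactly in base R. This is not the approach above, just an alternative under the same 1.5 * IQR rule; is_iqr_outlier() and the names flags, keep, and cleaned are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: TRUE for values outside the 1.5 * IQR fences
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

df <- mtcars

# only numeric columns are checked; categorical columns are left alone
num_cols <- vapply(df, is.numeric, logical(1))

# logical matrix: one row per observation, one column per numeric variable
flags <- vapply(df[num_cols], is_iqr_outlier, logical(nrow(df)))

# keep a row only if none of its numeric values is flagged
# (NA values are treated as "not an outlier")
keep <- rowSums(flags, na.rm = TRUE) == 0
cleaned <- df[keep, , drop = FALSE]

nrow(df) - nrow(cleaned) # number of rows dropped
```

Note that the fences are computed once on the full data. Re-running the same check on cleaned may flag new values, because the quartiles shift after rows are removed.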

Thanks!!

I'll definitely check this out!
