How do I remove outliers?

EphraMP · March 12, 2022, 7:29am

Let's say I have a data set called D with n rows and m columns, and some of the columns are categorical variables. I need to analyze some variables with respect to the categorical ones. How do I remove the outliers from the entire data set? I tried rm.outlier() from the outliers package, but it doesn't do what I want: it returns a new array instead of removing the entire row where the outlier is.

Any ideas on how to solve this problem?

nirgrahamuk · March 14, 2022, 11:06am

I'll prefix what I really want to say with an initial comment: an outlier is a somewhat subjective and context-dependent notion...
Having gotten that out of the way, I'd like to ask how you came to this idea of removing columns that contain outlying values? It seems like a recipe for removing all your columns...

arthur.t · March 14, 2022, 12:13pm

Removing outliers is not about removing columns. In some cases, you may choose to remove rows that have an outlier in one or more columns.

EphraMP · March 14, 2022, 5:11pm

I confused columns with rows in my head. Sorry, I'm not a native English speaker.

EphraMP · March 14, 2022, 5:12pm

Yes, my bad, sorry. I got confused.

And how do I do that?

nirgrahamuk · March 14, 2022, 6:28pm

OK, that's fine.
Do you have a particular definition of outlier that makes sense for your context (which you haven't shared with us yet) and that you wish to apply?

EphraMP · March 14, 2022, 10:59pm

Yes.

A value below the first quartile minus 1.5 times the IQR, or above the third quartile plus 1.5 times the IQR.
Those are the dots drawn by boxplots, as I understand it.
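
A minimal sketch of that rule on a single numeric vector, assuming the quartiles come from quantile() with its default settings. The helper name is_iqr_outlier and the sample vector are made up for illustration. Note that boxplot() computes its box from the hinges returned by fivenum(), which can differ slightly from these quartiles on small samples, so the flagged points may not always match the plotted dots exactly.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]; k = 1.5 is the usual choice.
# is_iqr_outlier is a hypothetical helper, not part of any package.
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

x <- c(10, 12, 11, 13, 12, 11, 40)  # made-up data with one extreme value
flagged <- is_iqr_outlier(x)
x[!flagged]                          # the values that survive the rule
```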

nirgrahamuk · March 15, 2022, 11:48am

Here is an example that uses the mtcars dataset as the data.

library(tidyverse)

df_of_interest <- mtcars

# the data.frame may contain more than just numeric columns,
# so determine the names of the numeric columns
(to_do <- df_of_interest %>%
  select(where(is.numeric)) %>%
  names())

# calculate the first and third quartiles of every numeric column
# (note: this function relies on `to_do` defined above)
calc_quants <- function(x) {
  map(
    to_do,
    ~ {
      enframe(quantile(x[[.]], probs = c(.25, .75)),
        value = .
      )
    }
  ) %>% reduce(left_join, by = "name") # join on the quantile label explicitly
}

(inner_quartile_df <- calc_quants(df_of_interest))


# reorganise the quartile info: one data frame per quantile (25% and 75%)
(iq_df2 <- inner_quartile_df %>%
  rename(quantile = name) %>%
  pivot_longer(cols = -"quantile") %>%
  group_by(quantile) %>%
  group_split())

(iqr_df <- left_join(iq_df2[[1]],
  iq_df2[[2]],
  by = "name"
) %>%
  select(name, lower = value.x, upper = value.y) %>%
  mutate(
    iqr = upper - lower,
    low_crit = lower - iqr * 1.5,
    hi_crit = upper + iqr * 1.5
  ))

# for each column to process, determine the rows it would omit,
# then collate these into one sorted vector of unique row numbers
(rows_to_omit <- map(
  to_do,
  ~ {
    ovec <- pull(
      df_of_interest,
      .x
    )
    criteria <- filter(iqr_df, name == .x)
    which(!between(
      x = ovec, left = criteria$low_crit,
      right = criteria$hi_crit
    ))
  }
) %>% unlist() %>% sort() %>% unique())

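# finish: drop the flagged rows
# (filtering on row_number() also works when rows_to_omit is empty,
# whereas slice() with an empty negative index would keep no rows)
new_data <- df_of_interest %>%
  filter(!row_number() %in% rows_to_omit)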
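
For comparison, the same idea (drop every row that has an outlier in at least one numeric column) can be sketched more compactly in base R. This is only a sketch under the 1.5 times IQR definition given earlier in the thread. The function names is_iqr_outlier and remove_outlier_rows are hypothetical, and treating NA values as "not an outlier" is an assumption you may want to change.

```r
# Hypothetical helper: TRUE where a value lies outside the 1.5 * IQR fences.
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical helper: keep only rows with no outlier in any numeric column.
remove_outlier_rows <- function(df, k = 1.5) {
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  if (length(num_cols) == 0) {
    return(df) # nothing to check, e.g. only categorical columns
  }
  flags <- lapply(df[num_cols], is_iqr_outlier, k = k)
  # A row is dropped if any column flags it; NA flags count as "not an outlier".
  drop_row <- Reduce(`|`, lapply(flags, function(f) !is.na(f) & f))
  df[!drop_row, , drop = FALSE]
}

cleaned <- remove_outlier_rows(mtcars)
nrow(mtcars) - nrow(cleaned) # number of rows removed
```

Non-numeric columns (factors, characters) are carried along untouched, so the categorical variables stay available for the later analysis. The fences are computed once on the full data; a row is judged against those original fences, not recomputed after each removal.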

EphraMP · March 16, 2022, 5:10am

Thanks!!

I'll definitely check this out!
