Removing outliers for each variable

I can see the outliers in that variable with the following code:

ggplot(df_a) +
  aes(x = "", y = SupDem) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()

But after showing them, how can I remove them from that specific variable in my dataset?

Removal might mean that you want to throw away the entire row of information, or that you want to impute missing values, i.e. fill them in.

It makes much more sense to replace the outliers with missing values. How can I set the outliers to missing values?

I guess there is a distinction here: you can either take the outlier value, set it to NA (missing), and leave it at that, or go an extra step and impute what the value may have been.
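To make the two options concrete, here is a minimal base R sketch. The values and the `is_outlier` flag are made up for illustration, and the median fill is only a simple stand-in for a proper imputation method, not a recommendation.

```r
# Hypothetical values; the last one is assumed to have been flagged as an outlier
x <- c(10, 12, 11, 13, 12, 95)
is_outlier <- c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

# Option 1: mark the outlier as missing and stop there
x_na <- replace(x, is_outlier, NA)

# Option 2: go one step further and fill the gap
# (median fill is only a simple stand-in for a real imputation method)
x_filled <- replace(x_na, is.na(x_na), median(x_na, na.rm = TRUE))

x_na
x_filled
```

With option 1 the gap stays visible in later analysis; with option 2 the filled value is an estimate and should be treated as such.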

The mice package is popular for missing-value imputation, i.e. the second step.

The geom_boxplot documentation explains that it treats values more than 1.5 * IQR beyond the lower or upper quartile as outliers.
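Base R applies a similar 1.5 * IQR rule in `boxplot.stats()`, which can serve as a quick cross-check. The vector below is made up. Note that `boxplot.stats()` works from hinges, which for some sample sizes differ slightly from the quartiles returned by `quantile()`, so treat this as a sketch rather than an exact reproduction of the plot.

```r
# Hypothetical values with one obviously high entry
x <- c(21, 23, 24, 25, 26, 27, 29, 44)

# coef = 1.5 is the default multiplier applied to the box length
bs <- boxplot.stats(x, coef = 1.5)

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values flagged as outliers
```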

Consider this example:

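library(dplyr)
library(ggplot2)  # also provides the mpg dataset

# keep only the "compact" class of cars and show the hwy column
(mpg_compact <- filter(mpg, class == "compact")) |> select(hwy)
p <- ggplot(mpg_compact, aes(y = hwy))
p + geom_boxplot()

(myvec <- mpg_compact$hwy)
(lower_quartile <- quantile(myvec, .25))
(upper_quartile <- quantile(myvec, .75))
# named iqr rather than IQR to avoid masking stats::IQR()
(iqr <- upper_quartile - lower_quartile)
(above <- 1.5 * iqr)
# upper limit beyond which points are drawn as outliers
(upper_whisker <- upper_quartile + above)
# location of the outliers
(is_upper_outlier <- myvec > upper_whisker)
which(is_upper_outlier)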

Note that in this example the 3 dots visible in the plot are actually 4 data entries, because two of them have the same value (35).
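The same steps can be adapted to a single column and extended to both ends of the distribution. The sketch below assumes a data frame named df_a with a numeric column SupDem. The values are made up, and `outliers_to_na()` is a hypothetical helper written for this example, not a function from any package.

```r
# Hypothetical helper: set values beyond coef * IQR from the quartiles to NA
outliers_to_na <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  lower_fence <- q[[1]] - coef * iqr
  upper_fence <- q[[2]] + coef * iqr
  # existing NAs are left alone; only flagged values are overwritten
  x[!is.na(x) & (x < lower_fence | x > upper_fence)] <- NA
  x
}

# Made-up stand-in for the real dataset
df_a <- data.frame(SupDem = c(50, 52, 49, 51, 53, 48, 50, 120, 2))

# Replace outliers in that one column with NA
df_a$SupDem <- outliers_to_na(df_a$SupDem)
df_a

# Alternatively, drop the affected rows entirely
df_clean <- df_a[!is.na(df_a$SupDem), , drop = FALSE]
```

Setting the values to NA keeps the rows available for the other variables, which matters when outliers are handled one variable at a time.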

What does "compact" represent in this example, if I may ask? I have only one column to process; it is called SupDem, and the dataset is named df_a.
