Winsorize data by column

Hi! I'm new to R and would like to winsorize my data since trimming is no option due to my limited number of observations.

My data looks like follow, in total I have 131 observations:

                         company   id    rev   size age 
1                           Adeg 29.9   0.66    160  45     
2                         Agrana 32.0   2.80   9191  29     
3                        Allianz 36.5  87.75 142460 128     
4                        Andritz 34.0   6.89  29096 118     
5                          Apple 41.0 259.65 132000  41

i would like to winsorize the variable " rev ", so actually winsorize my data by column (not the whole dataset). If i use the function winsorize(data1), an error appears since variable company is not numeric.

  1. How can i winsorize the data by column?

To avoid this problem I created a second dataset, data2 without variable company. By applying function winsorize(data2), another error appears again:

winsorize(data2)
Error in eigen(R, symmetric = TRUE) : infinite or missing values in 'x'
  1. How can I winsorize my data correctly?

I would be very happy if someone could help me out!

Hi @LAH_17,

You can apply the Winsorize() function to a specific column of a data set with:

library(dplyr)

iris %>% 
  mutate(wins_var = Winsorize(Sepal.Length))

You can replace the data set and variables with your own. Note: I assumed you were using the Winsorize() function from the DescTools package, because you didn't specify

hi @mattwarkentin
no, i used the winsorize() function form robustHD package. What is the difference?

I have also found following suggestion (still robustHD package):
winsorize(data1$rev)
The output looks as follow:

winsorize(data1$rev)
  [1]  0.66000  2.80000 21.22521  6.89000 21.22521  0.09000 21.22521  1.36000  5.48000  0.70000  0.79000  0.35000
 [13] 21.22521  0.55000  0.94000  0.06000 12.36000 13.58000  7.95000  0.29000  7.80000  0.39000 21.22521  0.09000
 [25] 21.22521  0.27000  0.32000  0.08000  0.03000  0.41000 21.22521  0.66000 20.91000  0.67000  0.04000  1.39000
 [37]  0.17000  0.14000  1.79000  0.05000  2.52000  3.68000  0.24000  0.09000 21.22521  8.43000  0.20000  0.17000
 [49] 21.22521  3.05000  0.07000  0.01000  0.82000  0.57000 21.22521  0.28000  0.05000  5.72000  6.12000  4.09000
 [61]  0.05000  0.22000 21.22521 21.22521 21.22521  0.20000 17.32000  5.63000  3.25000  0.12000  0.02000  0.07000
 [73] 10.89000  3.79000  1.89000 21.22521  9.98000 10.58000 21.22521 21.22521 15.55000 15.21000  5.93000 21.22521
 [85]  1.59000  3.00000 11.19000  6.10000  0.08000 21.22521 21.22521 17.74000 20.92000  6.46000  3.18000  0.01000
 [97]  0.81000  9.15000 21.22521  0.00000  1.34000  7.97000 21.22521 21.22521 21.22521  0.38000  0.65000 21.22521
[109]  9.44000  8.66000  5.30000 11.83000 20.06000 21.22521  1.15000  2.32000 21.22521 21.22521  9.98000 10.58000
[121] 11.05000 21.22521 21.22521 15.55000 15.21000  5.93000  1.59000  1.03000  3.00000 11.19000  6.10000

Are these my outliers or what are these values?

I recommend consulting the help document for the winsorize function with ?winsorize. This is a good starting point when using a function whereby you may not know how to use it or what it returns. There is a section titled Value which describes what the function returns.

If standardize is TRUE and return is "weights", a set of data cleaning weights. Multiplying each
observation of the standardized data by the corresponding weight yields the cleaned standardized
data.
Otherwise an object of the same type as the original data x containing the cleaned data is returned.

So depending on how you call the function, you either get the weights, or the cleaned data. Based on how you called the function in the above code, those values are the cleaned data (i.e. the original data but with outliers shrunk).

The same approach should as before should apply:

data1 %>%
  mutate(rev2 = winsorize(rev))

thank you! @mattwarkentin
so i also tried the winsorize function from the DescTools package. Since I have generated a single vector with my variable of interested, i tried to run the function as follow:
Winsorize(rev_vector), where rev_vector is numeric.
R gives me following error:

> Winsorize(rev_vector)
Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) : 
  undefined columns selected

I haven't found any replies on such an issue yet... Any ideas what this means?

Hi @LAH_17,

I was able to reproduce the error with the following code:

iris %>% select(Sepal.Length) %>% DescTools::Winsorize(.)

Error in [.data.frame(x, order(x, na.last = na.last, decreasing = decreasing)) :
undefined columns selected

The issue is that Winsorize() only accepts a single numeric vector as its first argument. A data frame with only one variable is still a data frame, not a vector. So based on the error you are getting, it seems like your single variable is still a variable in a data frame. You will need to extract the vector from the data frame like so:

DescTools::Winsorize(rev_vector$rev) # where rev is the name of the variable

Or use an approach that ensures the function is ran inside the context of the data frame. Here are two such examples:

# Base R approach
with(rev_vector, DescTools::Winsorize(rev))
# Tidy approach
library(dplyr)
rev_vector %>%
  mutate(rev2 = DescTools::Winsorize(rev))