smote and perc.over and perc.under

Hi, I'm using the SMOTE function, but I don't understand the perc.over and perc.under arguments or the logic behind them. Thanks for your attention.

Hi, have you had a look at the documentation? SMOTE function - RDocumentation

Yes, I did, but I didn't understand the logic. Is the associated number a percentage? How many new observations are generated if, for example, I set perc.over = 200? Thanks for your help.

First of all, I want to make clear what SMOTE, the method, is doing. SMOTE generates artificial data points for the minority class by placing them on straight lines between minority class observations and their nearest neighbors in that class.
SMOTE(), the function, doesn't just perform SMOTE; it also performs undersampling by randomly removing observations from the majority class.
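
To make the "straight paths" idea concrete, here is a minimal base R sketch of the interpolation step only. The helper interpolate_point() and the two example points are hypothetical and not part of DMwR; the real function also has to find the nearest neighbors and handle non-numeric columns.

```r
# Hypothetical helper, not part of DMwR: create one synthetic point on the
# straight line between a minority observation and one of its neighbours.
interpolate_point <- function(x, neighbour, gap = runif(1)) {
  # gap = 0 returns x itself, gap = 1 returns the neighbour
  x + gap * (neighbour - x)
}

x         <- c(5.1, 3.5)  # a made-up minority observation
neighbour <- c(4.9, 3.0)  # a made-up nearest neighbour from the same class

# Halfway between the two points
interpolate_point(x, neighbour, gap = 0.5)
#> [1] 5.00 3.25
```

With the default gap = runif(1), every call lands at a random position on that line segment.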

I'll create an example data set to help here. It has the majority class "common" and minority class "rare".

library(DMwR)
library(dplyr)

new_iris <- iris[-(1:27), ]
new_iris$Species <- factor(ifelse(new_iris$Species == "setosa","rare","common"))

new_iris %>% count(Species)
#>   Species   n
#> 1  common 100
#> 2    rare  23

I'm going over the two arguments perc.over and perc.under one at a time to clarify what is happening, starting with perc.over. To make things clearer I am going to set perc.under = 0 for the following examples. This eliminates the majority class so we can focus on what happens to the minority class.

If we set perc.over = 100 we get 100% new SMOTEd observations: for each of the "rare" observations we get one more, so we end up with 23 + 23 = 46 observations.

SMOTE(Species ~ ., new_iris, perc.over = 100, perc.under = 0) %>%
  count(Species)
#>   Species  n
#> 1    rare 46

And this pattern continues when you increase perc.over, such that we have 23 + 23 * 2 = 69

SMOTE(Species ~ ., new_iris, perc.over = 200, perc.under = 0) %>%
  count(Species)
#>   Species  n
#> 1    rare 69

and 23 + 23 * 6 = 161.

SMOTE(Species ~ ., new_iris, perc.over = 600, perc.under = 0) %>%
  count(Species)
#>   Species   n
#> 1    rare 161

It is worth noting here that, above 100, perc.over is rounded down to the nearest multiple of 100. So 200, 201, 250, and 299.999 all return the same number of observations, as SMOTE() will try to generate the same number of new points for each observation in the minority class.

SMOTE(Species ~ ., new_iris, perc.over = 290, perc.under = 0) %>%
  count(Species)
#>   Species  n
#> 1    rare 69

Lastly, if perc.over is between 0 and 100, it generates a proportion accordingly. Here we set perc.over = 25, which means we generate synthetic observations based on 25% of the minority observations: floor(23 + 23 * 0.25) = 28.

SMOTE(Species ~ ., new_iris, perc.over = 25, perc.under = 0) %>%
  count(Species)
#>   Species  n
#> 1    rare 28
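
The perc.over rules above can be collected into a small base R helper. This is only a sketch of the arithmetic as described in this post; expected_minority() is a hypothetical name and not part of DMwR, and it predicts counts without generating any data.

```r
# Hypothetical helper (not part of DMwR): expected size of the minority class
# after SMOTE(), following the rules described above.
expected_minority <- function(n_min, perc.over) {
  if (perc.over < 100) {
    # below 100: only a fraction of the minority cases gets one synthetic point
    n_new <- floor(n_min * perc.over / 100)
  } else {
    # 100 and above: each minority case gets floor(perc.over / 100) synthetic points
    n_new <- n_min * floor(perc.over / 100)
  }
  n_min + n_new
}

# The perc.over values used in the examples above, with 23 "rare" observations
sapply(c(25, 100, 200, 290, 600), expected_minority, n_min = 23)
#> [1]  28  46  69  69 161
```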

Now we bring back perc.under. perc.under is a percentage of the number of synthetic observations that were created for the minority class, and it sets how many majority class observations end up in the result. In the following example, 2 * 23 = 46 observations were added to the minority class "rare", so perc.under = 100 means that we get 100% of those 46, that is, 46 "common" observations.

SMOTE(Species ~ ., new_iris, perc.over = 200, perc.under = 100) %>%
  count(Species)
#>   Species  n
#> 1  common 46
#> 2    rare 69

perc.under isn't rounded to the nearest 100, so you can use any value you want here. Just remember that it is a percentage: with perc.under = 180 we get floor(46 * 1.8) = 82.

SMOTE(Species ~ ., new_iris, perc.over = 200, perc.under = 180) %>%
  count(Species)
#>   Species  n
#> 1  common 82
#> 2    rare 69

This means that the number of majority cases is determined by both arguments, using the formula floor(number of minority * floor(perc.over/100) * perc.under/100). With perc.over = 400 and perc.under = 180 that gives floor(23 * 4 * 1.8) = 165.

SMOTE(Species ~ ., new_iris, perc.over = 400, perc.under = 180) %>%
  count(Species)
#>   Species   n
#> 1  common 165
#> 2    rare 115

Thanks. You were very clear and explanatory! :grinning:
