I am working on a classification problem with some imbalanced data. It was suggested that I try the SMOTE method to oversample the minority class. I found several online references to SMOTE in R, and the most popular implementation seems to be the one in the DMwR package. I also found a reference to the 'unbalanced' package.
On my real data, I am receiving the message in the title:
Error in T[i, ] : subscript out of bounds
In addition: There were 20 warnings (use warnings() to see them)
warnings()
Warning messages:
1: In FUN(newX[, i], ...) : no non-missing arguments to max; returning -Inf
2: In FUN(newX[, i], ...) : no non-missing arguments to max; returning -Inf
3: In FUN(newX[, i], ...) : no non-missing arguments to max; returning -Inf...
I tried to create a reprex using the diamonds dataset, but that failed with a different error. The setup should be the same, though: in both my real data and the example, the data frame I pass to SMOTE has an imbalanced target that is a factor with values 0 or 1. I wanted to post anyway in case the errors are related, or in case I've just misunderstood how to use DMwR::SMOTE().
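To make "similar" concrete, here is a minimal sketch of the properties I mean, checked on a hypothetical toy data frame. The names toy, x1 and x2 are made up for illustration and are not my real data; only base R is used.

```r
# Hypothetical stand-in data: two numeric predictors and an imbalanced 0/1 factor target
set.seed(1)
toy <- data.frame(
  x1 = rnorm(100),
  x2 = runif(100),
  target_var = factor(c(rep(1, 10), rep(0, 90)))
)

class(toy)                         # container class (plain data.frame here)
is.factor(toy$target_var)          # target is a factor
levels(toy$target_var)             # levels are "0" and "1"
class_counts <- table(toy$target_var)
class_counts                       # counts per label, showing the imbalance
prop.table(class_counts)           # same counts as proportions
```

The same calls can be pointed at the real data frame and at my_diamonds to compare the two setups side by side.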
library(tidyverse)
# make a dummy target variable
diamonds$cut %>% table # 'Fair' is the smallest, use this as an example
my_diamonds <- diamonds %>% mutate(target_var = factor(ifelse(cut == "Fair", 1, 0)))
my_diamonds$target_var %>% table # imbalanced
# Goal: balanced target_var
library(DMwR) # also saw the library 'unbalanced' elsewhere online but looks like DMwR has a larger presence
balanced.diamonds <- SMOTE(target_var ~ carat + color,
my_diamonds, perc.over = 100)
If I run that block I get:
Error in matrix(if (is.null(value)) logical() else value, nrow = nr, dimnames = list(rn, :
length of 'dimnames' [2] not equal to array extent
How can I use SMOTE to create new samples of the minority class in my_diamonds, so that
my_diamonds$target_var %>% table
shows an equal number of both labels?
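To spell out what I mean by "equal", here is a rough sketch of the arithmetic, assuming my reading of the DMwR::SMOTE documentation is right: perc.over is the number of synthetic minority cases generated per 100 existing minority cases, and perc.under is the number of majority cases selected per 100 synthetic cases. The counts below are hypothetical, and the variable names are mine, not part of the package.

```r
# Hypothetical minority class size; in practice take it from table(data$target_var)
n_minority <- 10

# Assumed meaning of the two SMOTE arguments (my reading of the docs, not verified)
perc_over  <- 100  # synthetic minority cases per 100 existing minority cases
perc_under <- 200  # majority cases selected per 100 synthetic cases

n_synthetic    <- n_minority * perc_over / 100    # newly generated minority cases
minority_after <- n_minority + n_synthetic        # original plus synthetic
majority_after <- n_synthetic * perc_under / 100  # majority cases kept after under-sampling

c(minority = minority_after, majority = majority_after)
```

If that reading is correct, perc.over = 100 combined with perc.under = 200 should give equal class sizes, which is the outcome I am after.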
Any tips on the other error (the one from my real data) would be much appreciated too.