Handling Class Imbalance for a Large Dataset

Please, what is the best way to handle the class imbalance of a large dataset? I have a dataset of over 300k rows, whose target variable has imbalanced classes. I have tried using ROSE to balance out the training dataset, after an 80/20 split, but it keeps returning an empty table of classes. This is my code:

library(ROSE)
library(DMwR)
library(caret)

ind <- createDataPartition(heart_df$HeartDisease,p = 0.8,list = F)
train_heart <- heart_df[ind,]
test_heart <- heart_df[-ind,]
nrow(train_heart)
nrow(test_heart)

set.seed(111)
trainUp <- ROSE(HeartDisease ~.,data = train_heart)$heart_df
table(trainUp$HeartDisease)

Here is a screenshot of the data:

[Screenshot: heart]

There are more "No" than "Yes", and so I want to balance out the training data. But the table(trainUp$HeartDisease) code returns the following output in my console: < table of extent 0 > instead of the adjusted classes. Please, I will appreciate your help, thank you.

Hello, this is not quite a reprex, as it seems to rely both on unshared data (heart_df) and functions not declared by the listed library calls (createDataPartition). Could you review these elements?

I didn't find a way to upload the .csv file, so I shared a screenshot.

I'm sure you shared this image with the best intentions, but perhaps you didn't realise what it implies.
If someone wished to use example data to test code against, they would have to type it out from your screenshot...

This is very unlikely to happen, and so it reduces the likelihood you will receive the help you desire.
Therefore please see this guide on how to reprex data. Key to this is the use of either datapasta or dput() to share your data as code.
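As a rough sketch of the dput() route: the small heart_df below is a made-up stand-in (only HeartDisease comes from this thread; the BMI column and all values are assumptions for illustration). dput() prints R code that rebuilds the object, and that printed code is what gets pasted into a post.

```r
# Hypothetical stand-in for the real data; the BMI column and values are invented.
heart_df <- data.frame(
  HeartDisease = factor(c("No", "No", "Yes", "No")),
  BMI = c(16.6, 20.3, 26.6, 24.2)
)

# A few rows are usually enough for a reprex.
sample_rows <- head(heart_df, 3)

# Prints R code that recreates sample_rows; copy this output into the post.
dput(sample_rows)

# Round trip: capture the printed code and evaluate it, as a reader would.
code <- paste(capture.output(dput(sample_rows)), collapse = "\n")
rebuilt <- eval(parse(text = code))
str(rebuilt)
```

Anyone reading the thread can then paste that output after `heart_df <-` and run the rest of the code against it.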

If you don't mind, can I email you a sample of the data? Even with datapasta, the table is not looking nice here.

I'm going to make an educated guess that

 ROSE(HeartDisease ~.,data = train_heart)

runs OK and shows you output.

My guess is that you are then accessing $heart_df from the result, and no element with that name exists there.

Rather, I'd expect

trainUp <- ROSE(HeartDisease ~.,data = train_heart)$data

to pull out the relevant content.
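To show why a wrong element name produces < table of extent 0 > rather than an error, here is a small base R sketch. It does not call ROSE at all: rose_like is a plain list I made up to mimic a returned object that has a data element, and the toy train_heart values are invented.

```r
# Toy training data; values are invented for illustration.
train_heart <- data.frame(
  HeartDisease = factor(c("No", "No", "No", "Yes")),
  Age = c(50, 61, 47, 70)
)

# Hypothetical stand-in for a fitted object that stores its result under `data`.
rose_like <- list(method = "ROSE", data = train_heart)

# Checking the element names first shows what can actually be extracted.
names(rose_like)

# `$` with a name that does not exist returns NULL silently, no error.
wrong <- rose_like$heart_df
is.null(wrong)

# NULL$HeartDisease is also NULL, and table(NULL) is an empty table.
table(wrong$HeartDisease)

# Using the element that does exist gives the class counts.
right <- rose_like$data
table(right$HeartDisease)
```

Running names() on the real ROSE result in the same way should confirm which element holds the resampled data.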
