I need your opinion on my code

Hey guys, I am new to coding and I wanted to use the quarantine to learn something new. I read about the interesting topic "dealing with imbalanced data" and wanted to try it myself. I want to take an imbalanced binary (two-class) dataset and "make it better" for ML. First I split my data into training and test data. I fitted a CART model to the training data. Then I calculated the confusion matrix for the train data to have a starting point. Then I ran techniques such as SMOTE, ROSE, and over- and under-sampling over my training data. Afterwards I fitted CART again to each newly balanced dataset, recalculated the confusion matrix, and checked whether the balanced accuracy changed.
Is this the correct way to handle such a problem?

Here is my code. If you have any suggestions on what to change, please tell me!

setwd("C:\\Users\\loren\\Dropbox\\Uni\\Präsentation\\Datensätze")
data <- read.csv("creditcard.csv")
head(data)
dplyr::glimpse(data) # glimpse() comes from dplyr
prop.table(table(data$Class))
table(data$Class)
summary(data)
str(data)

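# Split the data into train/test sets
library(caret)
set.seed(123) # arbitrary seed so the split is reproducible
index <- createDataPartition(factor(data$Class), p = 0.8, list = FALSE) # factor() so the split is stratified by class
train_data <- data[index, ]
test_data  <- data[-index, ]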


# Distribution of the data
table(train_data$Class)
prop.table(table(train_data$Class))
nrow(train_data)
table(test_data$Class)
prop.table(table(test_data$Class))
nrow(test_data)


# Confusion matrix for the test data
library(rpart)
library(rpart.plot) # provides rpart.plot()
library(caret)
#install.packages("e1071")
library(e1071)
fit_train <- rpart(Class ~ ., data = train_data, method = "class", control = rpart.control(cp = 0)) # rpart = recursive partitioning
prune_train <- prune(fit_train, cp = 0.0084)
rpart.plot(prune_train)
summary(fit_train)
rpart.plot(fit_train, extra=4)
printcp(fit_train)
plotcp(fit_train)


pred_fit_train <- predict(fit_train, newdata = test_data, type = "class")
table(test_data$Class, pred_fit_train)


# Accuracy / Specificity / Sensitivity / Precision / Recall on the test data
confusionMatrix(data = pred_fit_train ,
                reference = factor(test_data$Class),
                positive = "1")
# NoInfoRate < Accuracy
# Sensitivity --> share of actual 1s (the positive class) predicted as 1
# Specificity --> share of actual 0s predicted as 0

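# MCC
#install.packages("mccr")
library(mccr)
# mccr() expects two 0/1 vectors: actual values first, then predictions
mccr(test_data$Class, as.numeric(as.character(pred_fit_train)))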


# Convert Class to a factor in the train/test data
test_data$Class <- factor(test_data$Class)
train_data$Class <- factor(train_data$Class)


# Down-Sample
library(caret)
down_train <- downSample(x = train_data[, -ncol(train_data)],
                         y = train_data$Class)
table(down_train$Class)
prop.table(table(down_train$Class))
fit_down <- rpart(Class~., data = down_train, method = "class")
pred_down <- predict(fit_down, newdata = test_data, type = "class")
summary(pred_down)


# ConfusionMatrix Down-Sample
confusionMatrix(data = pred_down ,
                reference = factor(test_data$Class),
                positive = "1")


# Up-Sample
library(caret)
up_train <- upSample(x = train_data[, -ncol(train_data)],
                     y = train_data$Class)
table(up_train$Class)
prop.table(table(up_train$Class))
fit_up <- rpart(Class~., data = up_train, method = "class")
pred_up <- predict(fit_up, newdata = test_data, type = "class")
summary(pred_up)


# ConfusionMatrix Up-Sample
confusionMatrix(data = pred_up ,
                reference = factor(test_data$Class),
                positive = "1")


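# SMOTE
#install.packages("smotefamily")
library(smotefamily)
library(DMwR)
# Both packages export a SMOTE(); the formula interface is the DMwR one, so call it explicitly.
smote_train <- DMwR::SMOTE(Class ~ ., data = train_data) # the "." means "all other columns"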
table(smote_train$Class)
prop.table(table(smote_train$Class))
fit_smote <- rpart(Class~., data = smote_train, method = "class")
pred_smote <- predict(fit_smote, newdata = test_data, type = "class")
summary(pred_smote)


# ConfusionMatrix SMOTE
confusionMatrix(data = pred_smote ,
                reference = factor(test_data$Class),
                positive = "1")

# ROSE
library(ROSE)
rose_train <- ROSE(Class~., data = train_data)$data
table(rose_train$Class)
prop.table(table(rose_train$Class))
fit_rose <- rpart(Class~., data = rose_train, method = "class")
rpart.plot(fit_rose)
pred_rose <- predict(fit_rose, newdata = test_data, type = "class")
summary(pred_rose)
accuracy.meas(test_data$Class, pred_rose)


# ConfusionMatrix ROSE
confusionMatrix(data = pred_rose ,
                reference = factor(test_data$Class),
                positive = "1")
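
To compare the variants on a single number, balanced accuracy (the mean of sensitivity and specificity) can also be computed by hand. This is only a minimal sketch: `balanced_accuracy()` is a hypothetical helper, not a caret function, and the labels are made up rather than taken from the credit card data.

```r
# Hypothetical helper: balanced accuracy = mean of sensitivity and specificity.
balanced_accuracy <- function(actual, predicted, positive = "1") {
  actual    <- as.character(actual)     # works for numeric and factor labels
  predicted <- as.character(predicted)
  tp <- sum(actual == positive & predicted == positive)
  fn <- sum(actual == positive & predicted != positive)
  tn <- sum(actual != positive & predicted != positive)
  fp <- sum(actual != positive & predicted == positive)
  sensitivity <- tp / (tp + fn)  # share of actual positives that were found
  specificity <- tn / (tn + fp)  # share of actual negatives that were found
  (sensitivity + specificity) / 2
}

# Toy labels (made up): 8 negatives, 2 positives
actual    <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
predicted <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0)
balanced_accuracy(actual, predicted)
```

With the objects from the code above, the same call would look like balanced_accuracy(test_data$Class, pred_down), and likewise for pred_up, pred_smote and pred_rose.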

Hi, Lorenz. Thanks for the almost complete reprex; the only thing it's missing is the data object. Without it, I can only eyeball the code. I can't examine the objects, or see if the end results agree with the form I expect them to be in, and so on. You get the idea, I'm sure.

Others who have more recent experience in doing this may not need this help, but having something to cut and paste always removes the problem of reverse engineering.

The data doesn't have to be all of your data, or even your data at all, so long as it is representative. A standard base or other package dataset is ideal, because it's readily available.

To bring in the data

dput(my_data)

will output it and can be cut and pasted into the reprex from there.
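
As a small illustration of that round trip, here is a sketch using the built-in iris data as a stand-in for my_data; the object names small and rebuilt are just for this example.

```r
# Built-in dataset, so anyone can run this without downloading anything
small <- head(iris, 3)

# Prints R code that recreates `small`; this is what gets pasted into the reprex
dput(small)

# The printed text is itself R code: evaluating it rebuilds the object
rebuilt <- eval(parse(text = deparse(small)))
all.equal(small, rebuilt)
```

Whoever reads the post can then assign the pasted text to a name of their own and work with the same object.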

A shortcut for copying it directly to the clipboard is a snippet I stole from a source that I've unfortunately forgotten:

require(clipr)
#> Loading required package: clipr
#> Welcome to clipr. See ?write_clip for advisories on writing to the clipboard in R.
require(magrittr)
#> Loading required package: magrittr
require(stringr)
#> Loading required package: stringr

specimen <- function(x)
  deparse(x) %>%
  str_c(collapse = '') %>%
  str_replace_all('\\s+', ' ') %>%
  str_replace_all('\\s*([^,\\()]+ =) (c\\()', '\n  \\1\n    \\2')  %>%
  str_replace_all('(,) (class =)', '\\1\n  \\2') %>%
  write_clip(allow_non_interactive = TRUE)

Hey, thanks for the answer. I am not quite sure what you mean, but I guess it is hard to examine the code without having the actual dataset, right? I got it from here: https://www.kaggle.com/mlg-ulb/creditcardfraud

We don't need all 144MB to work the problem, only enough to illustrate it.

This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

So what I would do is simply create an enriched version with all the frauds and, say, 3-4 times as many other records.

So, if you think that will exercise your code, I'll make you an offer: you write the code to create it, and I'll run it against the entire dataset and put the toy dataset in a GitHub gist that can be read in with `readr::read_csv()`.

How does that sound?
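
For reference, one possible shape for that enrichment code, as a rough sketch only. It assumes the data frame has a 0/1 column called Class, as in the Kaggle file; `make_enriched()` and its `ratio` and `seed` arguments are hypothetical names, and the toy data frame is made up so the example runs without the 144MB download.

```r
# Hypothetical helper: keep every fraud row plus `ratio` times as many
# randomly chosen non-fraud rows.
make_enriched <- function(df, ratio = 4, seed = 42) {
  set.seed(seed)                        # reproducible sample
  fraud_rows <- which(df$Class == 1)
  other_rows <- which(df$Class == 0)
  n_other    <- min(length(other_rows), ratio * length(fraud_rows))
  # sample positions within other_rows, then map back to row numbers
  picked <- other_rows[sample.int(length(other_rows), n_other)]
  df[sort(c(fraud_rows, picked)), , drop = FALSE]
}

# Stand-in data with a similar kind of imbalance (made up, not the Kaggle file)
set.seed(1)
toy <- data.frame(Amount = round(runif(1000, 1, 500), 2),
                  Class  = c(rep(1, 5), rep(0, 995)))

enriched <- make_enriched(toy, ratio = 4)
table(enriched$Class)
```

On the real file the call would be make_enriched(read.csv("creditcard.csv")), and the result could be saved with write.csv(..., row.names = FALSE) under whatever file name suits.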

Hey, sorry, and thanks for putting effort into helping me, but I don't really get what you want me to do. You want a "slimmed-down" version of the dataset? Sorry, I am really new to coding.

Hi @Mingabua,

People here are willing to help if you present "your" data.
Because that file is very big, you can show just a part of the data like this:

library(readr) # read_csv()
library(dplyr) # %>% and select()

creditcard <- read_csv("creditcard.csv")

# first 10 columns, first 20 rows
partial_credit_card <- creditcard %>% select(1:10) %>% head(n = 20)

dput(partial_credit_card)

Then copy the console output of the last command into a new R script file and assign it to an object named partial_credit_card, or whatever name you want.

Then select all of it, copy it, and paste it here between two lines of three backticks (```), one above and one below.


I hope this helps,
regards,
Andrzej

@Andrzej Thanks, now I get it. The problem is that the "new" dataset doesn't have the "Class" variable, which is the relevant one, because I want to do binary classification.

Hi,
Your dataset does have the Class variable in it.

You can choose the columns you want with select(1, 2, 31) and the number of rows with head(n = number of rows you want).
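
A quick sketch of that, run on a made-up stand-in with the same 31 column names as creditcard.csv (Time, V1 to V28, Amount, Class), so it works without the real file; toy and partial are hypothetical names.

```r
library(dplyr)

# Stand-in for the real file: 31 columns named like creditcard.csv, all zeros
toy <- as.data.frame(matrix(0, nrow = 50, ncol = 31))
names(toy) <- c("Time", paste0("V", 1:28), "Amount", "Class")

# Columns 1, 2 and 31 (Class is the last column), first 20 rows
partial <- toy %>% select(1, 2, 31) %>% head(n = 20)
names(partial)
dim(partial)
```
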
regards,
Andrzej
