High accuracy - seems fishy

I am trying to build a supervised classification predictive model. The data consists of 13 qualitative variables. I built a predictor based on three columns, and now I am trying to apply logistic regression and an SVM against it. I am getting 99% accuracy, which doesn't seem right. Does anyone have any suggestions on what I might be doing wrong?

Thanks.

Welcome to the community, Shahna.

Your question is not informative enough for us to provide any help. Can you please turn this into a reproducible example? If you don't know how, here's a great link:


I have a data set that has following structure:
|Company|Product|Item|Response|Dispute|Efficiency|
|---|---|---|---|---|---|
|C1|P1|I1|No|Yes|Good|
|C2|P2|I2|Yes|No|Bad|
|C3|P3|I3|No|No|Bad|
|C4|P4|I4|Yes|Yes|Moderate|

I created the Efficiency column based on the Item, Response, and Dispute values,
then predicted Efficiency from the rest of the predictors using logistic regression.
The confusion matrix shows an accuracy as 99%.
This seems a little odd to me.

How balanced is the data? Sometimes a binary classification like logistic regression will yield high accuracy because the data is highly imbalanced between the two classes. For example, if the real world truth of your data is that one class occurs 99% of the time, your model could achieve 99% accuracy by always guessing the same thing.
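To make that concrete, here is a minimal base R sketch (with made-up labels, not the original data) showing how a classifier that always guesses the majority class scores 99% on a 99:1 split:

```r
# Hypothetical imbalanced labels at a 99:1 ratio (not the poster's data)
labels <- factor(c(rep("Good", 990), rep("Bad", 10)))

# A "model" that always predicts the majority class, learning nothing
predictions <- factor(rep("Good", 1000), levels = levels(labels))

# Accuracy looks excellent despite the model being useless
accuracy <- mean(predictions == labels)
accuracy  # 0.99
```

This is why raw accuracy alone can be misleading on imbalanced data; the per-class breakdown in a confusion matrix tells you much more.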

Yarnabrina is correct in stating that you need to provide some sort of reproducible example if you want the community to be able to give anything more than general statements in response to your question.


So here is the result of running the SVM algorithm:

library(e1071)
svm1 <- svm(Efficiency ~ ., data = train,
            type = "C-classification", kernel = "radial",
            gamma = 0.1, cost = 10)
summary(svm1)
#--------
Call:
svm(formula = Efficiency ~ ., data = train, type = "C-classification",
    kernel = "radial", gamma = 0.1, cost = 10)

Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 10
gamma: 0.1

Number of Support Vectors: 61

( 39 21 1 )

Number of Classes: 3

Levels:
Bad Good Moderate
#------------
prediction <- predict(svm1, train)
xtab <- table(train$Efficiency, prediction)
xtab

#------------------
          prediction
           Bad Good Moderate
  Bad        1    0        0
  Good       0   48        0
  Moderate   0    0       21
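One thing worth noting in the snippet above: `predict(svm1, train)` scores the model on the same rows it was trained on, so a near-perfect confusion matrix is expected, especially since Efficiency was derived from the other columns. Below is a sketch of a held-out evaluation; the data is simulated in the same shape as the poster's, and the rule generating Efficiency is an assumption for illustration:

```r
library(e1071)

# Hypothetical data shaped like the poster's; Efficiency is derived
# deterministically from the predictors here, which mimics the leakage
set.seed(42)
n <- 200
dat <- data.frame(
  Response = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  Dispute  = factor(sample(c("Yes", "No"), n, replace = TRUE))
)
dat$Efficiency <- factor(ifelse(dat$Response == "Yes" & dat$Dispute == "Yes",
                                "Moderate",
                                ifelse(dat$Dispute == "Yes", "Good", "Bad")))

# Hold out 30% of the rows for testing instead of predicting on train
idx   <- sample(seq_len(n), size = 0.7 * n)
train <- dat[idx, ]
test  <- dat[-idx, ]

fit  <- svm(Efficiency ~ ., data = train, type = "C-classification",
            kernel = "radial", gamma = 0.1, cost = 10)
pred <- predict(fit, test)
table(test$Efficiency, pred)
```

Note that even with a proper split, if Efficiency really is a function of the predictors, the model can still score near 100% on the test set; in that case the leakage itself, not the evaluation, is the problem to fix.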

Thank you for your reply, Jason.
What would you suggest as the right approach to dealing with highly imbalanced data?
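For reference, one common option with e1071 is to weight the rarer class via the `class.weights` argument of `svm()`, so that misclassifying a rare case costs more; other options include up- or down-sampling the training set, or reporting per-class precision/recall instead of raw accuracy. A hedged sketch on simulated two-class data (not the poster's):

```r
library(e1071)

# Hypothetical imbalanced two-class data, roughly 95:5
set.seed(1)
n <- 500
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- factor(ifelse(runif(n) < 0.95, "Good", "Bad"))

# Weight classes inversely to their frequency so the rare class matters
w <- table(y)
fit <- svm(y ~ ., data = cbind(x, y), type = "C-classification",
           kernel = "radial",
           class.weights = c(Bad  = as.numeric(w["Good"] / w["Bad"]),
                             Good = 1))
```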

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.