Logistic regression failed: which other model would suit this case?



I have a case study on banking datasets where the goal is to identify loan defaulters. I tried to use a logistic regression model to draw inferences from the data, but it only works with a binary dependent variable. Can anyone help me work out which model would suit this case study? I am sharing a glimpse of the dataset for reference.

The only categorical variable present is status, which takes the values A, B, C and D. I tried modelling status using cbind and got fitted values. After that, when I ran the prediction and AUC steps, I failed to get any output. It throws an error saying that only binary values (0<y<1) are accepted.

Please help me understand which other model suits this kind of question, and if possible please share an example so I can understand it better.


It is much easier to help you if you supply us with a reprex.

If you would like to use logistic regression for multi-class classification, then you could use the one-vs-all approach, e.g. A vs. B, C, D, so a yes/no variable for group A.
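To make the one-vs-all idea concrete, here is a minimal sketch in base R. It does not use the loan data: the `toy` data frame, its column values and the class means are made up for illustration, so treat it as a pattern rather than a recipe for this dataset.

```r
set.seed(42)
n <- 400

# Hypothetical stand-in for the loan data: a four-level status
# and two numeric predictors whose means depend on the status
status <- factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
toy <- data.frame(
  status  = status,
  amount  = rnorm(n, mean = c(100, 150, 200, 250)[as.integer(status)], sd = 40),
  balance = rnorm(n, mean = c(50, 40, 30, 20)[as.integer(status)], sd = 10)
)

# Fit one binary logistic regression per class: "this class" vs. "all others"
fits <- lapply(levels(toy$status), function(cls) {
  d <- toy
  d$is_cls <- as.integer(d$status == cls)  # 1 if the row belongs to cls, else 0
  glm(is_cls ~ amount + balance, family = binomial, data = d)
})
names(fits) <- levels(toy$status)

# Predicted probability from each model, one column per class
probs <- sapply(fits, predict, newdata = toy, type = "response")

# Assign each row to the class whose model gives the highest probability
predicted <- factor(
  colnames(probs)[max.col(probs, ties.method = "first")],
  levels = levels(toy$status)
)

table(predicted, actual = toy$status)
```

Each model on its own is an ordinary binary logistic regression, so the response is always 0/1 and `glm` has nothing to complain about. If you only care about one group, a single A-vs-rest model is enough.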


Hypothesis -
The Loans Division of the bank wants to know which accounts are likely to default on repaying their loans when the contract ends.

Execution problem -

I am getting the following error while running the confusion matrix: "Error: data and reference should be factors with the same levels." Please check and help me with this.

## packages used below
library(ggplot2)
library(scales)
library(tree)
library(ROCR)

loan <- read.csv("C:/Users/sao/Downloads/banking_data/Banking_Data/loan.txt", sep=';')
trans <- read.csv("C:/Users/sao/Downloads/banking_data/Banking_Data/trans.txt", sep=';') 
trans <- subset(trans, select = c(account_id,balance,k_symbol))

loanaccount <- merge(trans, loan, by="account_id")
loanaccount <- subset(loanaccount,select = -c(loan_id))

##checking missing value

##duplicated values

## create training and test data

##data split
datasplit <- sample(nrow(loanaccount), round(nrow(loanaccount)*0.8))
trainigdata <- loanaccount[datasplit,]
testdata <- loanaccount[-datasplit,]

## loan amount distribution and box plot

give_count <- stat_summary(
  fun.data = function(x) c(y = median(x) * 1.06, label = length(x)),
  geom = "text"
)

give_mean <- 
  stat_summary(fun.y = mean, colour = "darkgreen", geom = "point", 
               shape = 18, size = 3, show.legend = FALSE)

ggplot(trainigdata, aes(x=k_symbol, y=amount)) +
  geom_boxplot(outlier.colour="black", outlier.shape=16,outlier.size=2, notch=FALSE) +
  give_count +
  give_mean +
  scale_y_continuous(labels = comma) +
  labs(title="Loan Amount by status", x = "loan purpose", y = "Loan Amount \n")

## summary on training dataset

## t-test result
t.test(trainigdata$amount, testdata$amount)

t.test(trainigdata$amount, loanaccount$amount)

## making tree model from train data
train.loan <- tree(status~.-duration-date-payments-account_id, trainigdata)
plot(train.loan)
text(train.loan, pretty=0)

## tree prediction on the test data
treeloanprediction <- predict(train.loan, testdata, type = "class")

##logistic regression

lmloan <- glm(cbind(account_id,status)~.-payments,family="binomial", trainigdata)



predictlm <- predict(lmloan,newdata = testdata, type="response")
## confusion matrix, sensitivity, specificity
model_glm <- predict.glm(lmloan, testdata, type = "response", na.action = na.pass)
model_predict <- function(pred, t) ifelse (pred>t, TRUE, FALSE)
testdata <- testdata[complete.cases(testdata),]
caret::confusionMatrix(model_predict(model_glm, 0.5), reference = testdata, positive="TRUE")

## test set area under the curve

rocrpred <- prediction(model_glm, trainigdata$status)

pred <- prediction(predicttestdata,testdata$status)

as.numeric(performance(pred, "auc")@y.values)


First things first... A factor variable is a categorical variable. The levels of a factor variable are the possible categories that the value of the variable can fall into. E.g.:

> factor(sample(LETTERS, 10), levels = LETTERS)
 [1] M Y D B Z I G R T Q
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Note how there are additional levels other than the values the variable actually takes. Your error presumably has to do with comparing factor variables that have different levels.

You can get the levels of a factor variable, like so:

> my_factor_var = factor(sample(LETTERS, 10), levels = LETTERS)
> my_factor_var
 [1] I V H X M L Z W Y E
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
> levels(my_factor_var)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

You can set the levels of a factor variable like so:

> factor(my_factor_var, levels = unique(my_factor_var))
 [1] I V H X M L Z W Y E
Levels: I V H X M L Z W Y E

So, bottom line: look into factor variables and then check your confusion matrix again :)
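As an illustration of what "the same levels" means for a confusion matrix, here is a small sketch in base R. The vectors `truth` and `pred_raw` are made-up examples, not taken from the loan data.

```r
# Hypothetical true labels and predicted labels; the predictions never contain "D"
truth    <- factor(c("A", "B", "C", "D", "A", "B"))
pred_raw <- c("A", "B", "B", "C", "A", "A")

# Converted on its own, the prediction only gets the levels it happens to contain,
# which differ from the levels of the reference
levels(factor(pred_raw))
levels(truth)

# Re-create the prediction as a factor that shares the levels of the reference
pred <- factor(pred_raw, levels = levels(truth))
identical(levels(pred), levels(truth))

# The confusion table now has one row and one column per class
table(predicted = pred, actual = truth)
```

Two factors built this way should be what `caret::confusionMatrix` expects for its `data` and `reference` arguments. Note that `reference` has to be the vector of true labels (a single column), not a whole data frame.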


But in logistic regression only a binary (1, 0) response will work. Here the status variable is made up of four levels: A, B, C and D. So I am not sure about this model. Please, can you help me select a suitable model to identify loan defaulters in this case? Or will this four-level status variable work with logistic regression?


I cannot tell you which model to use; you will have to look at your data and the question you want to ask of it. From what you've written, it seems you need to look into multi-class classification. As I wrote earlier, you can try the one-vs-all approach with logistic regression.
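Once the status has been reduced to a yes/no label, the area under the curve also becomes well defined. Below is a rough sketch in base R that computes it with the rank (Mann-Whitney) formula instead of a package. The data, the `score` predictor and the choice of "A" as the group of interest are all assumptions made for the example.

```r
set.seed(1)

# Hypothetical data: a four-level status and a single numeric predictor
status <- factor(sample(c("A", "B", "C", "D"), 200, replace = TRUE))
score  <- rnorm(200, mean = ifelse(status == "A", 1, 0))

# One-vs-all label: is the account in group A or not?
is_a <- as.integer(status == "A")

# Ordinary binary logistic regression on the 0/1 label
fit  <- glm(is_a ~ score, family = binomial)
prob <- predict(fit, type = "response")

# AUC via the rank formula: the probability that a randomly chosen
# positive case gets a higher predicted probability than a negative one
n_pos <- sum(is_a == 1)
n_neg <- sum(is_a == 0)
auc <- (sum(rank(prob)[is_a == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc
```

For simplicity this evaluates on the same rows the model was fitted on; for an honest estimate, compute the predictions and the AUC on held-out test data.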