Hello, I have run into a big problem recently.
I built a logistic regression model to classify something. This is my code:

####### setting threshold value to convert dependent variable into 0 and 1(till this point, it is continuous form)
q = quantile(df$dependentvariable, 0.7)
df$exposure = ifelse(df$dependentvariable >= q, 1, 0) # if the value belongs to upper 30%, return 1 and if not, return 0

####### dividing dataset into train set and test set
part = caret::createDataPartition(df$dependentvariable, p = 0.7)
idx = part[[1]] # createDataPartition returns a list by default; take its first element
training = df[idx, ]
test = df[-idx, ] # 70% of data for train, 30% of data for test

####### model calibration
training_model = glm(exposure ~ independent1 + independent2, data = training, family = binomial) # fit the 0/1 variable created above
summary(training_model)

####### prediction
predict_model = predict(training_model, newdata = test, type = "response")

###### calculating model accuracy

tab = table(predict_model >= 0.7, test$exposure) # compare predicted class with the 0/1 label
accuracy = sum(diag(tab))/sum(tab)*100

The code ran, but I'm not sure I did it correctly, because the accuracy it calculated was only 70%. I hope the problem is not in the preprocessing or the data itself, so I want to check the following.
First, there are two 0.7 values in the code. Is it right to use the same threshold value in both places?
Second, did I code this correctly? Is the way I used the training data and the test data right?
Thank you.

I see three 0.7 values in your code, all very different.
First, you split your continuous dependent variable into the top 30% (category 1) and the bottom 70% (category 0).
Second, you define your training data as 70% of the total population.
Third, you categorize test data into category 1 if the predicted probability of being in category 1 is >= 0.7.

That last step is questionable, though I am no expert. You already defined category 1 as being in the top 30%, and the logistic fit finds the best coefficients for the independent variables to model that. Therefore, if predict_model is greater than 0.5, that sample is more likely to be in category 1 than in category 0. I would expect about 30% of your test data to have predict_model >= 0.5. By setting the threshold on predict_model to 0.7, you are choosing points that the model thinks are very likely to be in the top 30%, rather than simply "more likely than not", and you are probably undercounting that population.
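To make the difference between the three values concrete, here is a minimal sketch in base R. The data are simulated and purely hypothetical (the coefficients, sample size, and the names `label_quantile`, `train_fraction`, and `prob_cutoff` are my own assumptions, not from your code), and I use `sample()` in place of `caret` to keep it self-contained. It gives each 0.7 its own name and compares how many test rows are called category 1 at a cutoff of 0.5 versus 0.7.

```r
set.seed(1)

# Hypothetical simulated data standing in for the original df
n  <- 2000
df <- data.frame(independent1 = rnorm(n), independent2 = rnorm(n))
df$dependentvariable <- 1.5 * df$independent1 - df$independent2 + rnorm(n)

label_quantile <- 0.7  # top 30% of the outcome becomes category 1
train_fraction <- 0.7  # share of rows used for training
prob_cutoff    <- 0.5  # predicted probability needed to call category 1

q <- quantile(df$dependentvariable, label_quantile)
df$exposure <- as.integer(df$dependentvariable >= q)

idx      <- sample(n, size = round(train_fraction * n))
training <- df[idx, ]
test     <- df[-idx, ]

fit  <- glm(exposure ~ independent1 + independent2,
            data = training, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

# Share of test rows called category 1 under each cutoff,
# next to the share that actually is category 1
actual_share <- mean(test$exposure)
share_at_0.5 <- mean(prob >= prob_cutoff)
share_at_0.7 <- mean(prob >= 0.7)
print(c(actual = actual_share,
        cutoff_0.5 = share_at_0.5,
        cutoff_0.7 = share_at_0.7))

# Confusion matrix with both factors forced to levels 0 and 1,
# so the table is always 2 x 2 and diag() picks the matching cells
pred <- factor(as.integer(prob >= prob_cutoff), levels = c(0, 1))
tab  <- table(predicted = pred,
              actual = factor(test$exposure, levels = c(0, 1)))
accuracy <- sum(diag(tab)) / sum(tab) * 100
```

The share flagged at 0.7 can never exceed the share flagged at 0.5, which is the undercounting I mean. How large the gap is on your real data depends on how well the independent variables separate the two categories.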

First of all, thank you. I was not confused about the 0.7 in the middle, only about the first and the last ones.
So if I understood your advice correctly, those two values are not related to each other at all, right?
I had no special reason for using 0.7 as the threshold on predict_model; I just thought it had to match the one used for the dependent variable. But you are saying that in typical cases it is right to set the last value to 0.5, whatever the first threshold is? I think my case is a typical one.

Also, I apologize for asking an additional question. I am now calculating the confusion matrix while changing the first threshold (0.5, 0.6, 0.7, ...) and keeping the last one fixed at 0.5.
At first I had settled on a particular value for the first threshold, but with that value the model has very low accuracy. As I changed the value, I observed the accuracy increase. I can't ignore this, because the accuracy is too low if I don't change the threshold. Although I also consider other metrics such as sensitivity and precision, I thought at least a certain level of accuracy should be secured.
So the point is: is it right to change the threshold that divides the dependent variable according to the performance of the model?

Yes, I think that is right. The logistic model takes care of mapping the independent variables onto the correct incidence of 1 and 0. If the model predicts that a point is more likely to be a 1 than a 0, it should be counted as a 1. The threshold for "more likely" is 0.5.

No, you should not adjust the cut point to improve your result. Here is a quote from an article I found with a quick web search.

Nevertheless, all these approaches are preferable to performing several analyses and choosing that which gives the most convincing result. Use of this so called “optimal” cutpoint (usually that giving the minimum P value) runs a high risk of a spuriously significant result; the difference in the outcome variable between the groups will be overestimated, perhaps considerably; and the confidence interval will be too narrow. This strategy should never be used.

Here is a link to that brief article:

Within the last few days I came across a discussion by statisticians about how one should never convert a continuous variable into a dichotomous variable. Unfortunately, I cannot find that now. The basic argument is that you are throwing away information. Are you sure you want to change your continuous variable into values of 1 and 0?
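On the cut point question, one possible reason the accuracy rises as you move the first threshold is that the class balance itself changes. The sketch below uses a simulated, hypothetical outcome `y` (not your data) and no model at all: it computes the accuracy you would get by always predicting category 0 for several label quantiles.

```r
set.seed(2)

# Hypothetical continuous outcome; no predictors are involved here
y <- rnorm(1000)

label_quantiles <- c(0.5, 0.6, 0.7, 0.8, 0.9)

# Accuracy (in percent) of always predicting category 0
baseline_accuracy <- sapply(label_quantiles, function(p) {
  exposure <- as.integer(y >= quantile(y, p))
  mean(exposure == 0) * 100
})

print(data.frame(label_quantile = label_quantiles,
                 baseline_accuracy = baseline_accuracy))
```

The baseline tracks the quantile: labelling the top 30% as category 1 means a model that never predicts 1 is already right about 70% of the time. So accuracy figures obtained with different label thresholds are not directly comparable, and a higher number at a higher threshold does not by itself show a better model.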

I got it. I'll review the material.
But about your question: whether I want to or not, isn't converting the continuous dependent variable into categorical form simply necessary? I am solving a binary classification problem, and I decided to use binary logistic regression. As far as I know, the dependent variable in a logistic regression model has to be categorical. I'm not sure if this is what you meant; maybe I didn't understand what you meant by "throwing away information".

Converting your dependent variable into a categorical form is required if you are going to do logistic regression. My question is really why are you doing logistic regression instead of linear regression of the continuous dependent variable?
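For what it is worth, here is a rough sketch of what I have in mind, again on simulated, hypothetical data (the coefficients and the names `lin_fit` and `flagged` are my own assumptions). It fits the continuous variable directly with `lm()` and only applies a cutoff afterwards, if a yes/no flag is still needed.

```r
set.seed(3)

# Hypothetical simulated data standing in for the original df
n  <- 1000
df <- data.frame(independent1 = rnorm(n), independent2 = rnorm(n))
df$dependentvariable <- 1.5 * df$independent1 - df$independent2 + rnorm(n)

idx      <- sample(n, size = round(0.7 * n))
training <- df[idx, ]
test     <- df[-idx, ]

# Linear regression on the continuous outcome, no dichotomizing
lin_fit   <- lm(dependentvariable ~ independent1 + independent2,
                data = training)
predicted <- predict(lin_fit, newdata = test)

# Prediction error on the original scale of the outcome
rmse <- sqrt(mean((test$dependentvariable - predicted)^2))

# Optional: flag test rows whose predicted value reaches the
# top-30% cutoff, with the cutoff taken from the training data only
q       <- quantile(training$dependentvariable, 0.7)
flagged <- predicted >= q
```

Whether this suits your problem depends on what you need at the end. If the goal really is only to pick out points above a criterion, the flag in the last step gives you that while the model still uses all the information in the continuous variable.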

Ahh. As I told you, I see this problem as classification, not prediction.
The task is to extract the points that meet a certain criterion from among many points, and I chose the dependent variable as that criterion.
