Logistic regression; independent categorical variables

Hi. I am doing a logistic regression. One of my independent variables V is categorical with values 0, 1, 2, 3. The model creates variables V1, V2, V3.

  1. I understand the parameter estimate β1 for V1 to mean that the odds increase by a factor of exp(β1). Is this correct?
  2. Does the parameter estimate β2 for V2 mean an additional increase in the odds by a factor of exp(β2), over and above exp(β1)? Is this correct?
  3. Does the parameter estimate β3 for V3 mean an additional increase in the odds by a factor of exp(β3), over and above exp(β2)? Is this correct?
  4. Suppose the parameter estimate β3 for V3 is significant as measured by its z value, but parameters β1 and β2 are not significant. Does it make sense to eliminate V1 and V2 from the regression, or is something lost by doing so?

Thank you

I am not an expert in statistics. Please treat my answers as an attempt to be helpful by some random person on the internet.

  1. Your statement matches my memory of logistic regression but it has been a few years since I dealt with it.
  2. I disagree with this. The coefficient for V2 represents the change in the log odds when the sample is in category 2, compared with the baseline case when the sample is in category 0. That is, each coefficient represents the effect of moving from the baseline category to category x, so exp(β2) is the odds ratio for category 2 relative to category 0, not relative to category 1.
  3. Same as answer 2.
  4. I am not sure what you mean by eliminating V1 and V2. If the categories are non-overlapping, each sample can only be in one category. If you are dealing with fruit, each one is either an apple or a pear or an orange. Do you mean to change the categories to orange/not-orange? I suspect I have misunderstood you.
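
To make answers 2 and 3 concrete, here is a small Python sketch. The counts are made up for illustration, and it assumes V is the only predictor with category 0 as the reference level. In that special case the fitted coefficient of each dummy is simply the difference in log odds between that category and the baseline, so no model-fitting library is needed.

```python
import math

# Hypothetical counts per category of V: (events, non-events). Not real data.
counts = {0: (20, 80), 1: (30, 70), 2: (45, 55), 3: (60, 40)}

def log_odds(events, non_events):
    """Log odds of the outcome within one category."""
    return math.log(events / non_events)

baseline = log_odds(*counts[0])

# With V as the only predictor and category 0 as the reference level,
# the maximum-likelihood coefficient of dummy Vk is the difference in
# log odds between category k and category 0.
beta = {k: log_odds(*counts[k]) - baseline for k in (1, 2, 3)}
odds_ratio = {k: math.exp(b) for k, b in beta.items()}

for k in (1, 2, 3):
    print(f"V{k}: beta = {beta[k]:.3f}, odds ratio vs V=0 = {odds_ratio[k]:.3f}")

# Comparing category 2 with category 1 needs the difference of coefficients.
or_2_vs_1 = math.exp(beta[2] - beta[1])
print(f"odds ratio, V=2 vs V=1: {or_2_vs_1:.3f}")
```

Each exp(βk) compares category k with category 0. A comparison between two non-baseline categories comes from the difference of their coefficients, not from either coefficient alone.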

Thank you.

What if I made V into four dummy variables and then eliminated V1 and V2, and possibly V0? Would that make sense?

Okay, I understand what you mean. I don't think there is any way to say, based only on z values, whether variables should be kept in the model. The best method of deciding might be to fit the alternative models on one set of data and see how they perform on an independent set. Could you do 5- or 10-fold cross-validation?
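
A rough sketch of that comparison in Python, under several assumptions: the data are simulated from made-up probabilities, V is the only predictor, and the function names (`full`, `collapsed`, `fit`, `cross_validate`) are mine. Because the only predictor is categorical, the fitted probabilities are just per-group event rates, so the "fit" step below uses those directly, with add-one smoothing (my addition) to keep rates away from 0 and 1. With more predictors you would fit a real logistic regression inside each fold instead.

```python
import math
import random

random.seed(0)

# Hypothetical event probabilities for V = 0, 1, 2, 3 (made up).
TRUE_P = [0.20, 0.22, 0.25, 0.60]
data = []
for _ in range(400):
    v = random.randrange(4)
    y = 1 if random.random() < TRUE_P[v] else 0
    data.append((v, y))

def full(v):
    return v              # keep all four levels (dummies V1, V2, V3)

def collapsed(v):
    return int(v == 3)    # keep only V3; levels 0, 1, 2 merge into the baseline

def fit(train, group):
    """Smoothed event rate per group, estimated on the training rows."""
    events, totals = {}, {}
    for v, y in train:
        g = group(v)
        events[g] = events.get(g, 0) + y
        totals[g] = totals.get(g, 0) + 1
    return {g: (events[g] + 1) / (totals[g] + 2) for g in totals}

def log_loss(test, rates, group):
    """Mean negative log-likelihood on held-out rows (lower is better)."""
    total = 0.0
    for v, y in test:
        p = rates.get(group(v), 0.5)  # 0.5 if a group was unseen in training
        total -= math.log(p if y == 1 else 1 - p)
    return total / len(test)

def cross_validate(rows, group, k=5):
    """Return the held-out log loss of each of the k folds."""
    rows = rows[:]
    random.Random(1).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    losses = []
    for i in range(k):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        losses.append(log_loss(folds[i], fit(train, group), group))
    return losses

for name, group in (("full", full), ("collapsed", collapsed)):
    losses = cross_validate(data, group)
    print(name, "mean held-out log loss:", round(sum(losses) / len(losses), 4))
```

The spread of the per-fold losses is as informative as the mean: if the two codings differ by less than the fold-to-fold variation, the data do not clearly favour either one.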

You might be able to simply run the combinations, but you can still suffer from order effects, and you would essentially be creating a very manual stepwise procedure (forward or backward).

I am not sure of your field of study, but many advise against the above-mentioned processes. Significance alone cannot be the only criterion for inclusion. I see there is a version of all/best subset regression for logistic regression. See here: https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html
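
To illustrate the all-subsets idea on the dummies from the original question, here is a toy Python sketch. It is not the procedure from the linked page and not any package's implementation; the counts are invented, V is assumed to be the only predictor, and ranking by AIC is my choice of criterion. Dropping a dummy is treated as merging that category into the baseline, which is what removing V1 or V2 from the model amounts to.

```python
import math
from itertools import combinations

# Hypothetical counts per category of V: (events, non-events). Not real data.
counts = {0: (20, 80), 1: (22, 78), 2: (25, 75), 3: (60, 40)}

def log_likelihood(kept):
    """Maximised log-likelihood when only the dummies in `kept` stay in.

    Categories whose dummy is dropped are merged into the baseline group.
    """
    groups = {}
    for level, (e, n) in counts.items():
        key = level if level in kept else 0
        ge, gn = groups.get(key, (0, 0))
        groups[key] = (ge + e, gn + n)
    ll = 0.0
    for e, n in groups.values():
        p = e / (e + n)  # fitted probability for the merged group
        ll += e * math.log(p) + n * math.log(1 - p)
    return ll

def aic(kept):
    n_params = 1 + len(kept)  # intercept plus one coefficient per kept dummy
    return 2 * n_params - 2 * log_likelihood(kept)

# All 8 subsets of the three dummies, ranked from lowest to highest AIC.
subsets = [set(c) for r in range(4) for c in combinations((1, 2, 3), r)]
ranked = sorted(subsets, key=aic)
for kept in ranked:
    label = ", ".join(f"V{k}" for k in sorted(kept)) or "intercept only"
    print(f"{label:20s} AIC = {aic(kept):.2f}")
```

Searching over subsets like this is still a form of model selection on the same data, so the cross-validation advice above applies to whichever subset comes out on top.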

Grey, thanks for alerting me to best subset regression.

In logistic regression with bestGLM, is it necessary that categorical variables with more than two factor levels be reduced to two factor levels? Why?

And if so, would I change the levels but leave the values as is? Thank you.

@fcas80, without a real example it is difficult to say what can and cannot be done. I am assuming the reason is that with a truly nominal variable (one whose levels are not directly comparable) the variable cannot be estimated as a single term. In addition, the presence or absence of some levels might have a far larger effect on predicting the outcome than others, so the importance of each level to the model can differ.

@fcas80 just to add a point and echo @FJCC's post: definitely make use of k-fold cross-validation, as that will also give you a sense of the range of these estimates. First and foremost, you should test the relevant statistical assumptions, but it is always good to perform cross-validation too.
