Hi,
I have a question about how to handle a categorical variable in which several levels have very few observations.
I've built a logit model that tries to answer the question: does an internship during education affect the probability of being employed within 3/6/12 months after the end of education?
My independent variable is built from the question: Did you have any form of internship during your education?
And there are 6 possibilities:
- no
- optional internship
- compulsory internship
- volunteer work
5. yes, during practical vocational training as part of classes - work in line with the field of education
7. work incompatible with the field of education
As you can see in the screenshot, options 1, 3 and 5 are well represented, while the others have quite low counts.
I recoded this categorical variable into binary variables and built the following model:
logit1_3_mies = glm(stan_do_3_mies ~ praktyk_zajecia + staz_obow + wolontariat + staz_n_obow + prac_zgod + prac_niezgod, data = data, family = 'binomial')
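For comparison, the same kind of model can be fitted with a single factor instead of hand-made binary variables, which makes the base level explicit. This is only a minimal sketch on simulated data: the data frame `d`, the columns `employed` and `internship`, the level names and the proportions are all made up for illustration and are not my real survey data.

```r
# Simulated data for illustration only; names, levels and proportions are hypothetical.
set.seed(1)
lv <- c("no", "optional", "compulsory", "volunteer")
d <- data.frame(
  employed   = rbinom(500, 1, 0.6),
  internship = factor(sample(lv, 500, replace = TRUE,
                             prob = c(0.50, 0.05, 0.40, 0.05)),
                      levels = lv)
)

# Set "no" as the reference (base) level explicitly.
d$internship <- relevel(d$internship, ref = "no")

# One factor in the formula; R creates the binary contrasts itself.
fit <- glm(employed ~ internship, data = d, family = binomial)

table(d$internship)        # number of observations per level
summary(fit)$coefficients  # one row per non-reference level, each compared with "no"
```

Each coefficient here is the log-odds difference between that level and "no", which is the same interpretation as with separate binary variables when "no" is the omitted category.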
So the base level is the answer "no", but 3 of the binary variables have fewer than 100 observations. I think this affects the model a lot, but I am not sure if I should just delete them. What would the base level be then? All the missing categories? I ask because I want to check the difference between internship and no internship. It would be nice to keep optional internship, but its representation is so low.
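To make the base-level question concrete: if a binary variable is simply dropped from the formula, the respondents in that category stay in the data and are absorbed into the reference group, so the base level would no longer be purely "no". One possible alternative, shown here only as a sketch on simulated data, is to merge the sparse levels into a separate "other" level so that "no" stays the base. The threshold `min_n`, the level names and the proportions are assumptions for illustration.

```r
# Simulated data for illustration only; names, levels and proportions are hypothetical.
set.seed(2)
lv <- c("no", "optional", "compulsory", "volunteer")
x <- factor(sample(lv, 1000, replace = TRUE,
                   prob = c(0.50, 0.04, 0.42, 0.04)),
            levels = lv)
y <- rbinom(1000, 1, 0.6)

min_n  <- 100                       # hypothetical cut-off for a "sparse" level
counts <- table(x)
# Levels below the cut-off, never touching the intended base level "no".
sparse <- setdiff(names(counts)[counts < min_n], "no")

# Merge the sparse levels into one "other" level.
x2 <- as.character(x)
x2[x2 %in% sparse] <- "other"
x2 <- relevel(factor(x2), ref = "no")

fit2 <- glm(y ~ x2, family = binomial)
levels(x2)   # "no" comes first, so it is the base level
```

With this coding the comparison of each remaining level against "no" is kept, at the cost of no longer estimating a separate effect for the merged levels.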