Categorical variables - not enough representation

I have a question about how to handle a categorical variable in which several levels have low representation.
I've built a logit model that tries to answer the question: does an internship during education affect the probability of being employed within 3/6/12 months after the end of education?

My independent variable is built from the question: Did you have any form of internship during your education?
And there are 6 possibilities:

  1. no
  2. optional internship
  3. compulsory internship
  4. volunteer work
  5. yes, during practical vocational training as part of classes
  6. work in line with the field of education / work incompatible with the field of education

As you can see in the screenshot, options 1, 3 and 5 are well represented and the others are quite low.
I recoded this categorical variable into binary variables and built the following model:
logit1_3_mies <- glm(stan_do_3_mies ~ praktyk_zajecia + staz_obow + wolontariat + staz_n_obow + prac_zgod + prac_niezgod, data = data, family = binomial)

So the base level is the answer "no", but there are 3 binary variables with fewer than 100 observations each. I think this affects the model a lot, but I am not sure whether I should just delete them. What would the base level be then? All the missing categories? I want to check the difference between internship and no internship. It would be nice to keep the optional internship, but its representation is so low.
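One option that avoids deleting rows, sketched here under assumptions: keep the answer as a single factor with "no" as the reference level, and merge levels that are too sparse into one pooled level. The data below are simulated, the column names `internship` and `employed` and the short level labels are made up for illustration, and the cutoff of 100 is arbitrary.

```r
set.seed(1)

# Simulated stand-in for the survey data (hypothetical names and proportions).
lv <- c("no", "optional", "compulsory", "volunteer", "vocational", "work")
n  <- 3000
d  <- data.frame(
  internship = factor(
    sample(lv, n, replace = TRUE, prob = c(0.45, 0.02, 0.30, 0.01, 0.20, 0.02)),
    levels = lv
  )
)
d$employed <- rbinom(n, 1, ifelse(d$internship == "no", 0.55, 0.65))

# Make "no" the reference level, so every coefficient reads
# as "this level vs. no internship".
d$internship <- relevel(d$internship, ref = "no")

# Pool levels with fewer than 100 respondents into one "other" level
# instead of deleting those rows (the cutoff is an assumption).
counts <- table(d$internship)
sparse <- setdiff(names(counts)[counts < 100], "no")
d$internship_pooled <- d$internship
levels(d$internship_pooled)[levels(d$internship_pooled) %in% sparse] <- "other"

fit <- glm(employed ~ internship_pooled, data = d, family = binomial)
table(d$internship_pooled)
summary(fit)$coefficients
```

Note that simply dropping the sparse dummies from the formula of the original model would not remove those respondents: they would silently join the reference group, so the base level would no longer mean only "no internship".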

Is there anyone who can help me with statistical knowledge?

Just go ahead and keep all the categories and see what happens. Then do a test to see if optional and compulsory are both different from zero.
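A sketch of such a test, assuming simulated data: the dummy names are reused from the model in the question, while the outcome `stan`, the proportions and the effect sizes are invented. It fits the model with and without the two internship dummies and compares the fits with a likelihood-ratio test. The null hypothesis is that both coefficients are zero, so a small p-value says at least one of them differs from zero.

```r
set.seed(42)
n <- 3000

# Each simulated respondent falls into one category; "no" is the omitted reference.
cat_id <- sample(1:4, n, replace = TRUE, prob = c(0.50, 0.03, 0.32, 0.15))
d <- data.frame(
  staz_n_obow     = as.integer(cat_id == 2),  # optional internship (sparse)
  staz_obow       = as.integer(cat_id == 3),  # compulsory internship
  praktyk_zajecia = as.integer(cat_id == 4)   # vocational training in classes
)

# Invented effect sizes on the log-odds scale, for illustration only.
eta <- 0.2 + 0.4 * d$staz_n_obow + 0.4 * d$staz_obow + 0.1 * d$praktyk_zajecia
d$stan <- rbinom(n, 1, plogis(eta))

full    <- glm(stan ~ staz_n_obow + staz_obow + praktyk_zajecia,
               data = d, family = binomial)
reduced <- update(full, . ~ . - staz_n_obow - staz_obow)

# Likelihood-ratio test of H0: both internship coefficients are zero.
lrt <- anova(reduced, full, test = "Chisq")
lrt

# Per-coefficient Wald tests, for comparison.
summary(full)$coefficients
```

The per-coefficient rows of `summary()` answer the question for each dummy separately; the sparse dummy will usually show the larger standard error.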

I fitted 12 models: for the 3-month, 6-month and 12-month periods, each on the whole data set and on higher education only, and each of those 6 models with and without control variables (current state, sex, education level, parents' education). The results are:

  1. For models with only this variable, all categories are statistically significant (including obligatory and compulsory).
  2. For models with control variables, only the obligatory internship is statistically different from 0.

But the question is whether such small levels bias my logit.

I am focusing more on the statistical side of this because I am interested only in the estimators and marginal effects, but the prediction accuracy of my model is around 60% without control variables and 75% with control variables.
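For reference, here is a minimal sketch of how both quantities can be computed by hand in base R. The data are simulated, `sex` stands in for the control variables, and `stan` is a made-up outcome name; only `staz_obow` is taken from the model above.

```r
set.seed(7)
n <- 2000
d <- data.frame(staz_obow = rbinom(n, 1, 0.3),   # dummy name from the model above
                sex       = rbinom(n, 1, 0.5))   # hypothetical control variable
d$stan <- rbinom(n, 1, plogis(0.1 + 0.5 * d$staz_obow + 0.3 * d$sex))

fit <- glm(stan ~ staz_obow + sex, data = d, family = binomial)

# Average marginal effect of the dummy: mean change in predicted probability
# when staz_obow is switched from 0 to 1 for every respondent.
p1  <- predict(fit, newdata = transform(d, staz_obow = 1), type = "response")
p0  <- predict(fit, newdata = transform(d, staz_obow = 0), type = "response")
ame <- mean(p1 - p0)

# In-sample accuracy at a 0.5 cutoff, next to the majority-class share.
pred     <- as.integer(fitted(fit) > 0.5)
accuracy <- mean(pred == d$stan)
baseline <- max(mean(d$stan), 1 - mean(d$stan))

c(ame = ame, accuracy = accuracy, baseline = baseline)
```

Accuracy is easier to judge next to the majority-class share: a model that always predicts the more common outcome already reaches that baseline.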

The question is about the levels with low representation: I feel that deleting them from the dataset loses some information. Maybe the ROSE library and duplication? But it feels weird to duplicate a group of 12-100 people into 3000 :stuck_out_tongue: I've seen a YouTube video where someone used the ROSE library to turn 1000 people into 50000, which feels more legit.
BTW, this is my bachelor thesis, so I want to do it right.

Small levels don't bias the results. Getting unbiased results depends on the sample size, not the values of individual variables.
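A small simulation can make the practical effect of a sparse level visible. This is only an illustration on invented data, not a statement about the thesis data: a rare level (about 1% of respondents) and a common level (about 30%) are given the same true log-odds effect of 0.5, and the model is refitted on 200 simulated data sets.

```r
set.seed(123)

# One simulated data set; returns the two estimated level coefficients.
sim_once <- function(n = 3000) {
  g <- sample(c("no", "rare", "common"), n, replace = TRUE,
              prob = c(0.69, 0.01, 0.30))
  g <- factor(g, levels = c("no", "rare", "common"))
  # Same true effect (0.5 on the log-odds scale) for both non-reference levels.
  y <- rbinom(n, 1, plogis(0.2 + 0.5 * (g != "no")))
  coef(glm(y ~ g, family = binomial))[c("grare", "gcommon")]
}

est <- t(replicate(200, sim_once()))

# Mean and spread of the estimates across the 200 replications.
rbind(mean = colMeans(est), sd = apply(est, 2, sd))
```

Comparing the two means with the true value 0.5 shows whether there is any systematic shift, and the `sd` row shows how much more the rare level's estimate varies from one data set to the next.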
