Too many levels in the categorical variable in a GLM

I have 187 observations, and the categorical variable is a predictor. My response variable is CPUE (catch per unit of effort). My goal is to find out which of these variables (temperature, chlorophyll, depth, and bottom type) are most important for the catch of a specific species that I am analyzing. But I am struggling with this result, in which the null model appears to be the most parsimonious. So I was wondering whether there is a problem with adding a variable with so many levels to the model, and why this (+) symbol always appears in the output whenever the categorical variable is included. What does it mean? It also seems strange to me that no model with the categorical variable was ever selected. This intuition is also based on a regression tree that I ran on the same data, in which the most explanatory variable was precisely the categorical variable that does not seem to have any relevance in the GLM.

The categorical variable in the model is a factor with 22 levels. I have searched and seen the suggestion to transform the levels into dummy variables, but I don't think that is the way out, since I would have to create 21 more columns and add them to the model... Note: I have already checked, and the variable is a factor, not numeric.

mod0 <- glm(nCPUE ~ 1, data = bonaci, family=gaussian) #null model
mod1 <- glm(nCPUE ~ Depth, data = bonaci, family=gaussian) #depth
mod2 <- glm(nCPUE ~ Chlorophyll, data = bonaci, family=gaussian) #chlorophyll
mod3 <- glm(nCPUE ~ BottomType, data = bonaci, family=gaussian) #bottom type
mod4 <- glm(nCPUE ~ SST, data = bonaci, family=gaussian) #temperature
mod5 <- glm(nCPUE ~ Depth + Chlorophyll, data = bonaci, family=gaussian) 
mod6 <- glm(nCPUE ~ Depth + BottomType, data = bonaci, family=gaussian)
mod7 <- glm(nCPUE ~ Depth + SST, data = bonaci, family=gaussian)
mod8 <- glm(nCPUE ~ Chlorophyll + BottomType, data = bonaci, family=gaussian)
mod9 <- glm(nCPUE ~ Chlorophyll + SST, data = bonaci, family=gaussian)
mod10 <- glm(nCPUE ~ BottomType + SST, data = bonaci, family=gaussian)
mod11 <- glm(nCPUE ~ Depth * Chlorophyll, data = bonaci, family=gaussian)
mod12 <- glm(nCPUE ~ Depth * BottomType, data = bonaci, family=gaussian, na.action = "na.fail")
mod13 <- glm(nCPUE ~ Depth * SST, data = bonaci, family=gaussian)
mod14 <- glm(nCPUE ~ Chlorophyll * BottomType, data = bonaci, family=gaussian, na.action = "na.fail")
mod15 <- glm(nCPUE ~ Chlorophyll * SST, data = bonaci, family=gaussian)
mod16 <- glm(nCPUE ~ BottomType * SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod17 <- glm(nCPUE ~ Depth + Chlorophyll + BottomType, data = bonaci, family=gaussian, na.action = "na.fail")
mod18 <- glm(nCPUE ~ Depth + Chlorophyll + SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod19 <- glm(nCPUE ~ Chlorophyll + BottomType + SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod20 <- glm(nCPUE ~ Depth * Chlorophyll * BottomType, data = bonaci, family=gaussian, na.action = "na.fail")
mod21 <- glm(nCPUE ~ Depth * Chlorophyll * SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod22 <- glm(nCPUE ~ Chlorophyll * BottomType * SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod23 <- glm(nCPUE ~ Depth + Chlorophyll + BottomType + SST, data = bonaci, family=gaussian, na.action = "na.fail")
mod24 <- glm(nCPUE ~ Depth * Chlorophyll * BottomType * SST, data = bonaci, family=gaussian, na.action = "na.fail")

library(MuMIn)
out.put <-model.sel (mod0, mod1, mod2, mod3, mod4, mod5, mod6, mod7, mod8, mod9, mod10, mod11, mod12, mod13, mod14, mod15, mod16, mod17, mod18, mod19, mod20, mod21, mod22, mod23, mod24)

out.put

Seems like this is the kind of thing where subject area expertise is important, down to having a reasonable idea of how different elements might relate, so that you can debug even your data if need be (let alone any models) ...

Your data at only 187 observations seems like it would be small enough to share, perhaps dput() it over here and people can play. (or host it as a project on github and send a link).

glm() will automatically turn factors into dummy variables (one fewer than the number of factor levels).
I think the big problem to avoid is a data leak: a factor with too many levels or very rare categories can approach one, as in my example m6.


library(MuMIn)
library(tidyverse)
irisidc <- mutate(iris,row_id = factor(row_number()))

m1 <- glm(Petal.Length ~ Petal.Width, data = irisidc)
m2 <- glm(Petal.Length ~ Petal.Width + Sepal.Length, data = irisidc)
m3 <- glm(Petal.Length ~ Petal.Width * Sepal.Length, data = irisidc)
m4 <- glm(Petal.Length ~ Species, data = irisidc)
m5 <- glm(Petal.Length ~ Petal.Width * Sepal.Length + Species, data = irisidc)
# m6: a factor with one level per row, i.e. the "leak" case
m6 <- glm(Petal.Length ~ row_id, data = irisidc)
MuMIn::model.sel(m1,m2,m3,m4,m5,m6)
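
To make those two points concrete, here is a minimal sketch in base R. The names X, n_dummies, iris_id and m_sat are only mine, for illustration. It checks that a factor contributes one column fewer than its number of levels, and that a factor with one level per row leaves no residual degrees of freedom at all.

```r
# A factor with k levels contributes k - 1 dummy columns (plus the intercept).
X <- model.matrix(~ Species, data = iris)
colnames(X)
n_dummies <- ncol(X) - 1          # columns excluding the intercept

# One level per row: as many coefficients as observations (a saturated model).
iris_id <- transform(iris, row_id = factor(seq_len(nrow(iris))))
m_sat <- glm(Petal.Length ~ row_id, data = iris_id)
df.residual(m_sat)                # no degrees of freedom left for the residuals
```

With nothing left over for the residuals, the model just memorises the response, which is why I call it a leak.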

Thank you for the clues... It's very strange to me how this categorical variable is shown in the analysis, and that the null model is always selected. I tried to add the categories as dummy variables, but the model gets too heavy; it doesn't even run.

dput(bonaci)
structure(list(nCPUE = c(0.06, -0.48, 0.97, 1.03, 0.22, 
0.95, 1.48, 0.3, -0.03, -0.55, 1.56, -0.54, 0.32, 0, 0.35, 0.22, 
0.04, 0.57, -0.54, 0.93, 1.2, 0.66, 0.8, 1.26, 0.38, 0.54, -0.08, 
0.26, 1.26, 0.43, 0.07, 0.22, 0.69, 0.91, -0.36, 0.92, 0.92, 
1.14, 0.94, 1.3, 1.05, 0.77, 0.36, -0.23, 0.44, 0.87, 1.04, 0.56, 
0.31, 0.97, 0.8, -0.33, 0.88, 0.44, -0.01, 0.93, -0.65, 0.47, 
1.14, -0.15, 1.22, -0.14, 0.31, 0.47, 0.74, 0.4, -0.22, 1.04, 
0.6, 0.59, 0.72, 0.72, 0.34, 1.3, 0.98, 0.89, 0.48, 0.27, 1.15, 
1.12, -0.12, -0.83, 0.87, 0.6, -0.68, 1.13, 0.85, -0.14, -0.18, 
0, 0.6, -0.2, 0.82, 0.62, -0.08, 0.63, 0.99, 1.08, 1.02, 1.05, 
0.96, -0.92, -0.62, 0.43, 0.76, -0.61, 0.05, 1.07, 1.49, -0.54, 
0.25, 1.15, 0.6, 0.94, 1.03, 0.89, 0.99, -0.53, 0.25, 0.73, 0.59, 
1.63, 1.38, 1.2, 1, 0.51, 0.92, -0.19, -0.94, 0.79, 0.36, 0.2, 
0.3, 1.61, 0.23, 0.28, 0.47, 0.95, 1.8, 0.36, 0.13, 0.99, 0.43, 
0.12, 0.76, 0.09, 1.08, -0.67, -0.26, 1.27, 0.34, -0.02, 1.43, 
0, 1.55, 0.86, 1.37, 0.7, 1.1, 0.08, 0.68, 1.03, 0.59, -0.09, 
1.9, -0.31, 0.49, 0.04, 0.18, 0.09, 1.9, -0.6, 1.28, -0.22, -0.71, 
1.03, 0.24, -1.14, 1.48, 1.19, 0.79, -0.36, 0.99, -0.06, -0.71, 
0.72, 1.22), Depth= c(-20.83, -3383.19, -20.69, 
-79.25, -3992.5, -235.75, -187.47, -108.71, -66.91, -1067.26, 
-48.55, -34, -100.04, -46.64, -49.68, -2660.88, -119.21, -71.04, 
-40.01, -131.63, -51.76, -59.47, -54.75, -46.54, -57.57, -51.19, 
-37.41, -74.97, -6.7, -36.28, -49.54, -47.53, -36.52, -40.6, 
-53.4, -60.21, -52.53, -40.27, -35.44, -40.47, -65.76, -3772.34, 
-35.39, -52.48, -59.64, -50.29, -40.74, -63.28, -28.54, -45.45, 
-52.31, -53.08, -35.14, -51.53, -28.52, -59.67, -71.01, -20.66, 
-38.56, -45.02, -27.9, -29.44, -38.96, -50.52, -30.33, -36.02, 
-53.96, -62.95, -1407.4, -16.44, -45, -36.9, -39.1, -40.51, -63.61, 
-65.97, -17.43, -43.91, -55.16, -69.8, -19, -25.8, -11.62, -16.81, 
-23.88, -69.58, -96.43, -17.28, -13.16, -46.32, -55.2, -169.96, 
-6.12, -8.6, -21.19, -25.37, -42.45, -45.08, -52.31, -51.36, 
-50.41, -4.78, -4.62, -7.5, -19.31, -21.43, -22.35, -28.14, -36.55, 
-49.89, -46.06, -182.48, -58.91, -10.88, -6.11, -6.81, -15.9, 
-42.29, -41.92, -50.62, -45.91, -6.37, -7.65, -19.67, -21.28, 
-17.8, -25.46, -43.22, -37.95, -37.51, -8.07, -9.1, -14.26, -20.59, 
-27.55, -23.38, -25.09, -36.54, -55.55, -223.49, -11.33, -19.5, 
-26.54, -28.88, -31.27, -32.85, -47.07, -40.36, -7.33, -16.28, 
-23.11, -11, -19.43, -17.59, -24.32, -40.1, -29.59, -2247.71, 
-2712.64, -60.77, -154.09, -119.3, -37.86, -27.31, -37.99, -44.56, 
-32.51, -27.34, -18.78, -38.82, -413.07, -38.45, -40.61, -2.13, 
-36.77, -76.95, -34.1, -31.5, -40.77, -29.56, -33.08, -34.51, 
-79.07, -37.21, -155.36, -52.22, -18.62), Chlorophyll = c(0.8, 
0.07, 0.64, 0.51, 0.13, 0.05, 0.5, 0.14, 0.52, 0.6, 0.14, 0.11, 
0.1, 0.86, 0.77, 0.11, 0.42, 0.12, 0.12, 0.11, 0.13, 0.14, 0.12, 
0.1, 0.09, 0.1, 0.14, 0.1, 0.1, 0.14, 0.15, 0.15, 0.14, 0.14, 
0.14, 0.15, 0.15, 0.15, 0.15, 0.13, 0.11, 0.1, 0.13, 0.14, 0.16, 
0.15, 0.16, 0.11, 0.18, 0.14, 0.14, 0.15, 0.16, 0.16, 0.16, 0.11, 
0.1, 0.32, 0.15, 0.17, 0.17, 0.14, 0.16, 0.2, 0.18, 0.15, 0.11, 
0.1, 0.11, 0.9, 0.27, 0.18, 0.18, 0.15, 0.1, 0.1, 0.41, 0.22, 
0.13, 0.1, 0.31, 0.3, 0.39, 0.33, 0.27, 0.09, 0.09, 1.06, 0.37, 
0.15, 0.12, 0.08, 0.37, 0.45, 0.27, 0.24, 0.19, 0.15, 0.1, 0.09, 
0.12, 0.89, 0.51, 0.39, 0.25, 0.28, 0.26, 0.24, 0.19, 0.1, 0.11, 
0.09, 0.08, -198.73, 0.73, 0.51, 0.26, 0.1, 0.09, 0.08, 0.08, 
1.5, 0.52, 0.27, 0.25, 0.24, 0.17, 0.11, 0.13, 0.11, 0.54, 0.54, 
0.72, 0.3, 0.24, 0.23, 0.21, 0.14, 0.11, 0.1, 0.39, 0.31, 0.27, 
0.24, 0.2, 0.18, 0.14, 0.13, 0.44, 0.29, 0.25, 0.43, 0.28, 0.3, 
0.24, 0.17, 0.21, 0.1, 0.08, 0.09, 0.16, 0.08, 0.09, 0.19, 0.07, 
0.08, 0.16, 0.19, 0.51, 0.18, 0.17, 0.31, 0.24, 0.18, 0.41, 0.12, 
0.14, 0.11, 0.09, 0.1, 0.08, 0.07, 0.07, 0.15, 0.16, 0.15, 0.69
), BottomType = structure(c(18L, 14L, 19L, 5L, 14L, 14L, 
9L, 18L, 7L, 7L, 18L, 7L, 18L, 9L, 18L, 7L, 18L, 7L, 18L, 7L, 
14L, 4L, 14L, 14L, 14L, 18L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 
14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 19L, 10L, 14L, 11L, 
14L, 14L, 14L, 14L, 18L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 4L, 
14L, 14L, 1L, 14L, 1L, 14L, 10L, 14L, 18L, 14L, 14L, 14L, 14L, 
14L, 14L, 14L, 18L, 14L, 5L, 14L, 14L, 10L, 14L, 14L, 14L, 14L, 
14L, 14L, 14L, 14L, 14L, 5L, 14L, 15L, 14L, 5L, 4L, 15L, 17L, 
18L, 14L, 14L, 14L, 14L, 14L, 16L, 14L, 14L, 14L, 14L, 14L, 16L, 
15L, 14L, 14L, 14L, 14L, 18L, 14L, 14L, 2L, 14L, 14L, 5L, 3L, 
14L, 14L, 14L, 10L, 14L, 11L, 18L, 18L, 14L, 15L, 14L, 15L, 14L, 
15L, 6L, 21L, 14L, 9L, 19L, 3L, 18L, 2L, 14L, 14L, 14L, 5L, 22L, 
13L, 13L, 14L, 12L, 5L, 8L, 14L, 14L, 14L, 14L, 14L, 20L, 18L, 
15L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 
18L, 18L, 6L, 7L, 7L, 18L, 9L, 18L, 18L, 7L), .Label = c("CM", 
"CM+CSG+RB", "CM+RB", "CSG", "CSG+RB", "CSG+RB+SS+SR+TM", "ER", 
"ER+MSM+RB+SR", "ER+SR", "MSM", "MSM+RB", "MSM+RB+SR", "MSM+SR", 
"RB", "RB+SR", "RB+SS", "RB+SS+SR", "SR", "SS", "SS+SR", "SS+TM", 
"TM"), class = "factor"), SST = c(23.22, 25.52, 23.35, 
24.94, 25.8, 25.5, 25.32, 26.02, 25.32, 25.24, 26.01, 26.03, 
25.83, 24.33, 25.1, 26.13, 25.13, 25.85, 26.1, 25.82, 25.47, 
25.53, 25.71, 26, 25.97, 25.95, 25.77, 25.98, 26.08, 25.78, 25.75, 
25.86, 25.79, 25.69, 25.79, 25.85, 25.83, 25.73, 25.88, 25.84, 
26.08, 26.33, 25.61, 25.75, 25.9, 25.74, 25.72, 26.06, 25.78, 
25.74, 25.77, 25.89, 25.74, 25.74, 25.68, 25.82, 25.96, 25.77, 
25.78, 25.79, 25.81, 25.72, 25.74, 25.85, 25.88, 25.83, 25.89, 
26, 26.09, 25.74, 25.82, 25.75, 25.9, 25.93, 25.85, 25.98, 25.79, 
25.9, 25.89, 25.89, 25.82, 25.74, 25.87, 25.74, 25.86, 25.98, 
26.42, 25.69, 25.65, 25.98, 25.93, 26.42, 25.7, 25.69, 25.61, 
25.7, 25.9, 25.86, 25.88, 26.16, 26.26, 25.9, 25.86, 25.63, 25.63, 
25.68, 25.73, 25.82, 25.84, 26.17, 26.31, 26.51, 26.54, 25.83, 
25.87, 25.83, 25.63, 26.1, 26.29, 26.31, 26.4, 25.77, 25.78, 
25.6, 25.6, 25.55, 25.97, 26.34, 26.3, 26.36, 25.75, 25.62, 25.73, 
25.66, 25.64, 25.62, 25.68, 26.15, 26.33, 26.39, 25.66, 25.68, 
25.68, 25.64, 25.63, 25.77, 26.1, 26.2, 25.77, 25.66, 25.62, 
25.71, 25.69, 25.67, 25.76, 26.04, 25.82, 26.63, 26.59, 26.58, 
26.07, 26.59, 26.58, 25.93, 26.64, 26.56, 26.05, 26, 25.76, 26.02, 
26.06, 25.94, 25.94, 26.01, 25.82, 26.17, 26.3, 26.43, 26.65, 
26.59, 26.78, 26.72, 26.75, 26.24, 26.1, 26.26, 26.06)), row.names = c(NA, 
187L), class = "data.frame")

The data seem very noisy, and there doesn't seem to be a strong signal for nCPUE above the noise.
You can try the following to eke out a slightly better signal; doing this gave me a couple of models better than m0, but hardly by much.

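bonaci <- mutate(bonaci,
                 # cap depths deeper than 100 at -100
                 Depth2 = case_when(Depth <= -100 ~ -100,
                                    TRUE ~ Depth),
                 # set negative chlorophyll values to 0
                 Chlorophyll2 = case_when(Chlorophyll <= 0 ~ 0,
                                          TRUE ~ Chlorophyll),
                 # raise temperatures below 25 to 25
                 SST2 = case_when(SST <= 25 ~ 25,
                                  TRUE ~ SST),
                 # lump bottom types with fewer than 5 observations together
                 BottomType2 = forcats::fct_lump_min(BottomType, min = 5)
                 )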

I'm sorry, but what exactly does this code you wrote do? With it, the model apparently made a lot more sense to me, and the categorical variable could be included in some way. Thanks!

I basically cap the outliers in the numeric variables at a limit (the rows are kept, not removed) and lump the rare levels of the factor together.
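
As a tiny illustration of the capping step, here it is on toy numbers rather than the real data. pmax() is just a base-R equivalent of the case_when() calls, and the vector names are made up for this example.

```r
# Toy depths: one extreme value among ordinary ones.
depth <- c(-20, -3383, -79)

# Anything below -100 is replaced by -100; other values are left alone.
depth_capped <- pmax(depth, -100)
depth_capped

# Same idea for a variable that should never be negative.
chl <- c(0.8, -198.73, 0.1)
chl_capped <- pmax(chl, 0)
chl_capped
```

The number of rows does not change; only the extreme values are pulled in to the limit.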

You don't have too many levels, and this is expected behavior for the model selection function. I had to look at the help documentation for model.sel and then for model.selection.object. It says this about model terms

For numeric covariates these columns hold coefficent value, for factors their presence in the model. If the term is not present in a model, value is NA.

Thus, the + indicator in your table just means the variable is included in the model. It doesn't show all the coefficients since there would be 21 of them.
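
If you want to see the coefficients that the + stands for, you can look at the fitted model directly. A minimal sketch, using iris as a stand-in for your data (so the model names m0 and m1 here are not your mod0 and mod1):

```r
# Null model and a model with a single factor predictor.
m0 <- glm(Petal.Length ~ 1, data = iris)
m1 <- glm(Petal.Length ~ Species, data = iris)

# One intercept plus one coefficient per non-reference level of the factor.
coef(m1)
n_coef <- length(coef(m1))

# Test the factor as a whole instead of reading each dummy coefficient.
tab <- anova(m0, m1, test = "F")
tab
```

For a factor it is usually more informative to test the term as a whole, as the anova() call does, than to read the individual dummy coefficients one by one.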

In particular, this part:

BottomType2 = forcats::fct_lump_min(BottomType,min = 5)

Does it mean that it kept only 5 categories and grouped the rest into "Other"?

There have to be a minimum of 5 occurrences of that category or they are grouped up with the other small ones.
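
A small sketch on a made-up factor (not your data; the names f and lumped are hypothetical) shows the effect, assuming forcats is installed:

```r
library(forcats)

# Toy factor: RB appears 6 times, SR 5 times, CM twice, TM once.
f <- factor(c(rep("RB", 6), rep("SR", 5), rep("CM", 2), "TM"))

# Levels with fewer than 5 observations are merged into "Other".
lumped <- fct_lump_min(f, min = 5)
table(lumped)
```

RB and SR are kept because each has at least 5 observations; CM and TM (3 observations in total) end up in "Other". So min = 5 is a threshold on the count per level, not the number of levels to keep.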

You probably have too many levels of BottomType (as already alluded to). What if you deconstructed it into a series of binary variables for the various components? E.g. one variable for CM, one for CSG, ... etc. Then you would have about 8 variables rather than the 21 that are constructed through the dummy variables...

# one logical column per component (CM, CSG, RB, ...): TRUE where it occurs
sapply(unique(unlist(strsplit(levels(bonaci$BottomType), "+", fixed = TRUE))), 
        function(x) {
            grepl(x, bonaci$BottomType, fixed = TRUE)
        })
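
Here is the same idea on a toy factor, as a sketch only: bt, parts and ind are made-up names, and the matching is done on whole components after splitting rather than with grepl(), which is a slightly stricter variant in case one code were ever a substring of another.

```r
# Toy bottom types built from three components: CM, RB, SR.
bt <- factor(c("CM", "CM+RB", "RB", "RB+SR", "SR"))

# All distinct components that occur in the levels.
parts <- unique(unlist(strsplit(levels(bt), "+", fixed = TRUE)))

# Split each observation once, then test for each component by exact match.
split_obs <- strsplit(as.character(bt), "+", fixed = TRUE)
ind <- sapply(parts, function(p) {
  vapply(split_obs, function(z) p %in% z, logical(1))
})
ind

# The indicator columns can then be bound to the data for modelling.
toy <- cbind(data.frame(bt = bt), ind)
```

Each column of ind is TRUE where that component is part of the bottom type, so 5 levels become 3 binary predictors here.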

One option I like is to put the less frequent levels into a single category called "other". This leads to more stable coefficient estimates and better performance in predictive modeling. Furthermore (I don't know if it's your case), if you put the model in production and a new level appears, you can simply map it to "other" and avoid problems. Recall the general rule of keeping as few parameters as possible, so in your case (few rows) I advise keeping no more than 3-4 levels for each factor.
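
A minimal base-R sketch of that mapping, with made-up level names and a hypothetical set of levels to keep:

```r
# Levels the model was trained on (hypothetical choice).
keep <- c("RB", "SR")

# New data containing a rare level and a level never seen before.
new_x <- c("RB", "TM", "SR", "XYZ")

# Everything outside the kept levels is mapped to "Other".
mapped <- factor(ifelse(new_x %in% keep, new_x, "Other"),
                 levels = c(keep, "Other"))
mapped
```

Because the factor levels are fixed in advance, prediction will not fail on an unseen level such as "XYZ"; it is simply treated as "Other".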