Figuring out whether to include a categorical variable with many values

I'm doing an Exploratory Data Analysis (EDA) for a given dataset.

29 variables = 23 categorical + 6 continuous

If I do:

library("DataExplorer")
plot_correlation(myds)

I get:

(image: correlation heatmap produced by plot_correlation)

But the tool ignores every categorical variable with more than 20 categories, as you can see below:

> plot_correlation(dataset_cat)
11 features with more than 20 categories ignored!
registrationDateDD: 31 categories
registrationDateHH: 24 categories
catVar_06: 138 categories
catVar_08: 1571 categories
catVar_09: 65 categories
catVar_10: 732 categories
catVar_11: 23 categories
catVar_12: 129 categories
city: 22604 categories
state: 54 categories
zip: 26458 categories

These are the variables (the score plus the categorical ones):

 $ score                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ catVar_01               : Factor w/ 2 levels ...
 $ registrationDateMM      : Factor w/ 9 levels ...
 $ registrationDateDD      : Factor w/ 31 levels ...
 $ registrationDateHH      : Factor w/ 24 levels ...
 $ registrationDateWeekDay : Factor w/ 7 levels ...
 $ catVar_06               : Factor w/ 140 levels ...
 $ catVar_07               : Factor w/ 21 levels ...
 $ catVar_08               : Factor w/ 1582 levels ...
 $ catVar_09               : Factor w/ 70 levels ...
 $ catVar_10               : Factor w/ 755 levels ...
 $ catVar_11               : Factor w/ 23 levels ...
 $ catVar_12               : Factor w/ 129 levels ...
 $ catVar_13               : Factor w/ 15 levels ...
 $ city                    : Factor w/ 22750 levels ...
 $ state                   : Factor w/ 55 levels ...
 $ zip                     : Factor w/ 26659 levels ...
 $ catVar_17               : Factor w/ 2 levels ...
 $ catVar_18               : Factor w/ 2 levels ...
 $ catVar_19               : Factor w/ 3 levels ...
 $ catVar_20               : Factor w/ 6 levels ...
 $ catVar_21               : Factor w/ 2 levels ...
 $ catVar_22               : Factor w/ 4 levels ...
 $ catVar_23               : Factor w/ 5 levels ...

where: { MM: month, DD: day of the month, HH: hour }

My Question

What do you think about this approach? To investigate a possible relationship between each categorical variable and the score, I'm going to group by that variable and calculate the mean score for each group. If the means differ significantly between groups, then most likely that categorical variable has some impact on the score. What I'm trying to figure out is whether it is worth including a given categorical variable in the model (a neural network). Later on, in the training phase, I can handle the categorical variables with many values using one-hot encoding.
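
To make the idea concrete, here is a minimal sketch in base R. The data frame toy and the helper mean_score_by_level are made up for illustration (they are not part of DataExplorer or of my real dataset); the helper only shows the group-by-and-average step, along with the size of each group.

```r
# Hypothetical helper (not part of any package): mean score per level of one factor.
mean_score_by_level <- function(data, cat_col, score_col = "score") {
  # Drop unused factor levels so empty groups do not show up as NA means.
  groups <- droplevels(as.factor(data[[cat_col]]))
  out <- data.frame(
    level      = levels(groups),
    n          = as.vector(table(groups)),
    mean_score = as.vector(tapply(data[[score_col]], groups, mean)),
    stringsAsFactors = FALSE
  )
  # Sort so the levels with the highest mean score come first.
  out[order(-out$mean_score), ]
}

# Toy data standing in for the real dataset; the values are made up.
toy <- data.frame(
  score     = c(0, 0, 1, 1, 1, 0, 1, 1),
  catVar_19 = factor(c("a", "a", "a", "b", "b", "c", "c", "c"))
)

result <- mean_score_by_level(toy, "catVar_19")
print(result)
```

The n column is there because a mean computed over only a handful of rows is noisy, which is likely to happen for many levels of variables such as city or zip.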

Does that make sense? What I want to know is whether I can get some useful information from those categorical variables with many values.

Thanks!

Take a look at the signature for plot_correlation:

plot_correlation(data, type = c("all", "discrete", "continuous"),
  maxcat = 20L, cor_args = list(), geom_text_args = list(),
  title = NULL, ggtheme = theme_gray(),
  theme_config = list(legend.position = "bottom", axis.text.x =
  element_text(angle = 90)))

You can increase maxcat to include more of the variables, but I'd avoid city and zip, which have more than 20,000 categories each.
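
As a rough sketch of that suggestion: the toy data frame below is made up, and the cutoff of 50 is only an illustrative value. The sketch assumes the cutoff is applied to the number of distinct values actually present in each column. It counts those values, drops the columns that exceed the chosen cutoff (the role city and zip would play in your data), and passes the rest to plot_correlation with a raised maxcat. The plotting call is guarded so the snippet still runs when DataExplorer is not installed.

```r
# Toy data standing in for the real dataset; the values are made up.
toy <- data.frame(
  score = rep(c(0, 1), times = 30),
  few   = factor(rep(c("x", "y", "z"), times = 20)),
  many  = factor(rep(sprintf("lvl%02d", 1:30), times = 2)),
  huge  = factor(sprintf("id%02d", 1:60))
)

# Number of distinct values actually present in each factor column.
n_categories <- sapply(Filter(is.factor, toy), function(x) length(unique(x)))

maxcat <- 50L  # illustrative cutoff, above the default of 20L

# Columns with more categories than the cutoff are left out entirely.
too_wide <- names(n_categories)[n_categories > maxcat]
kept <- toy[, setdiff(names(toy), too_wide), drop = FALSE]

# Plot only if DataExplorer is available.
if (requireNamespace("DataExplorer", quietly = TRUE)) {
  DataExplorer::plot_correlation(kept, maxcat = maxcat)
}
```

Keep in mind that every extra category becomes its own row and column in the plot, so a very large maxcat quickly makes the heatmap unreadable.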
