Exploratory Data Analysis (EDA) with categorical variables with many levels

I'm doing an Exploratory Data Analysis (EDA), including different unsupervised analysis techniques, in order to select the right variables for the supervised analysis, which will be done with neural networks (NN). The variable to predict is: score.

nr1 <- nrow(myds)                      # total number of rows
nr2 <- nrow(myds[myds$score != 0, ])   # rows with a non-zero score
nr1
nr2
cat(sprintf('Proportion of rows with "score" different from 0: %.4f', nr2 / nr1))
29 variables = 23 categorical + 6 continuous

Right now I'm focused on the categorical variables. By the way, I already opened a new topic for the continuous variables here.

I split the original dataset, and right now I'm working with a subset where all variables are categorical except score, which I included here for obvious reasons.

When I run:

library("DataExplorer")
plot_correlation(myds)

I get:

[correlation heatmap produced by plot_correlation]

where score (the variable to predict) is in the bottom row (and the first column on the left). I added some green points to highlight where I see some color change in the score row.

Here is more info about the dataset (str output):

 $ score                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ catVar_01               : Factor w/ 2 levels ...
 $ registrationDateMM      : Factor w/ 9 levels ...
 $ registrationDateDD      : Factor w/ 31 levels ...
 $ registrationDateHH      : Factor w/ 24 levels ...
 $ registrationDateWeekDay : Factor w/ 7 levels ...
 $ catVar_06               : Factor w/ 140 levels ...
 $ catVar_07               : Factor w/ 21 levels ...
 $ catVar_08               : Factor w/ 1582 levels ...
 $ catVar_09               : Factor w/ 70 levels ...
 $ catVar_10               : Factor w/ 755 levels ...
 $ catVar_11               : Factor w/ 23 levels ...
 $ catVar_12               : Factor w/ 129 levels ...
 $ catVar_13               : Factor w/ 15 levels ...
 $ city                    : Factor w/ 22750 levels ...
 $ state                   : Factor w/ 55 levels ...
 $ zip                     : Factor w/ 26659 levels ...
 $ catVar_17               : Factor w/ 2 levels ...
 $ catVar_18               : Factor w/ 2 levels ...
 $ catVar_19               : Factor w/ 3 levels ...
 $ catVar_20               : Factor w/ 6 levels ...
 $ catVar_21               : Factor w/ 2 levels ...
 $ catVar_22               : Factor w/ 4 levels ...
 $ catVar_23               : Factor w/ 5 levels ...

where: { MM: month, DD: day of the month, HH: hour }

When I run the plot_correlation command above, it shows some warnings:

> plot_correlation(dataset_cat)
11 features with more than 20 categories ignored!
registrationDateDD: 31 categories
registrationDateHH: 24 categories
catVar_06: 138 categories
catVar_08: 1571 categories
catVar_09: 65 categories
catVar_10: 732 categories
catVar_11: 23 categories
catVar_12: 129 categories
city: 22604 categories
state: 54 categories
zip: 26458 categories

Thinking about this

Before categorical variables get passed to the model, each of them needs to be converted to multiple dummy variables. For example, for the variable state, if there are 55 levels, it will be converted to 54 dummy variables, which is a lot. On top of that, the list above contains other categorical variables with many more levels.
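The dummy expansion can be verified quickly in base R; a minimal sketch with a hypothetical small factor (with an intercept, `model.matrix` absorbs one level as the reference, so n levels give n - 1 dummy columns):

```r
# Hypothetical toy factor standing in for `state`: 3 levels.
f <- factor(c("CA", "NY", "TX", "CA"))

# model.matrix() one-hot encodes the factor; one level becomes
# the reference level, so 3 levels -> 2 dummy columns.
mm <- model.matrix(~ f)
ncol(mm) - 1   # number of dummy columns, excluding the intercept
```

By the same rule, a 55-level state factor expands to 54 dummy columns.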

My Questions

1- Is there any way to extract valuable information from these categorical variables with many levels?

2- What do you think about the following: in order to investigate a possible correlation between each categorical variable and the score, I'm going to group by that variable and then calculate the mean score per group. If there are significant differences between the group means, then most likely that categorical variable has some impact on the score. I don't know if this makes sense or not; I came up with it after feeling frustrated when plot_correlation didn't handle those categorical variables, and I wanted to get information out of them somehow.
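The group-by check described above can be sketched in base R. The tiny data frame below is made up purely for illustration; in practice it would be the real dataset:

```r
# Made-up example data: one categorical variable and the score.
toy <- data.frame(
  state = factor(c("CA", "CA", "NY", "NY", "TX", "TX")),
  score = c(0, 1, 0, 0, 1, 1)
)

# Mean score per level of the categorical variable.
group_means <- aggregate(score ~ state, data = toy, FUN = mean)
group_means

# A formal complement to eyeballing the means: a Kruskal-Wallis
# test for whether score differs across the factor levels.
kruskal.test(score ~ state, data = toy)
```

A large spread between the group means (or a small Kruskal-Wallis p-value) suggests the variable carries information about the score; near-identical means suggest it does not.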

3- What do you think about Binary Encoding for categorical variables with many values? In this article, the author says: "With (for example) only three levels, the information embedded (with One-hot Encoding) becomes muddled. There are many collisions and the model can't glean much information from the features. Just One-hot encode a column if it only has a few values. In contrast, Binary Encoding really shines when the cardinality of the column is higher - with the 50 US states, for example. Binary Encoding creates fewer columns than One-hot Encoding. It is more memory efficient. It also reduces the chances of dimensionality problems with higher cardinality."
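For reference, Binary Encoding can be sketched in base R without any extra package. The `binary_encode` helper below is my own illustration, not from the article: each level's integer code is written out in binary, so n levels need only ceiling(log2(n + 1)) columns instead of n - 1 dummies.

```r
# Sketch of Binary Encoding: map each factor level to its integer
# code, then split that code into binary digits (one column per bit).
binary_encode <- function(f) {
  codes  <- as.integer(f)                     # level codes 1..nlevels(f)
  n_bits <- ceiling(log2(nlevels(f) + 1))     # bits needed to hold the codes
  m <- sapply(seq_len(n_bits),
              function(b) bitwAnd(codes %/% 2^(b - 1), 1))
  colnames(m) <- paste0("bit_", seq_len(n_bits))
  m
}

states <- factor(c("CA", "NY", "TX"))
binary_encode(states)   # 3 levels encoded in 2 bit-columns
```

With the 55-level state variable this gives ceiling(log2(56)) = 6 columns instead of 54 one-hot dummies.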

Regarding Question 3, the point is that before deciding to include a categorical variable in the model, I would need to know whether it's worth it or not. That's why I would like to somehow detect any relation between the categorical variables and the score during the unsupervised analysis phase. Of course, we could train several neural networks with and without the categorical variables and compare the errors, but that's extra work I would like to avoid.

Thanks for your attention!
