I'm doing an `Exploratory Data Analysis (EDA)` for a given dataset.

```
29 variables = 23 categorical + 6 continuous
```

If I do:

```
library("DataExplorer")
plot_correlation(myds)
```

I get a correlation plot, but the tool ignores the categorical variables that have many levels, as you can see below:

```
> plot_correlation(dataset_cat)
11 features with more than 20 categories ignored!
registrationDateDD: 31 categories
registrationDateHH: 24 categories
catVar_06: 138 categories
catVar_08: 1571 categories
catVar_09: 65 categories
catVar_10: 732 categories
catVar_11: 23 categories
catVar_12: 129 categories
city: 22604 categories
state: 54 categories
zip: 26458 categories
```

These are the categorical variables (together with the numeric `score`):

```
$ score : num 0 0 0 0 0 0 0 0 0 0 ...
$ catVar_01 : Factor w/ 2 levels ...
$ registrationDateMM : Factor w/ 9 levels ...
$ registrationDateDD : Factor w/ 31 levels ...
$ registrationDateHH : Factor w/ 24 levels ...
$ registrationDateWeekDay : Factor w/ 7 levels ...
$ catVar_06 : Factor w/ 140 levels ...
$ catVar_07 : Factor w/ 21 levels ...
$ catVar_08 : Factor w/ 1582 levels ...
$ catVar_09 : Factor w/ 70 levels ...
$ catVar_10 : Factor w/ 755 levels ...
$ catVar_11 : Factor w/ 23 levels ...
$ catVar_12 : Factor w/ 129 levels ...
$ catVar_13 : Factor w/ 15 levels ...
$ city : Factor w/ 22750 levels ...
$ state : Factor w/ 55 levels ...
$ zip : Factor w/ 26659 levels ...
$ catVar_17 : Factor w/ 2 levels ...
$ catVar_18 : Factor w/ 2 levels ...
$ catVar_19 : Factor w/ 3 levels ...
$ catVar_20 : Factor w/ 6 levels ...
$ catVar_21 : Factor w/ 2 levels ...
$ catVar_22 : Factor w/ 4 levels ...
$ catVar_23 : Factor w/ 5 levels ...
```

where: `{ MM: month, DD: day of the month, HH: hour }`

## My Question

**What do you think about this:** To investigate a possible relationship between the categorical variables and the `score`, for each categorical variable I'm going to group by it and then calculate the mean `score` for each group. If the means differ substantially between groups, then most likely that categorical variable has some impact on the `score`. What I'm trying to figure out is whether it is worth including a given categorical variable in the model (a `Neural Network`). Later on, in the training phase, I can handle the many levels of some of the categorical variables with `One-hot encoding`.
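The plan above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the toy data frame and the helper `group_summary` are made up for the example (only the column names `score`, `catVar_01` and `catVar_19` come from my dataset), and the Kruskal-Wallis test is just one possible way to judge whether the group means differ by more than chance, not part of the plan itself.

```r
# Toy data standing in for the real dataset (values are invented).
set.seed(42)
myds <- data.frame(
  score     = c(rnorm(50, mean = 0), rnorm(50, mean = 2)),
  catVar_01 = factor(rep(c("a", "b"), each = 50)),
  catVar_19 = factor(sample(c("x", "y", "z"), 100, replace = TRUE))
)

# Hypothetical helper: mean score and group size per level of one factor.
group_summary <- function(df, var, target = "score") {
  means  <- tapply(df[[target]], df[[var]], mean)
  counts <- tapply(df[[target]], df[[var]], length)
  data.frame(
    level      = names(means),
    mean_score = as.numeric(means),
    n          = as.numeric(counts)
  )
}

# Per-level summary for every factor column.
cat_vars  <- names(myds)[sapply(myds, is.factor)]
summaries <- lapply(cat_vars, function(v) group_summary(myds, v))
names(summaries) <- cat_vars

# One p-value per factor from a Kruskal-Wallis rank test
# (an assumption of this sketch; other tests would also work).
p_values <- sapply(cat_vars, function(v) {
  kruskal.test(myds$score, myds[[v]])$p.value
})

print(summaries)
print(sort(p_values))
```

The group size `n` is reported next to each mean because, for variables like `city` or `zip`, many levels will contain only a handful of rows and their means will be very noisy.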

Does that make sense? What I want to know is whether I can get some useful information from those categorical variables with many values.

Thanks!