Are correlations between target-encoded variables meaningful?

Andrew_Mcmahon · December 23, 2022, 7:23pm

I love the embed package. It led me to a question that I'm having a hard time answering on my own: are Pearson correlations between likelihood-encoded variables meaningful? That is, can they tell me something about the relationships between the transformed factors and the target variable? Or is this not a good way to proceed?

data(iris)
library(tidymodels)
library(embed)

recipe(Sepal.Length ~ ., data = iris) %>%
  step_lencode_glm(Species, outcome = vars(Sepal.Length)) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  cor()

For instance, is the correlation between the encoded Species term and the other variables in the example above meaningful?

technocrat · December 24, 2022, 1:26am

Probably not. All that seems to happen is that three different character values for Species are replaced with three different numeric values. Unless those values indicate some difference among Species that isn't already there, it's hard to see what has changed.

Here, after is the baked output of the recipe, before it is passed on to cor().

> table(after$Species)

5.006 5.936 6.588 
   50    50    50 
> table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
>
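
The three encoded values in the table look like the per-species means of Sepal.Length. As a minimal sketch in base R, assuming that is what the encoding produced here, the following rebuilds the column by hand and compares its correlation with the outcome to a one-way fit on the raw factor. The names species_means, encoded, r_encoded and r2_factor are made up for illustration and are not part of embed or recipes.

```r
# Per-species mean of the outcome; compare with the encoded values
# shown in the table above.
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Hand-rolled stand-in for the encoded column (an assumption, not the
# embed implementation): each row gets the mean of its own species.
encoded <- as.numeric(species_means[as.character(iris$Species)])

# Pearson correlation of the stand-in column with the outcome.
r_encoded <- cor(encoded, iris$Sepal.Length)

# R-squared from a one-way linear model on the untransformed factor.
r2_factor <- summary(lm(Sepal.Length ~ Species, data = iris))$r.squared

# The squared correlation and the R-squared should agree, because the
# fitted values of the one-way model are exactly the group means.
all.equal(r_encoded^2, r2_factor)
```

If the two numbers agree, the correlation between the encoded Species column and Sepal.Length carries the same information as the factor itself did, which is consistent with the point that nothing new has been added by the recoding.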
