Hierarchical and Nested NAs in Tidymodels

PrinterOnFire · July 30, 2021, 11:22am

Hi everyone!

I have a question (maybe more theoretical than practical, but I really need a practical solution).

I have a dataset with a number of "nested"/"hierarchical" NAs: when a respondent answers "No" to one question, they skip the follow-up question, which produces an NA in the dataset. I decided to model the missingness itself as a new indicator variable, since it carries information.

The only problem is that some of my features are dummies derived from the same question, for example:

Question K is a dummy where 100 respondents answered "No", so they have to skip question A.
Question A is split into A1, A2, and A3, all three of which are dummies, and each of them has the same 100 NAs.

Thus, they have exactly the same number of NAs (in the same rows), and those missing values carry the same information. Here is my question: how can I handle this?

The approach which I have tried so far is the following:

Create a missingness indicator with "step_indicate_na" for only one of the linked variables (say, A1), and then assign 0 to all the missing values in A1, A2, and A3 (I assume a missing value can be treated as a 0 here), using the function from this issue: Replace missing values with a constant · Issue #473 · tidymodels/recipes · GitHub
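To make this first approach concrete, here is a minimal sketch. The toy data frame, the outcome column y, and the use of step_mutate() with dplyr::across() in place of the custom function from the linked issue are all assumptions for illustration, not the code from that issue.

```r
library(recipes)

# Hypothetical toy data: K = 0 means "No", so A1-A3 were skipped (NA)
df <- data.frame(
  K  = c(0, 0, 0, 0, 1, 1, 1, 1),
  A1 = c(NA, NA, NA, NA, 1, 0, 1, 0),
  A2 = c(NA, NA, NA, NA, 1, 1, 0, 0),
  A3 = c(NA, NA, NA, NA, 0, 1, 1, 0),
  y  = c(1.2, 0.8, 1.1, 0.9, 2.3, 1.9, 2.1, 1.7)  # made-up outcome
)

rec <- recipe(y ~ ., data = df) %>%
  # One indicator is enough because A1, A2 and A3 are missing in the same rows
  step_indicate_na(A1) %>%
  # Replace the NAs with the constant 0 (assumes "skipped" can be read as 0)
  step_mutate(
    dplyr::across(c(A1, A2, A3), ~ ifelse(is.na(.x), 0, .x))
  )

baked <- bake(prep(rec), new_data = NULL)
baked
```

The indicator step comes before the replacement step on purpose: once the NAs have been overwritten with 0, there is nothing left for step_indicate_na() to detect.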

The other approach I thought of (to avoid custom functions) is to use "step_unknown", but it would create a set of identical columns, and I don't know whether that would add too much noise (maybe applying "step_corr" afterwards would make the problem disappear, but I don't know for sure whether that is the right approach).
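A rough sketch of this second approach, to show what the identical columns look like and what "step_corr" does with them. The toy data, the outcome y, the factor encoding of A1-A3, and the 0.9 threshold are assumptions for illustration. Note that step_unknown() works on factor or character columns, so the dummies are stored as factors here.

```r
library(recipes)

# Hypothetical toy data: the first four respondents skipped question A
lv <- c("No", "Yes")
df2 <- data.frame(
  A1 = factor(c(NA, NA, NA, NA, "Yes", "No", "Yes", "No",
                "Yes", "No", "Yes", "No"), levels = lv),
  A2 = factor(c(NA, NA, NA, NA, "Yes", "Yes", "No", "No",
                "Yes", "Yes", "No", "No"), levels = lv),
  A3 = factor(c(NA, NA, NA, NA, "Yes", "Yes", "Yes", "Yes",
                "No", "No", "No", "No"), levels = lv),
  y  = c(1.2, 0.8, 1.1, 0.9, 2.3, 1.9, 2.1, 1.7, 2.4, 1.6, 2.0, 1.5)
)

rec2 <- recipe(y ~ ., data = df2) %>%
  # Turn NA into an explicit "unknown" factor level
  step_unknown(A1, A2, A3) %>%
  # Creates A1_unknown, A2_unknown, A3_unknown, which are identical columns
  step_dummy(A1, A2, A3) %>%
  # Drop predictors whose absolute pairwise correlation exceeds the threshold
  step_corr(all_numeric_predictors(), threshold = 0.9)

baked2 <- bake(prep(rec2), new_data = NULL)
names(baked2)
```

Because the three "unknown" dummies are perfectly correlated with each other, the correlation filter should leave only one of them. Which one survives is up to the filter, so if a specific column should carry the missingness signal, the first approach gives more control.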

Can someone help me? Thank you very much in advance!
