Hi everyone!
I have a question (maybe more theoretical than practical, but I really need a practical solution).
I have a dataset with a bunch of "nested"/"hierarchical" NAs, meaning variables where respondents who answered "No" to the previous question skip the following one (generating an NA in the dataset). For the NA part, I decided to model the NA as a new variable (given that it is a piece of information).
The only problem is that I have some features which are dummies derived from the same variable, e.g.:
- Question K is a dummy where 100 respondents answered "No", so they skip question A
- Question A is split into A1, A2 and A3; all three are dummies, and all three have the same 100 NAs.
Thus, they share exactly the same NAs, and those missing values carry the same information. Here is my question: how can I handle this?
The approach which I have tried so far is the following:
- Model the NA using "step_indicate_na" for only one of the connected variables (let's say A1), and then assign "0" to all the missing values (I know that a missing value here could be treated as a 0), using the function from this issue: Replace missing values with a constant · Issue #473 · tidymodels/recipes · GitHub
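
To make the approach above concrete, here is a minimal sketch of it. The toy data frame "survey" and the outcome "y" are made up for illustration, and the plain "ifelse()" fill inside "step_mutate" is only my stand-in for the custom function from the linked issue, so please read it as a sketch under those assumptions and not as the recommended way.

```r
library(recipes)

# Hypothetical toy data: K gates question A, so respondents with
# K == 0 skipped A1, A2 and A3 and have NA in all three.
survey <- data.frame(
  y  = c(1, 0, 1, 0, 1, 0),
  K  = c(1, 1, 1, 1, 0, 0),
  A1 = c(1, 0, 0, 1, NA, NA),
  A2 = c(0, 1, 0, 0, NA, NA),
  A3 = c(0, 0, 1, 1, NA, NA)
)

rec <- recipe(y ~ ., data = survey) %>%
  # One shared missingness indicator, built from A1 only.
  # This step must come before the fill, while the NAs still exist.
  step_indicate_na(A1) %>%
  # Replace the structural NAs with 0 in all three dummies
  # (stand-in for the custom constant-imputation function).
  step_mutate(
    A1 = ifelse(is.na(A1), 0, A1),
    A2 = ifelse(is.na(A2), 0, A2),
    A3 = ifelse(is.na(A3), 0, A3)
  )

baked <- bake(prep(rec), new_data = NULL)
baked
```

In this toy data the new "na_ind_A1" column is just the complement of K. If the same holds in my real data (only the "No" respondents skip A), the indicator may not add anything beyond K itself.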
The other approach that I thought of (to avoid custom functions) is to use "step_unknown", but it would create a bunch of identical columns and I don't know if that would add too much noise (maybe applying "step_corr" afterwards would make the problem disappear, but I don't know for sure if that is the right approach).
Can someone help me? Thank you very much in advance!