As a PhD student starting a research project with the Youth Risk Behavior Surveillance (YRBS) dataset, I am struggling with a basic question: how do I properly handle a complex survey design within the tidymodels framework?
My goal is to develop predictive models that identify factors associated with suicidality among adolescents in the United States, so I have been exploring the tidymodels ecosystem. Dr. Kuhn's tutorial on incorporating case weights using the hardhat package made me realize that accurate inferential results require accounting for more than the case weights, namely the primary sampling units (PSUs) and strata.
However, my emphasis is on predictive performance rather than inferential statistics. That leaves me with a dilemma: should I model the data while ignoring the survey design, or should I try to incorporate the design variables and weights despite the potential computational challenges? The second option is harder for me because I am not sure how to incorporate them using tidymodels.
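To make the question concrete, here is a minimal sketch of the weights-only route, run on simulated data rather than the real YRBS file. The data frame `yrbs_sim` and the columns `psu`, `svy_wt`, `age`, `sad`, and `suicidal` are hypothetical placeholders. The sketch assumes the survey weight can be passed as `hardhat::importance_weights()` and that keeping each PSU inside a single fold with `rsample::group_vfold_cv()` is a reasonable stand-in for respecting the clustering. It ignores strata entirely, which is part of what I am unsure about.

```r
library(tidymodels)

set.seed(123)
n <- 400

# Simulated stand-in for the survey data; every column name is hypothetical.
yrbs_sim <- tibble(
  psu      = sample(1:40, n, replace = TRUE),    # primary sampling unit id
  svy_wt   = runif(n, 0.5, 3),                   # raw survey weight
  age      = sample(14:18, n, replace = TRUE),
  sad      = rbinom(n, 1, 0.3),
  suicidal = factor(rbinom(n, 1, 0.2), levels = c(1, 0), labels = c("yes", "no"))
)

# Mark the weight column as importance weights so workflows can recognize it.
yrbs_sim <- yrbs_sim |>
  mutate(svy_wt = hardhat::importance_weights(svy_wt))

# Grouped cross-validation: all rows from one PSU stay in the same fold.
folds <- group_vfold_cv(yrbs_sim, group = psu, v = 5)

# Logistic regression fit with the survey weights as case weights.
wf <- workflow() |>
  add_case_weights(svy_wt) |>
  add_formula(suicidal ~ age + sad) |>
  add_model(logistic_reg())

res <- fit_resamples(wf, resamples = folds, metrics = metric_set(roc_auc))
collect_metrics(res)
```

As far as I understand, importance weights only influence model fitting here and are not applied when the performance metrics are computed, so the reported ROC AUC is unweighted. Because the weights are not integers, `glm()` may also warn about non-integer successes.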
Any thoughts or guidance would be appreciated.