Assistance for a somewhat new user, lots of questions, looking for someone to talk through my model design and help me understand what to do next

Hello! I'm a data analyst for a small state agency and I've been tasked with updating a previous analysis I did of the uninsured population in my state to incorporate COVID-19 job losses (ie: answer the question "how did the loss of employer sponsored health insurance affect the uninsured rates in my state?") I don't have anyone at work to talk through my issues or help me design my analysis and just found out I need to go out on medical leave next week, so I'm trying to power through this quickly. I apologize if this isn't the right place to post this, I'm just at my wits end trying to figure this out by myself and don't have time to teach myself everything like I normally do. For clarity, I need someone to talk to who is familiar with Rstudio as a whole, the tidyverse, leaflet/GIS packages, and using census data, in particular IPUMS data imported via ipumsr package.

So, I've created an ACS extract at the PUMA (public use microdata area) geography that has several demographic characteristics of interest (age, poverty level, race, citizenship, employment status) as well as all the health insurance related variables I thought might be helpful (Any health insurance coverage, public health coverage, private health coverage, employer-based coverage, direct=purchase coverage, VA coverage, indian-health-services coverage, medicaid, medicare). I've pulled this all into Rstudio using the ipumsr package and I've approached aggregating the person-level data up to the PUMA geography level several ways so far. The best approach I've found was to use a basic group_by() then weighted count() to generate summary estimates of the number of people with health coverage and without health coverage for each of the health insurance variables I listed earlier, using spread() to create two columns. Then I use a left_join() to add this to a table so that I have several columns that are aggregated totals of estimates for each health insurance category, by PUMA. There absolutely has to be a better way to do this, but this was the best I've come up with.

However, this only gives me part of what I need. I now need to figure out the best way to adjust these estimates for unlawfully present migrants and those with offers of employer-sponsored insurance (which I have an idea how to do both, but am unsure how to implement it in R). The eventual goal is to have maps showing the geographic distribution of uninsured individuals in my state pre-COVID job losses and post COVID job losses, with basic rough estimates of how the uninsured rate has changed. I have access to unemployment data, BLS labor force data, and have also downloaded relevant Current Population Survey data from IPUMS. But I need to understand better how to do the things I want to in R before I can even incorporate all of that. I've come up with a few questions that I think will help me sort through this if I have someone with experience I can talk to.

  1. If I am using IPUMS data, would it be better to exclude certain characteristics likely to be unlawfully present migrants OR to adjust aggregated totals in some fashion?

  2. What is the best way to aggregate the data up to PUMA level? The way I've devised is obviously incredibly clunky and inelegant.

  3. Once I have it aggregated, should I try to do maps in Rstudio or just import the aggregated dataset into ArcGIS INSIGHTS, which I have access to?

  4. I have a crosswalk file to transform the PUMA geography to county, which would allow me to then use the county-level unemployment and industry data to model how job losses in specific industry sectors would affect the uninsured rate (basically, taking the uninsured estimates, broken down by industry sector, and then looking at the job losses in each county in each sector to see where the most people are losing employer-based insurance and then also looking at eligibility for medicaid or qualified health plans on the exchange, with the goal of figuring out an estimate of who would still be left uninsured from that pool of people who lost employer-based health insurance). But I'm unsure how valid that would be since I'm taking a larger geography to a smaller geography, which introduces more error, and breaking it out over all these variables, which introduces even more error. I need someone to help me understand the best way to approach that. Should I work back the other way, trying to somehow aggregate the BLS and unemployment data from county to PUMA? From Metropolitan Statistical Area to PUMA?

I have all of what I've done so far saved in a basic R script that I can show to whoever can assist me. I'm very hesitant to put anything online that could tie back to me personally due to privacy concerns, but I can try to create a reprex from my code so far if that's necessary or helpful. I'll work on that today and hopefully some caring and patient soul sees this and wants to help me. I know there are ways to do everything I want to do in R but I'm so overwhelmed right now and the time crunch due to going out for surgery soon means I can't waste any more time. As you can see, I need a lot more help than a simple forum post and would obviously cite whoever helps me in the eventual publication of this analysis (which will be on my agency's website). I'm trying to do this in as transparent and open a way as possible so that other state agencies in other states could replicate my analysis or use it as a basis for their own.

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.