Trouble merging subsets by ID - one subset is different (individual vs household)

dplyr
rstudio
merge
#1

Hi all

I’m an R newbie working on secondary data and I need help, please. Trying to merge a selection of 6 subsets of numerous variables each but for the same group of individual persons - I had downloaded it from Nesstar (longitudinal study of individuals within family). Whilst merging I’m selecting chosen variables only, to create my final dataframe with around 18k rows and 6 columns. It's supposed to be 6 variables per each individual.

Thing is, every subset but one can be merged easily by variable 1 "ID number" of each individual person I need. The last one is suddenly about the whole family unit that lives in the household of the individual, and every person living in that household is given the ID of the individual.. so individual ID becomes household ID in this one subset only. Obviously when merging by ID the dataframe ended up being quite wildly increased in rows (from 18k to 76k).

In regular 5 subsets it looks like this (n=18k):

Var1  Var2  Var3 	Var4
ID1	   y   1999 	11
ID2 	n 	1988 	3
ID3 	y 	1977 	9
ID4 	y	1989	17
Etc etc etc

In the last 1 subset it suddenly looks like this (n=76k):

Var1   Var2 Var3 Var4
ID1 	y 	1999 	1
ID1     n 	1990	 8
ID1 	n 	1997 	11
ID2 	n 	1989 	3
ID2 	n 	1999 	6
ID3 	y 	1910 	9
ID3 	y 	1954   17
ID3 	y 	1999 	7
ID3 	y 	1977   17

The last subset is key for me, because a variable (Var4) that is there is needed to select my cases and controls from the final dataset of 18k. If the Variable 4 value is 17 I need to keep it for cases, if it’s anything else it’s a potential control, as I am doing propensity score matching. These individuals in the household are not given any individual ID, and are only entered into the dataframe via relation to the main individual. I can’t establish my cases sample size without using this Variable 4.

I’m at a loss as to how I can merge all 6 subsets into my final dataframe without losing the data that I need for case-control PSM.

Would anyone please advise?

0 Likes

#2

I'm not sure I understand what it is you're trying to achieve here but if all you need is a unique identifier for each individual (i.e. row) within your last subset and don't care about the household ID, could you not just do something like:

library(tidyverse)
last_subset <-
data.frame(stringsAsFactors = FALSE,
        Var1 = c("ID1", "ID1", "ID1", "ID2", "ID2", "ID3", "ID3", "ID3", "ID3"),
        Var2 = c("y", "n", "n", "n", "n", "y", "y", "y", "y"),
        Var3 = c(1999, 1990, 1997, 1989, 1999, 1910, 1954, 1999, 1977),
        Var4 = c(1, 8, 11, 3, 6, 9, 17, 7, 17)
)


last_subset %>%
  # maintain HouseholdID if important
  rename(HouseholdID = Var1) %>% 
  # Add unique ID for each individual/row
  mutate(Var1 = paste0('ID',row_number())) 
#>   HouseholdID Var2 Var3 Var4 Var1
#> 1         ID1    y 1999    1  ID1
#> 2         ID1    n 1990    8  ID2
#> 3         ID1    n 1997   11  ID3
#> 4         ID2    n 1989    3  ID4
#> 5         ID2    n 1999    6  ID5
#> 6         ID3    y 1910    9  ID6
#> 7         ID3    y 1954   17  ID7
#> 8         ID3    y 1999    7  ID8
#> 9         ID3    y 1977   17  ID9

Created on 2019-05-16 by the reprex package (v0.2.1)

1 Like