Using left_join with an if condition (the condition has length >1 and only the first element will be used)

Hello everyone,

I am new to R, so apologies if this question is naive. I first want to match my data on age, gender and highest degree (using left_join in dplyr). Some rows have no corresponding value in the right data frame, so for the rows where p_id is NA after that join, I want to match on age and gender only. My code is below (it does not do what I describe). Do you know why I am getting the warning "the condition has length >1 and only the first element will be used"? And can you offer some hints on how to achieve this? Thanks in advance :slight_smile:

matching_function <- function(demobel_df, monitor_data){
  matching_df <- monitor_data %>% select(p_id, AgeGroupMethodology, Gender, AgeExact, HighestDegreeCat, WegingPop)
  
  demobel_matched <- left_join(demobel_df, matching_df, by = c("ageGroup" = "AgeGroupMethodology", "genderN" = "Gender", "degree" = "HighestDegreeCat"))
  
  if (is.na(demobel_matched$p_id)) {
    demobel_matched <- left_join(demobel_matched, matching_df, by = c("ageGroup" = "AgeGroupMethodology", "genderN" = "Gender"))
  } else {
    demobel_matched <- demobel_matched
  }
  
  demobel_matched$ageDiff <- abs(demobel_matched$age - demobel_matched$AgeExact)
  
  #Order first according to id, then weight, then ageDiff
  demobel_matched<- demobel_matched[order(demobel_matched[,'personID'],demobel_matched[,'WegingPop'], demobel_matched[,'ageDiff']),]
  
  return(demobel_matched)
}

demobel_matched <- matching_function(demobel_adults, adult_individuals)

if expects a length-one logical condition (see its help file); it does not operate row by row. Here is.na(demobel_matched$p_id) returns one value per row, so only the first one is used. One way to do what you're looking for:

library(dplyr)

# Reference rows that have a p_id: match on age group, gender and degree
matching_df1 <- monitor_data %>% 
   select(p_id, AgeGroupMethodology, Gender, AgeExact, HighestDegreeCat, WegingPop) %>%
   filter(!is.na(p_id))

demobel_matched1 <- left_join(demobel_df, matching_df1, by = c("ageGroup" = "AgeGroupMethodology", "genderN" = "Gender", "degree" = "HighestDegreeCat"))

# Reference rows without a p_id: match on age group and gender only
matching_df2 <- monitor_data %>% 
   select(p_id, AgeGroupMethodology, Gender, AgeExact, HighestDegreeCat, WegingPop) %>%
   filter(is.na(p_id))

demobel_matched2 <- left_join(demobel_df, matching_df2, by = c("ageGroup" = "AgeGroupMethodology", "genderN" = "Gender"))

# bind_rows() fills columns that exist on only one side with NA
demobel_matched <- bind_rows(demobel_matched1, demobel_matched2)

This approach splits the data into two dataframes based on whether p_id is missing, performs the joins, and binds them back together.
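To make the length-one point concrete, here is a minimal sketch on a made-up vector standing in for demobel_matched$p_id (the values are invented for illustration). It shows that is.na() returns one logical per element, and two common ways to work with that: collapse it with any() before handing it to if, or use the logical vector directly for indexing. Note that from R 4.2.0 onward, passing a condition of length greater than one to if is an error rather than a warning.

```r
# Toy vector standing in for demobel_matched$p_id (values are made up)
p_id <- c(101, NA, 103, NA)

# is.na() is vectorised: it returns one logical value per element
missing <- is.na(p_id)

# if () needs a single TRUE/FALSE, so collapse the vector first
if (any(missing)) {
  message(sum(missing), " rows still need a fallback match")
}

# For row-wise decisions, use the logical vector for indexing instead of if ()
fallback_rows <- which(missing)
```

In a data frame, the same logical vector is what filter(is.na(p_id)) uses to pick out the rows that need a second join.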

Thanks Will!

However, p_id is only generated after the left_join, so your code does not work for my case. But thanks to your hints, I have found a solution, and I will share my code here :slight_smile:

#Match the demoBel data to the MONITOR data for predicting individual activity chains
#Get the variables needed for matching from the MONITOR data
matching_df <- adult_individuals %>% select(p_id, AgeGroupMethodology, Gender, AgeExact, HighestDegreeCat, WegingPop)

#Do the matching based on age, gender and the highest degree
demobel_matched <- left_join(demobel_adults, matching_df, by = c("ageGroup" = "AgeGroupMethodology", "genderN" = "Gender", "degree" = "HighestDegreeCat"))

#Most individuals are matched now, but some "age + gender + highestDegree" combinations are not available in the MONITOR data
#Hence, first select these individuals (p_id is NA), then drop the p_id, AgeExact and WegingPop columns added by the previous left_join
demobel_unmatched <- demobel_matched %>% 
  filter(is.na(p_id)) %>% 
  select(-c(p_id, AgeExact, WegingPop))

#Do the same matching based on the age and gender combination
demobel_unmatched <- left_join(demobel_unmatched, matching_df, by = c("ageGroup" = "AgeGroupMethodology", "genderN" = "Gender"))

#Drop the extra HighestDegreeCat column, which is carried over because it is no longer a join key
demobel_unmatched <- demobel_unmatched %>% 
  select(-HighestDegreeCat)

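#Bind the two tables into one table
#Keep only the rows matched in the first join, so the re-matched individuals are not duplicated as NA rows
demobel_matched <- rbind(filter(demobel_matched, !is.na(p_id)), demobel_unmatched)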
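For anyone who wants to try the two-stage fallback without the original data, here is a minimal sketch on toy tables. The data frames people and survey and their values are hypothetical stand-ins for demobel_adults and adult_individuals; only the key column names are taken from the thread. It uses inner_join for the strict match and anti_join to pick out the individuals who still need the looser match, which is one possible alternative to filtering on is.na(p_id), not the only way to do it.

```r
library(dplyr)

# Hypothetical stand-ins for demobel_adults and adult_individuals
people <- tibble(
  personID = 1:3,
  ageGroup = c("18-34", "18-34", "35-64"),
  genderN  = c(1, 2, 1),
  degree   = c("high", "low", "high")
)
survey <- tibble(
  p_id                = c(11, 12, 13),
  AgeGroupMethodology = c("18-34", "18-34", "35-64"),
  Gender              = c(1, 2, 1),
  HighestDegreeCat    = c("high", "high", "low")
)

full_keys  <- c("ageGroup" = "AgeGroupMethodology",
                "genderN"  = "Gender",
                "degree"   = "HighestDegreeCat")
loose_keys <- full_keys[1:2]  # age group and gender only

# Stage 1: strict match on age group, gender and degree
stage1 <- inner_join(people, survey, by = full_keys)

# Stage 2: individuals without a strict match fall back to age group and gender
stage2 <- people %>%
  anti_join(survey, by = full_keys) %>%
  left_join(select(survey, -HighestDegreeCat), by = loose_keys)

# Combine both stages; each individual comes from exactly one of them
matched <- bind_rows(stage1, stage2) %>%
  arrange(personID)
```

Because stage 2 uses left_join, an individual with no match even on age group and gender would still appear once, with p_id set to NA.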