Replace Specific NAs with Values from Another Row by Column

kpivert · March 12, 2020, 2:58pm

Hello RStudio Community,

I'm a long-time reader and first-time question asker and am hoping you may be able to point me to a method to address an unusual case (to me at least).

I'd like to replace NA values from one row with values from another row for certain columns in a data frame. The motivation is survey data where there are multiple responses (in various stages of completion) for several individuals that I'd like to condense into one record per individual.

I've attached a toy example and function (my_fun) below, which returns the desired output for one column at a time. What I'm having trouble with is expanding this into a function that I can purrr::map() on multiple records (rows) and specific columns. I've outlined a goal_fun to hopefully demonstrate what I'm trying to do (realize the rlang syntax is incorrect).

While I've done my due diligence on SO and RSC, I haven't found an answer to address this. I suspect there is an elegant solution that I'm not seeing, and would greatly appreciate any recommendations you may have.

Thanks for reading.

Kurtis

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr) 

(df <- tibble(
  guid = c("Bob_1", "Bob_2", "Judy"),
  like_candy = c(NA, "Yes", "No"),
  like_coffee = c(NA, "No", "Yes"),
  like_beer = c("Yes", "Yes", "No"),
  like_wine = c(NA, "Yes", NA)
))
#> # A tibble: 3 x 5
#>   guid  like_candy like_coffee like_beer like_wine
#>   <chr> <chr>      <chr>       <chr>     <chr>    
#> 1 Bob_1 <NA>       <NA>        Yes       <NA>     
#> 2 Bob_2 Yes        No          Yes       Yes      
#> 3 Judy  No         Yes         No        <NA>

my_fun <- function(dat, guid, like_col) {
  
  df <- dat
  g <- guid
  
  df %>%
    mutate_at(
      vars({{ like_col }}),
      ~case_when(
        guid == g ~
          df %>%
          filter(guid == "Bob_2") %>%
          pluck({{ like_col }}),
        TRUE ~ .data[[{{ like_col }}]]
      )
    )
}

# Expected Result
my_fun(df, "Bob_1", "like_wine")
#> # A tibble: 3 x 5
#>   guid  like_candy like_coffee like_beer like_wine
#>   <chr> <chr>      <chr>       <chr>     <chr>    
#> 1 Bob_1 <NA>       <NA>        Yes       Yes      
#> 2 Bob_2 Yes        No          Yes       Yes      
#> 3 Judy  No         Yes         No        <NA>

# This is the goal. For illustration purposes only.
goal_fun <- function(dat, target_record, source_record, cols_to_fix) {
  
  df <- dat
  c <- enquos(cols_to_fix)
  
  df %>%
    mutate_all(
      vars(!!!c),
      ~case_when(
        guid == target_record ~
        df %>%
          filter(guid == source_record) %>%
          pluck(!!!c), # Realize this is not proper syntax 
        TRUE ~ .data[[!!!c]] # Ditto
      )
    )
}

^{Created on 2020-03-12 by the reprex package (v0.3.0)}

aosmith · March 12, 2020, 3:09pm

While you may need something more complex, this is the kind of job that fill() from tidyr can be really useful for. For example, you could separate out the numbers from the id's with separate() and then fill columns within groups by using fill() on a grouped dataset.

In my example I chose to fill up and then down, but not sure if this will matter in your real case.

library(tidyr)
library(dplyr)

df %>%
     separate(guid, into = c("guid", "obs"), fill = "right" ) %>%
     group_by(guid) %>%
     fill(like_candy:like_wine, .direction = "updown")

# A tibble: 3 x 6
# Groups:   guid [2]
# guid  obs   like_candy like_coffee like_beer like_wine
# <chr> <chr> <chr>      <chr>       <chr>     <chr>    
# 1 Bob   1     Yes        No          Yes       Yes      
# 2 Bob   2     Yes        No          Yes       Yes      
# 3 Judy  NA    No         Yes         No        NA

kpivert · March 12, 2020, 4:08pm

Thanks very much for the quick reply (and reminder about fill()), @aosmith. This indeed returns the desired result. However, because the data frame has multiple duplicate records, I'd like to apply this fix only to the Bob responses and not to the multiple Judy responses (realize I should have included more than 1 Judy record in my reprex), if that makes sense.

Is there a way to apply your solution in that manner?

Again, thanks for your help.

Kurtis

dromano · March 12, 2020, 4:42pm

Just for clarification, do you mean you might have, say, several Bobs which may be distinct people, so you'd want to be able to fix those records associated with a specific Bob? Or do you mean you may simply want to deal with Judy in a way that's different from the way you deal with Bob? (Maybe that's the same question, twice?)

kpivert · March 12, 2020, 4:59pm

Thanks for your comment, @dromano, and apologies for the confusion.

It's the latter, and I hope the following clarifies my meaning. Because some of the missing responses are either entry errors or missing not at random I don't want to apply the fill() solution @aosmith nicely outlined to every individual, but only to certain individuals for certain columns. The dataset consists of 216 columns (question responses) and 71 responses (rows) from 38 individuals.

Again, I hope this clarifies my intention.

Thanks again for reading and your feedback.

Kurtis

aosmith · March 12, 2020, 5:59pm

Oh, I think I see. I wonder if you could use fill() somehow within your current approach, where you have defined certain identifiable groups along with the column names that you want filled.

Maybe update your original question to show a more extensive example? I'd also be curious to know how you've stored the information for groups that need the filling and which columns it is relevant to in each group.

dromano · March 13, 2020, 4:13am

Here's a possibility that's a mixture of your and @aosmith's code, and which allows you to tab-complete column choices if you pipe df to it:

library(tidyverse)
(df <- tibble(
  guid = c("Bob_1", "Bob_2", "Judy"),
  like_candy = c(NA, "Yes", "No"),
  like_coffee = c(NA, "No", "Yes"),
  like_beer = c("Yes", "Yes", "No"),
  like_wine = c(NA, "Yes", NA)
))
#> # A tibble: 3 x 5
#>   guid  like_candy like_coffee like_beer like_wine
#>   <chr> <chr>      <chr>       <chr>     <chr>    
#> 1 Bob_1 <NA>       <NA>        Yes       <NA>     
#> 2 Bob_2 Yes        No          Yes       Yes      
#> 3 Judy  No         Yes         No        <NA>

goal_fun <- 
  function(dat, target_record, source_record, cols_to_fix) {
    cols_to_fix <- enexpr(cols_to_fix)
    df %>% 
      mutate(
        paired = 
          case_when(
            guid %in% c(target_record, source_record) ~ 'paired_records',
            TRUE ~ guid
          )
      ) %>% 
      group_by(paired) %>% 
      fill(!!cols_to_fix, .direction = 'updown') %>% 
      ungroup() %>% 
      select(-paired)
  }

goal_fun(df, 'Bob_1', 'Bob_2', like_candy:like_coffee)
#> # A tibble: 3 x 5
#>   guid  like_candy like_coffee like_beer like_wine
#>   <chr> <chr>      <chr>       <chr>     <chr>    
#> 1 Bob_1 Yes        No          Yes       <NA>     
#> 2 Bob_2 Yes        No          Yes       Yes      
#> 3 Judy  No         Yes         No        <NA>
df %>% goal_fun('Bob_1', 'Bob_2', like_candy:like_coffee)
#> # A tibble: 3 x 5
#>   guid  like_candy like_coffee like_beer like_wine
#>   <chr> <chr>      <chr>       <chr>     <chr>    
#> 1 Bob_1 Yes        No          Yes       <NA>     
#> 2 Bob_2 Yes        No          Yes       Yes      
#> 3 Judy  No         Yes         No        <NA>

^{Created on 2020-03-12 by the reprex package (v0.3.0)}
Does this work in your general case?

kpivert · March 13, 2020, 1:38pm

@dromano, bingo!! I tested with the actual data and it works perfectly. Thanks very much to you and @aosmith for your help. I really appreciate it.

system · March 20, 2020, 1:38pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.