Combining Two Rows using the Tidyverse

dlsweet · September 5, 2019, 3:01pm

I am working with a dataset of client visits and am having trouble with missing data. I'm trying to get one row per unique client, but some clients come back with multiple rows. The data causing the problem looks like this:

Client        Gender      Race
   A             M        White
   A             NA       White
   B             F        African American
   B             F        NA

How would I write over the NA based on client code?

mfherman · September 5, 2019, 3:13pm

I'm not sure if I totally understand your desired output, but maybe it's something like this? For each client, fill NAs in the Gender and Race columns with the values above and then get distinct rows.

library(tidyverse)

df <- tribble(
  ~Client, ~Gender, ~Race,
  "A", "M", "White",
  "A", NA , "White",
  "B", "F", "African American",
  "B", "F", NA
)


df %>% 
  group_by(Client) %>% 
  fill(Gender, Race) %>% 
  distinct()
#> # A tibble: 2 x 3
#> # Groups:   Client [2]
#>   Client Gender Race            
#>   <chr>  <chr>  <chr>           
#> 1 A      M      White           
#> 2 B      F      African American

Created on 2019-09-05 by the reprex package (v0.3.0)

dlsweet · September 5, 2019, 3:34pm

Thanks! That's exactly what I was trying to do, and it's great to know about the fill function. It didn't work for every single client, so I'll have to figure out what's going on with the other rows that still have NA values.
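
One way to track those rows down, as a rough sketch: run the same fill and distinct steps, then keep only the clients that still have more than one row. The toy data here is made up for illustration (client B has its NA above the known value), and `leftover` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Made-up data: client B's missing Race sits ABOVE the known value
df <- tribble(
  ~Client, ~Gender, ~Race,
  "A", "M", "White",
  "A", NA,  "White",
  "B", "F", NA,
  "B", "F", "African American"
)

leftover <- df %>%
  group_by(Client) %>%
  fill(Gender, Race) %>%   # default direction is "down"
  distinct() %>%
  filter(n() > 1) %>%      # clients that still have several rows
  ungroup()

leftover
```

Any client listed in `leftover` either has an NA that the fill could not reach or has genuinely conflicting values.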

mfherman · September 5, 2019, 3:38pm

Hmm, you might take a look at the ordering of the NAs in your data frame. By default, fill() fills "down" (i.e. it replaces NAs with the value from the closest preceding row). You could try fill(var, .direction = "downup"), which fills down first and then up, so it picks up replacement values both above and below. That is safe if you know that each client only has one correct gender and race.

Here is an example where the NA in the race column is above the value you want to fill and so you need to specify the direction for it to fill appropriately.

library(tidyverse)

df <- tribble(
    ~Client, ~Gender, ~Race,
    "A", "M", "White",
    "A", NA , "White",
    "B", "F", NA,
    "B", "F", "African American"
  )

df %>% 
  group_by(Client) %>% 
  fill(Gender, Race) %>% 
  distinct()
#> # A tibble: 3 x 3
#> # Groups:   Client [2]
#>   Client Gender Race            
#>   <chr>  <chr>  <chr>           
#> 1 A      M      White           
#> 2 B      F      <NA>            
#> 3 B      F      African American

df %>% 
  group_by(Client) %>% 
  fill(Gender, Race, .direction = "downup") %>% 
  distinct()
#> # A tibble: 2 x 3
#> # Groups:   Client [2]
#>   Client Gender Race            
#>   <chr>  <chr>  <chr>           
#> 1 A      M      White           
#> 2 B      F      African American

Created on 2019-09-05 by the reprex package (v0.3.0)
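
If row order is a concern, a different approach (a sketch, not a replacement for fill) is to collapse each client to one row by taking the first non-missing value per column. The helper `first_known()` and the `n_gender` / `n_race` columns are names I made up here; the counts are only there to flag clients that have more than one distinct non-missing value.

```r
library(dplyr)

df <- tibble::tribble(
  ~Client, ~Gender, ~Race,
  "A", "M", "White",
  "A", NA,  "White",
  "B", "F", NA,
  "B", "F", "African American"
)

# Hypothetical helper: first non-NA value, or a typed NA if all are missing
first_known <- function(x) x[!is.na(x)][1]

one_row <- df %>%
  group_by(Client) %>%
  summarise(
    # Count distinct known values BEFORE overwriting the columns below
    n_gender = n_distinct(Gender, na.rm = TRUE),
    n_race   = n_distinct(Race, na.rm = TRUE),
    Gender   = first_known(Gender),
    Race     = first_known(Race)
  )

one_row
```

A client with `n_gender` or `n_race` greater than 1 has conflicting values, and picking the first one would hide that, so those rows are worth checking by hand.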

dlsweet · September 5, 2019, 3:59pm

I had to specify the directions separately to get it to work, like this:

df %>%
    group_by(Client) %>%
    fill(Gender, Race, .direction = 'up') %>%
    fill(Gender, Race) %>%
    distinct()

When I looked at the documentation for fill it didn't have 'downup' and it kept throwing an error. Thank you for all of your help!
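
A note on ordering, as a sketch with made-up data: the two calls above fill up first and then down, which corresponds to "updown" rather than "downup". When each client has a single known value the result is the same either way, but the two orders can differ if a client has conflicting values. The client "C" and its values below are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical client with two different recorded values and an NA between them
conflict <- tibble(
  Client = c("C", "C", "C"),
  Race   = c("White", NA, "Asian")
)

up_then_down <- conflict %>%
  group_by(Client) %>%
  fill(Race, .direction = "up") %>%
  fill(Race, .direction = "down") %>%
  ungroup()

down_then_up <- conflict %>%
  group_by(Client) %>%
  fill(Race, .direction = "down") %>%
  fill(Race, .direction = "up") %>%
  ungroup()

up_then_down$Race
down_then_up$Race
```

The middle row takes the value from below in the first case and from above in the second, so it is worth checking for conflicting clients before relying on either order.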

mfherman · September 5, 2019, 4:15pm

Ah, looks like the "downup" and "updown" options are only in the dev version of tidyr:

github.com/tidyverse/tidyr

Update to PR #504. Use anonymous function instead of compose

tidyverse:master ← coolbutuseless:fill-downup-updown

opened 10:18AM - 04 Mar 19 UTC

coolbutuseless

+33 -17

Add option to fill() to both fill-down-then-up and fill-up-then-down. (Issue: https://github.com/tidyverse/tidyr/issues/505)

(Note: This is a replacement for PR #504 which I broke oh-so-hard.)

This is to replace a common idiom of mine, i.e.

```r
df %>%
  group_by(group) %>%
  tidyr::fill(value, .direction = 'down') %>%
  tidyr::fill(value, .direction = 'up') %>%
  ungroup()
```

which could become

```r
df %>%
  group_by(group) %>%
  tidyr::fill(value, .direction = 'downup') %>%
  ungroup()
```

Depending upon number of groups and number of variables to replace, the current duplicate call to fill() can be avoided, giving significant speed savings.

### A situation where I do this

I have values only known at some particular time and I need to fill this value both forwards and backwards in time.

### A particular example

I work with clinical trial data, which is often provided in multiple files. In the process of making a data set for analysis, particular information may only be recorded at certain events/times, but need to be filled forward/back in time throughout a related time period.

It is only valid to fill up/down within certain groupings (e.g. subjects, day, part of study) - with lots of subjects and lots of groups, this filling can take a noticeable amount of time. Also filling may be done within different groupings for different variables.

### A simplified concrete example:

``` r
suppressPackageStartupMessages({
  library(dplyr)
})

# Weight only recorded at event_type = 1, but considered
# valid across the entire event_num.
# If 'wt' not defined for a given event num, it may be
# carried forwards from a prior run, or backwards from a following run
df <- tibble::tribble(
  ~subject, ~time, ~event_type, ~event_num, ~wt,
  1 , 1, 0, 1, NA,
  1 , 2, 0, 1, NA,
  1 , 3, 1, 1, 20,
  1 , 4, 0, 1, NA,
  1 , 5, 0, 1, NA,
  1 , 1, 0, 2, NA,
  1 , 2, 0, 2, NA,
  1 , 3, 1, 2, NA,
  1 , 4, 0, 2, NA,
  1 , 5, 0, 2, NA,
  1 , 1, 0, 3, NA,
  1 , 2, 0, 3, NA,
  1 , 3, 1, 3, 30,
  1 , 4, 0, 3, NA,
  1 , 5, 0, 3, NA,
)

# fill wt down/up within the event_num for each subject,
# then down/up within subject only.
df %>%
  group_by(subject, event_num) %>%
  tidyr::fill(wt, .direction = 'down') %>%
  tidyr::fill(wt, .direction = 'up'  ) %>%
  group_by(subject) %>%
  tidyr::fill(wt, .direction = 'down') %>%
  tidyr::fill(wt, .direction = 'up'  ) %>%
  ungroup()
#> # A tibble: 15 x 5
#>    subject  time event_type event_num    wt
#>      <dbl> <dbl>      <dbl>     <dbl> <dbl>
#>  1       1     1          0         1    20
#>  2       1     2          0         1    20
#>  3       1     3          1         1    20
#>  4       1     4          0         1    20
#>  5       1     5          0         1    20
#>  6       1     1          0         2    20
#>  7       1     2          0         2    20
#>  8       1     3          1         2    20
#>  9       1     4          0         2    20
#> 10       1     5          0         2    20
#> 11       1     1          0         3    30
#> 12       1     2          0         3    30
#> 13       1     3          1         3    30
#> 14       1     4          0         3    30
#> 15       1     5          0         3    30
```

Created on 2018-10-24 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).

Your solution works great with the current CRAN version -- if you want to use "downup" you can install the dev version from GitHub.
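
To see whether your installed tidyr already accepts "downup", something like the sketch below should work. `supports_downup()` is a name I made up; it simply tries a tiny fill() call and reports whether it errored. The install lines are commented out so the snippet is safe to run as-is, and they assume you are happy to use the remotes package to install from GitHub.

```r
library(tidyr)

# Hypothetical helper: TRUE if fill() accepts .direction = "downup"
supports_downup <- function() {
  probe <- data.frame(x = c(NA, 1))
  tryCatch({
    fill(probe, x, .direction = "downup")
    TRUE
  }, error = function(e) FALSE)
}

if (!supports_downup()) {
  message("This tidyr does not know 'downup'; use two fill() calls or upgrade.")
  # install.packages("remotes")
  # remotes::install_github("tidyverse/tidyr")
}
```

If the check returns FALSE and upgrading is not an option, the two separate fill() calls remain a fine workaround.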
