how to use a pipe in bind_rows ?

von_olaf · November 6, 2019, 1:29pm

Consider this example

masterdf <- tibble(key = c('a', 'a', 'b'))

db1 <- tibble(key = c('a', 'a', 'c'),
              value = c(1,2,3))

db2 <- tibble(key = c('a', 'a', 'c'),
              value = c(10, 20, 30))

masterdf %>% do(bind_rows(. %>% left_join(db1, by = 'key'),
                          . %>% left_join(db1, by = 'key')))

Error: Argument 1 must be a data frame or a named atomic vector, not a fseq/function

I do not understand what is the issue here. Why can't I use that syntax? I just want to concatenate vertically the two independent left_merges of master with db1 and db2

Any ideas? Thanks!

valeri · November 6, 2019, 2:35pm

Hi @von_olaf,

I think you have over-engineered it. Isn't that what you would like to have?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

masterdf <- tibble(key = c('a', 'a', 'b'))

db1 <- tibble(key = c('a', 'a', 'c'),
                            value = c(1,2,3))

db2 <- tibble(key = c('a', 'a', 'c'),
                            value = c(10, 20, 30))

masterdf <- masterdf %>% 
    left_join(db1, by = 'key') %>% 
    left_join(db2, by = 'key')

masterdf
#> # A tibble: 9 x 3
#>   key   value.x value.y
#>   <chr>   <dbl>   <dbl>
#> 1 a           1      10
#> 2 a           1      20
#> 3 a           2      10
#> 4 a           2      20
#> 5 a           1      10
#> 6 a           1      20
#> 7 a           2      10
#> 8 a           2      20
#> 9 b          NA      NA

Created on 2019-11-06 by the reprex package (v0.3.0)

von_olaf · November 6, 2019, 2:42pm

thanks @valeri but as you can see this does not work well. I expect 4 rows in total (two matches from db1 and two matches from db2) and you have 9. The reason your solution does not work well is that it also matches the matches from the first match Does that make sense?

Thanks!

Yarnabrina · November 6, 2019, 3:47pm

I don't follow why do you expect 2 + 2 = 4 rows.

As far as I understand left_join, the first one should have 5 rows (2 matches corresponding each a and one missing for b), and same for the second one. Hence, I'd expect total 10 rows and that's what the following gives:

bind_rows(left_join(masterdf, db1, by = "key"), left_join(masterdf, db2, by = "key"))

Can you share your expected output? And, why not share a reprex with all library calls?

von_olaf · November 6, 2019, 3:55pm

ha you are right, I a apologize. Yes, I expect 10 rows and my question is how to run the bind_rows you wrote using . %>% instead of explicitely calling masterdf %>%

valeri · November 7, 2019, 8:24am

Not sure whether that brings you closer, but here is another way of doing what @Yarnabrina suggest in a more 'purr'-y way that allows you to do this with an arbitrary list of data frames:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

masterdf <- tibble(key = c('a', 'a', 'b'))

db1 <- tibble(key = c('a', 'a', 'c'),
                            value = c(1,2,3))

db2 <- tibble(key = c('a', 'a', 'c'),
                            value = c(10, 20, 30))


purrr::map_df(.x = list(db1, db2), .f = ~left_join(masterdf, .x, by = 'key') )
#> # A tibble: 10 x 2
#>    key   value
#>    <chr> <dbl>
#>  1 a         1
#>  2 a         2
#>  3 a         1
#>  4 a         2
#>  5 b        NA
#>  6 a        10
#>  7 a        20
#>  8 a        10
#>  9 a        20
#> 10 b        NA

Created on 2019-11-07 by the reprex package (v0.3.0)

Pritishd · November 16, 2019, 6:39am

Hi von_olaf,

The 10 rows that you have requested have duplicates as well. Could you explain on how you will use duplicates in your actual application?

I have put up the following code which will provide you with an output that will be the superset of all 3 data sets as well as one with rows matching key values from masterdf

Superset

masterdf <- tibble(key = c('a', 'a', 'b'))

db1 <- tibble(key = c('a', 'a', 'c'),
              value = c(1,2,3))

db2 <- tibble(key = c('a', 'a', 'c'),
              value = c(10, 20, 30))

DF<-full_join(x=masterdf,y=db1,by="key") %>%
    full_join(y=db2,by="key") %>%
    mutate("master_key"=paste0(key,value.x,value.y),
           "dupl_row"=duplicated(master_key)) %>%
    filter(dupl_row!=TRUE) %>%
    select(-master_key,-dupl_row)

print(DF)

Output of the above code

 A tibble: 6 x 3
  key   value.x value.y
  <chr>   <dbl>   <dbl>
1 a           1      10
2 a           1      20
3 a           2      10
4 a           2      20
5 b          NA      NA
6 c           3      30

Only the values from the key of masterdf

masterdf <- tibble(key = c('a', 'a', 'b'))

db1 <- tibble(key = c('a', 'a', 'c'),
              value = c(1,2,3))

db2 <- tibble(key = c('a', 'a', 'c'),
              value = c(10, 20, 30))

DF<-full_join(x=masterdf,y=db1,by="key") %>%
    full_join(y=db2,by="key") %>%
    mutate("master_key"=paste0(key,value.x,value.y),
           "dupl_row"=duplicated(master_key)) %>%
    filter(dupl_row!=TRUE) %>%
    select(-master_key,-dupl_row) %>%
    filter(key %in% c('a', 'a', 'b'))

print(DF)

Output of the above code

# A tibble: 5 x 3
  key   value.x value.y
  <chr>   <dbl>   <dbl>
1 a           1      10
2 a           1      20
3 a           2      10
4 a           2      20
5 b          NA      NA

Warm Regards,
Pritish

system · December 7, 2019, 6:39am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.