How does left_join, dplyr self-join with filter work ?

Hi All,
My question is inspired by this post:

I would like to understand how this two dataframes (one of them is a subset from the other) can be joined together if a first dataframe has got 15 rows:

vas = data_frame(
  user = rep(letters[1:n_users], each = 3),
  group = rep(c("baseline", "early", "late" ),n_users),
  vas = round(rgamma(n_users*3, 10,1.4 ))

and the other has got 5 rows:

vas_baseline <- vas_1 %>% filter(group == "baseline")


vas_1 = vas %>%
  mutate(vas = vas * 2)

vas_with_diff <-  vas_1 %>%
  left_join(vas_baseline, by = c("user")) %>%
  mutate(vas_difference = vas.x - vas.y)

We finally got a new dataframe with 15 rows and 6 columns (no NAs - surprisingly to me?):

I apologize fort a naive question, but would it be possible to see where all cells (values) from
"vas_baseline" are situated now in "vas_with_diff" ? Sort of like maybe conditional formatting in an Excel way.
What would have happened if "vas_baseline" had more columns than "vas_1" ?


its no na's because you only joined 'by user' rather than by user & group. So you get a value wherever there is a user match, which is everywhere

The generation of NA values as a result of a join is dependent on the joining keys, not the number of rows in the data frames being joined.

In the example, vas_1 and vas_baseline are being left joined using only the user variable. So, the variables in the right data frame (vas_baseline) will be joined to vas_1 wherever a match is found for user. Since both data frames contain all 5 possibilities (a to e), there are no NA values in the result.

If instead you added group to the join criteria, then you would get NA values, since the vas_baseline data frame contains only one group type.

left_join(vas_1, vas_baseline, by = c("user", "group"))
# A tibble: 15 x 4
   user  group    vas.x vas.y
   <chr> <chr>    <dbl> <dbl>
 1 a     baseline    14    14
 2 a     early       14    NA
 3 a     late        22    NA
 4 b     baseline    12    12
 5 b     early       10    NA
 6 b     late        18    NA
 7 c     baseline    22    22
 8 c     early       14    NA
 9 c     late        16    NA
10 d     baseline    16    16
11 d     early       10    NA
12 d     late        10    NA
13 e     baseline    16    16
14 e     early       16    NA
15 e     late        26    NA

I don't think there is a way to highlight matched rows (if I understand correctly what you mean). The rows which returned NA for the variables in the right data frame are the unmatched rows.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

Thank you All for your reply,
I have tried in Excel and below there is a result.
I hope that is correct:

If this is not correct, please let me know.
Or even this:

I wish I could do it in R.