Joining dataframes

Hello, quite new to R

I have a query regarding joining. I have two data frames with 3 columns with exactly the same number of rows.

steps_hour - Id, ActivityHour, TotalSteps
cals_hour - Id, ActivityHour, Calories
Here is an example (although I have tried specifying the join using by = "Id" and "ActivityHour" too.

step_cal_2 <- 
  steps_hour %>% full_join(cals_hour)

I have tried joining them using all join functions (inner, left, right, full) on both the Id column and the ActivityHour column.I have tried converting the Activity hour to datetime using Lubridate. However, the whichever way I do it, the first columns is always duplicates itself multiple times:

According to the dplyr reference material online it says "For all joins, rows will be duplicated if one or more rows in the x matches with multiple rows in y" . I can't think of a solution around this apart from to delete the ActivityHours rows but then the data will be all but useless.

I have spent over a 12 hours on this now and will just have to use copy and paste in Excel as I have a deadline. But I am curious how to solve this so thanks in advance.

Can you post a small part of each data frame? You can use code like

dput(head(steps_hour, 15))

and post the output of that. Tune the number of rows, 15 in the case above, so that a join will result in a few rows. Post the output of the code between two rows that have three back ticks, like this:
```
Paste output here
```

Hi, yes, here is the output of the two individual datasets:

> dput(head(steps_hour, 15))
structure(list(Id = c(1503960366, 1503960366, 1503960366, 1503960366, 
1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 
1503960366, 1503960366, 1503960366, 1503960366, 1503960366), 
    ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", 
    "4/12/2016 2:00:00 AM", "4/12/2016 3:00:00 AM", "4/12/2016 4:00:00 AM", 
    "4/12/2016 5:00:00 AM", "4/12/2016 6:00:00 AM", "4/12/2016 7:00:00 AM", 
    "4/12/2016 8:00:00 AM", "4/12/2016 9:00:00 AM", "4/12/2016 10:00:00 AM", 
    "4/12/2016 11:00:00 AM", "4/12/2016 12:00:00 PM", "4/12/2016 1:00:00 PM", 
    "4/12/2016 2:00:00 PM"), StepTotal = c(373, 160, 151, 0, 
    0, 0, 0, 0, 250, 1864, 676, 360, 253, 221, 1166)), row.names = c(NA, 
-15L), class = c("tbl_df", "tbl", "data.frame"))

dput(head(cals_hour, 15))
structure(list(Id = c(1503960366, 1503960366, 1503960366, 1503960366, 
1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 
1503960366, 1503960366, 1503960366, 1503960366, 1503960366), 
    ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", 
    "4/12/2016 2:00:00 AM", "4/12/2016 3:00:00 AM", "4/12/2016 4:00:00 AM", 
    "4/12/2016 5:00:00 AM", "4/12/2016 6:00:00 AM", "4/12/2016 7:00:00 AM", 
    "4/12/2016 8:00:00 AM", "4/12/2016 9:00:00 AM", "4/12/2016 10:00:00 AM", 
    "4/12/2016 11:00:00 AM", "4/12/2016 12:00:00 PM", "4/12/2016 1:00:00 PM", 
    "4/12/2016 2:00:00 PM"), Calories = c(81, 61, 59, 47, 48, 
    48, 48, 47, 68, 141, 99, 76, 73, 66, 110)), row.names = c(NA, 
-15L), class = c("tbl_df", "tbl", "data.frame"))
> ```

Like this?

library(dplyr)

DF1 <- structure(list(Id = c(1503960366, 1503960366, 1503960366, 1503960366, 
                             1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 
                             1503960366, 1503960366, 1503960366, 1503960366, 1503960366), 
                      ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", 
                                       "4/12/2016 2:00:00 AM", "4/12/2016 3:00:00 AM", "4/12/2016 4:00:00 AM", 
                                       "4/12/2016 5:00:00 AM", "4/12/2016 6:00:00 AM", "4/12/2016 7:00:00 AM", 
                                       "4/12/2016 8:00:00 AM", "4/12/2016 9:00:00 AM", "4/12/2016 10:00:00 AM", 
                                       "4/12/2016 11:00:00 AM", "4/12/2016 12:00:00 PM", "4/12/2016 1:00:00 PM", 
                                       "4/12/2016 2:00:00 PM"), 
                      StepTotal = c(373, 160, 151, 0, 
                                    0, 0, 0, 0, 250, 1864, 676, 360, 253, 221, 1166)), row.names = c(NA, 
                                                                                                     -15L), class = c("tbl_df", "tbl", "data.frame"))


DF2 <- structure(list(Id = c(1503960366, 1503960366, 1503960366, 1503960366, 
                             1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, 
                             1503960366, 1503960366, 1503960366, 1503960366, 1503960366), 
                      ActivityHour = c("4/12/2016 12:00:00 AM", "4/12/2016 1:00:00 AM", 
                                       "4/12/2016 2:00:00 AM", "4/12/2016 3:00:00 AM", "4/12/2016 4:00:00 AM", 
                                       "4/12/2016 5:00:00 AM", "4/12/2016 6:00:00 AM", "4/12/2016 7:00:00 AM", 
                                       "4/12/2016 8:00:00 AM", "4/12/2016 9:00:00 AM", "4/12/2016 10:00:00 AM", 
                                       "4/12/2016 11:00:00 AM", "4/12/2016 12:00:00 PM", "4/12/2016 1:00:00 PM", 
                                       "4/12/2016 2:00:00 PM"), 
                      Calories = c(81, 61, 59, 47, 48, 
                                   48, 48, 47, 68, 141, 99, 76, 73, 66, 110)), row.names = c(NA, 
                                                                                             -15L), class = c("tbl_df", "tbl", "data.frame"))
AllDat <- inner_join(DF1,DF2,by=c("Id","ActivityHour"))
AllDat
#> # A tibble: 15 x 4
#>            Id ActivityHour          StepTotal Calories
#>         <dbl> <chr>                     <dbl>    <dbl>
#>  1 1503960366 4/12/2016 12:00:00 AM       373       81
#>  2 1503960366 4/12/2016 1:00:00 AM        160       61
#>  3 1503960366 4/12/2016 2:00:00 AM        151       59
#>  4 1503960366 4/12/2016 3:00:00 AM          0       47
#>  5 1503960366 4/12/2016 4:00:00 AM          0       48
#>  6 1503960366 4/12/2016 5:00:00 AM          0       48
#>  7 1503960366 4/12/2016 6:00:00 AM          0       48
#>  8 1503960366 4/12/2016 7:00:00 AM          0       47
#>  9 1503960366 4/12/2016 8:00:00 AM        250       68
#> 10 1503960366 4/12/2016 9:00:00 AM       1864      141
#> 11 1503960366 4/12/2016 10:00:00 AM       676       99
#> 12 1503960366 4/12/2016 11:00:00 AM       360       76
#> 13 1503960366 4/12/2016 12:00:00 PM       253       73
#> 14 1503960366 4/12/2016 1:00:00 PM        221       66
#> 15 1503960366 4/12/2016 2:00:00 PM       1166      110

Created on 2021-12-11 by the reprex package (v2.0.1)

Since

DF1$Id == DF2$Id
DF1$ActivityHour == DF@$ActivityHour
DF1$Calories <-   DF2$Calories

# or

DF3 <-   cbind(DF1, DF2[, 3])

That's amazing, thank you so much. This works perfectly.
I didn't think to join on two variables.
I shall mark as solved,
All the best,
Daniel

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.