left_join() does not provide the expected results

Hi, everyone
When I combine two tables (shown in as follows:) by using a simple criteria, I couldn't get the expected results.
Table 1:

df <- data.frame(time = c("2015-01-08 10:30:00", "2015-01-08 12:30:00", 
    "2015-01-08 13:00:00", "2015-01-11 10:00:00", "2015-01-11 11:30:00"), 
    NDVI = c(0.704001825073964, 0.664227174588314, 0.646672010715424, 
        0.68403345905421, 0.704133275285083))

Table 2:

df2 <- data.frame(time = c("2015-02-08 08:00:00", "2015-02-08 08:30:00", 
    "2015-02-08 09:00:00", "2015-02-08 09:30:00", "2015-02-08 10:00:00", 
    "2015-02-08 10:30:00", "2015-02-08 11:00:00", "2015-02-08 11:30:00", 
    "2015-02-08 12:00:00", "2015-02-08 12:30:00", "2015-02-08 13:00:00", 
    "2015-02-08 13:30:00", "2015-02-08 14:00:00", "2015-02-08 14:30:00", 
    "2015-02-08 15:00:00", "2015-02-08 15:30:00", "2015-02-08 16:00:00", 
    "2015-02-08 16:30:00", "2015-02-08 17:00:00", "2015-02-08 17:30:00"), 
    GPP = c(1.3473421335, 1.498190105, 1.6670079231, 1.795284301, 
        2.0013819933, 2.126760006, 2.8418947458, 2.8739453554, 
        5.3966772556, 3.7611236572, 7.0905163288, 5.6668725014, 
        8.6729397774, 9.7592945099, 9.3278844357, 6.486092329, 
        4.5773762465, 4.2562910318, 2.3596098423, 1.0550425053))

I code as follow:

result <- df %>%
left_join(df2, by=c("time" = "Date_Time"))

The result shows:

ata.frame(time = c("2015-01-08 10:30:00", "2015-01-08 12:30:00", 
    "2015-01-08 13:00:00", "2015-01-11 10:00:00", "2015-01-11 11:30:00"), 
    NDVI = c(0.704001825073964, 0.664227174588314, 0.646672010715424, 
        0.68403345905421, 0.704133275285083), GPP = c(NA, NA, 
        NA, NA, NA))
#>                  time      NDVI GPP
#> 1 2015-01-08 10:30:00 0.7040018  NA
#> 2 2015-01-08 12:30:00 0.6642272  NA
#> 3 2015-01-08 13:00:00 0.6466720  NA
#> 4 2015-01-11 10:00:00 0.6840335  NA
#> 5 2015-01-11 11:30:00 0.7041333  NA

I have no idea what the incorrect code is.
Could you please help me to correct it?
Thanks in advance.

There are multiple issues.

  1. df2 do not have a column named Date_Time. There is no way that code generated the output that you shared at the end of your question. It should fail with this error.

Error: Join columns must be present in data.
x Problem with Date_Time.

  1. The join of time column should lead to the result you got. The left table, df has times on 8th January, while right table has times on 8th February, and these are obviously different. So the join lead to no matching row in right table, and hence all values are coming as missing.
  2. This is not a coding issue, but I just wanted to point out that your code uses left_join, while thread title says inner_join. If you actually wanted to do inner_join on this data, it should lead you to zero rows.

Hope this helps.


NOTE

As you were told in last thread, please share minimal reproducible examples when asking a question, whenever possible. That means a stand alone code (meaning it will include library imports, data definitions and actual codes) which can generate same error that you are facing on any machine, and it should be such that it is small enough to isolate the issue instead of a huge code. Also, it needs to be shared in a copy-paste friendly manner, so anyone can copy paste from your thread and then try it on their own systems. What you shared is not reproducible. There is no library imports, and there is nothing called ata.frame in R. Please go through the example in reprex FAQ, it's quite easy to follow if you just go through it.

I understand which library you are using and it's a copying issue, and I can fix it, but we would expect you to do these. All of us in community are trying to help you in your issues, surely you can help them by solving these yourself?

(Just sharing as a feedback so that you can improve your next threads, hope you will take it frankly as such. If it still offends you, I apologise.)


Edit (in response to #3)

Can you please provide a proper reproducible example showing how are you getting these results? It's unexpected, and I get different if I run same code as you shared.

The result I get is just 2 NA corresponding to last two rows. This is correct, as last two entries in df$time are on 11th January. Hence these don't match times of 8th January as present in df2.

Hope this helps.


I'm so confused on your edit. If you run the following code in a fresh R session, are you not getting the expected results?

library(dplyr)

df_1 <- data.frame(time = c("2015-01-08 10:30:00", "2015-01-08 12:30:00", "2015-01-08 13:00:00", "2015-01-11 10:00:00", "2015-01-11 11:30:00"),
                   NDVI = c(0.704001825073964, 0.664227174588314, 0.646672010715424, 0.68403345905421, 0.704133275285083))

df_2 <- data.frame(time = c("2015-01-08 07:00:00", "2015-01-08 07:30:00", "2015-01-08 08:00:00", "2015-01-08 08:30:00", "2015-01-08 09:00:00", "2015-01-08 09:30:00", "2015-01-08 10:00:00", "2015-01-08 10:30:00", "2015-01-08 11:00:00", "2015-01-08 11:30:00", "2015-01-08 12:00:00", "2015-01-08 12:30:00", "2015-01-08 13:00:00", "2015-01-08 13:30:00", "2015-01-08 14:00:00", "2015-01-08 14:30:00", "2015-01-08 15:00:00", "2015-01-08 15:30:00", "2015-01-08 16:00:00", "2015-01-08 16:30:00"),
                   GPP = c(0, 1.5599170625, 2.5339026749, 6.956438303, 6.8692502975, 7.6814880371, 7.9598636627, 8.0170240402, 9.0688500404, 8.6675083637, 7.0284771919, 11.4835240841, 8.2856509686, 0, 5.3423130512, 5.9049360752, 6.8801484108, 7.2114927769, 29.7322113514, 3.9361767769))

obtained_result <- df_1 %>%
    inner_join(df_2, by="time")

expected_result <- data.frame(time = c("2015-01-08 10:30:00", "2015-01-08 12:30:00", "2015-01-08 13:00:00"),
                              NDVI = c(0.704001825073964, 0.664227174588314, 0.646672010715424),
                              GPP = c(8.017024, 11.483524, 8.285651))

all.equal(target=expected_result,
          current=obtained_result)

sessionInfo()
Outputs
> library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

> 
> df_1 <- data.frame(time = c("2015-01-08 10:30:00", "2015-01-08 12:30:00", "2015-01-08 13:00:00", "2015-01-11 10:00:00", "2015-01-11 11:30:00"),
+                    NDVI = c(0.704001825073964, 0.664227174588314, 0.646672010715424, 0.68403345905421, 0.704133275285083))
> 
> df_2 <- data.frame(time = c("2015-01-08 07:00:00", "2015-01-08 07:30:00", "2015-01-08 08:00:00", "2015-01-08 08:30:00", "2015-01-08 09:00:00", "2015-01-08 09:30:00", "2015-01-08 10:00:00", "2015-01-08 10:30:00", "2015-01-08 11:00:00", "2015-01-08 11:30:00", "2015-01-08 12:00:00", "2015-01-08 12:30:00", "2015-01-08 13:00:00", "2015-01-08 13:30:00", "2015-01-08 14:00:00", "2015-01-08 14:30:00", "2015-01-08 15:00:00", "2015-01-08 15:30:00", "2015-01-08 16:00:00", "2015-01-08 16:30:00"),
+                    GPP = c(0, 1.5599170625, 2.5339026749, 6.956438303, 6.8692502975, 7.6814880371, 7.9598636627, 8.0170240402, 9.0688500404, 8.6675083637, 7.0284771919, 11.4835240841, 8.2856509686, 0, 5.3423130512, 5.9049360752, 6.8801484108, 7.2114927769, 29.7322113514, 3.9361767769))
> 
> obtained_result <- df_1 %>%
+     inner_join(df_2, by="time")
> 
> expected_result <- data.frame(time = c("2015-01-08 10:30:00", "2015-01-08 12:30:00", "2015-01-08 13:00:00"),
+                               NDVI = c(0.704001825073964, 0.664227174588314, 0.646672010715424),
+                               GPP = c(8.017024, 11.483524, 8.285651))
> 
> all.equal(target=expected_result,
+           current=obtained_result)
[1] TRUE
Session Info
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.5.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.1.1
> 
> sessionInfo(package="dplyr")
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.5.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] C

attached base packages:
character(0)

other attached packages:
[1] dplyr_1.0.7

loaded via a namespace (and not attached):
[1] compiler_4.1.1  graphics_4.1.1  tools_4.1.1     utils_4.1.1    
[5] grDevices_4.1.1 stats_4.1.1     datasets_4.1.1  methods_4.1.1  
[9] base_4.1.1 

Here I've used inner_join as your expected results has just the common rows. But if you use left_join, it should NOT be all NA. If you continue to get so, please provide a reprex along with session information (run sessionInfo() on R console) for others to help, as it's beyond my understanding. Sorry.

1 Like

Oh, I made a mistake to create reprex for df2.
Let me correct here:

df2 <- data.frame(time = c("2015-01-08 07:00:00", "2015-01-08 07:30:00", 
    "2015-01-08 08:00:00", "2015-01-08 08:30:00", "2015-01-08 09:00:00", 
    "2015-01-08 09:30:00", "2015-01-08 10:00:00", "2015-01-08 10:30:00", 
    "2015-01-08 11:00:00", "2015-01-08 11:30:00", "2015-01-08 12:00:00", 
    "2015-01-08 12:30:00", "2015-01-08 13:00:00", "2015-01-08 13:30:00", 
    "2015-01-08 14:00:00", "2015-01-08 14:30:00", "2015-01-08 15:00:00", 
    "2015-01-08 15:30:00", "2015-01-08 16:00:00", "2015-01-08 16:30:00"), 
    GPP = c(0, 1.5599170625, 2.5339026749, 6.956438303, 6.8692502975, 
        7.6814880371, 7.9598636627, 8.0170240402, 9.0688500404, 
        8.6675083637, 7.0284771919, 11.4835240841, 8.2856509686, 
        0, 5.3423130512, 5.9049360752, 6.8801484108, 7.2114927769, 
        29.7322113514, 3.9361767769))

Even after correcting the dataset and code as you suggested, the result is the same as follow:

result <-  df %>%
           left_join(df2, by=c("time"))

result <- data.frame(time = c("2015-01-08 10:30:00", "2015-01-08 12:30:00", 
    "2015-01-08 13:00:00", "2015-01-11 10:00:00", "2015-01-11 11:30:00"), 
    NDVI = c(0.704001825073964, 0.664227174588314, 0.646672010715424, 
        0.68403345905421, 0.704133275285083), GPP = c(NA, NA, 
        NA, NA, NA))

The expected results are


result <- data.frame(time = c("2015-01-08 10:30:00", "2015-01-08 12:30:00", 
    "2015-01-08 13:00:00"), 
    NDVI = c(0.704001825073964, 0.664227174588314, 0.646672010715424), GPP = c(8.017024, 11.483524, 8.285651))

I have no idea what happened to my code.

Thanks again, @Yarnabrina !

No I didn't get the expected result, even though I used inner_join or left_join.

Here is the sessionInform ()

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] plyr_1.8.6 forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 readr_1.3.1
[6] tidyr_1.1.2 tibble_3.0.3 tidyverse_1.3.0 lubridate_1.7.9 ggplot2_3.3.2
[11] dplyr_1.0.2

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pracma_2.3.3 cellranger_1.1.0 pillar_1.4.6
[5] compiler_4.0.2 dbplyr_1.4.4 tools_4.0.2 jsonlite_1.7.1
[9] lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.7
[13] reprex_0.3.0 cli_2.0.2 DBI_1.1.0 rstudioapi_0.11
[17] yaml_2.2.1 haven_2.3.1 xml2_1.3.2 httr_1.4.2
[21] withr_2.3.0 fs_1.5.0 hms_0.5.3 generics_0.1.0
[25] vctrs_0.3.4 grid_4.0.2 tidyselect_1.1.0 glue_1.4.2
[29] R6_2.4.1 fansi_0.4.1 readxl_1.3.1 modelr_0.1.8
[33] blob_1.2.1 magrittr_1.5 scales_1.1.1 backports_1.1.10
[37] ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 colorspace_1.4-1
[41] stringi_1.5.3 munsell_0.5.0 broom_0.7.0 crayon_1.3.4

Dear @Yarnabrina ,
Now I found the point! Due to different time zone, the strange results came out. Now I could fix it.
Thanks indeed!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.