Join selecting 3 variables

Hi there, I have 2 dataframes and I would like to join them by the 3 variables. (make, enfine and type) How to do a join of 3 variables at the same time?


df1<- data.frame(make= c("renault", "renault", "renault"), engine = c("1.6", "1.6", "1.5"), type = c("megane luxury", "megane priviledge", "megane luxe"))
df2<-  data.frame(make= c("renault", "renault", "renault", "renault"),engine = c("1.6", "1.6", "1.6", "1.5"), type = c("megane luxury", "megane base", "megane priviledge", "megane base"))

df3 <- rbind(df1, df2)

Hi there,
I need do a Join, not a rbind.
The result should be lines with coincidences in the make, engine and type variables!

I would suggest the join functions within dplyr. full_join will have all rows from both datasets; left_join will only have rows from first dataset, and right_join will only have rows from second dataset.

df1<- data.frame(make= c("renault", "renault", "renault"), engine = c("1.6", "1.6", "1.5"), type = c("megane luxury", "megane priviledge", "megane luxe"), Var1=1:3) 
df2<-  data.frame(make= c("renault", "renault", "renault", "renault"),engine = c("1.6", "1.6", "1.6", "1.5"), type = c("megane luxury", "megane base", "megane priviledge", "megane base"), Var2=1:4)

library(tidyverse)

df3 <- df1 %>%
  full_join(df2, by=c("make", "engine", "type"))

df3
#>      make engine              type Var1 Var2
#> 1 renault    1.6     megane luxury    1    1
#> 2 renault    1.6 megane priviledge    2    3
#> 3 renault    1.5       megane luxe    3   NA
#> 4 renault    1.6       megane base   NA    2
#> 5 renault    1.5       megane base   NA    4

Created on 2020-09-28 by the reprex package (v0.3.0)

Hi, that´s looks what I need. Now I have more complex example.
In this case, when I add the lenght variable, I get 0 rows, because there are no coincidences. I would like to add a tolerance of 50 units in the lenght variable. Is possible to add tolerances?

df1<- data.frame(make= c("renault", "renault", "renault"), engine = c("1.6", "1.6", "1.5"), type = c("megane luxury", "megane priviledge", "megane luxe"),gearbox= c("dsg", "edc", "manual"),
                 lenght = c("2000", "2500", "3000")) 

df2<-  data.frame(make= c("renault", "renault", "renault", "renault"),engine = c("1.6", "1.6", "1.6", "1.5"), type = c("megane luxury", "megane base", "megane priviledge", "megane base"), 
                  lenght = c("2030", "5000", "2502", "4000"), ID=1:4)

df1%>%
  inner_join(df2, by = c("make", "engine", "type","lenght"))

You can get that kind of result with fuzzyjoin, you have to define the functions to apply for each by column: for the make, engine and type columns you can use the == function as in a normal join (two rows match if they are equal), and for lenght you can look at the difference between the values.

df1<- data.frame(make= c("renault", "renault", "renault"), engine = c("1.6", "1.6", "1.5"), type = c("megane luxury", "megane priviledge", "megane luxe"),gearbox= c("dsg", "edc", "manual"),
                 lenght = c(2000, 2500, 3000)) 

df2<-  data.frame(make= c("renault", "renault", "renault", "renault"),engine = c("1.6", "1.6", "1.6", "1.5"), type = c("megane luxury", "megane base", "megane priviledge", "megane base"), 
                  lenght = c(2030, 5000, 2502, 4000), ID=1:4)
df1
df2


fuzzyjoin::fuzzy_inner_join(df1, df2,
                            by=c("make", "engine", "type","lenght"),
                            list(make=`==`,
                                 engine=`==`,
                                 type=`==`,
                                 lenght=function(x1,x2){abs(x1-x2) <= 50}))

Note that I removed the quotation marks ("") in the definition of the data.frames so that the column lenght is numeric and not string.

(PS: just in case that matters, the correct English spelling is "length")

Hi Alexis, thanksa lot. I have a problem when I try to aply some kind of tolerance for a new column named "wheelbase". It´s like I can not aply 2 functions for fuzzyjoin

``` r
df1<- data.frame(make= c("renault", "renault", "renault"), engine = c("1.6", "1.6", "1.5"), type = c("megane luxury", "megane priviledge", "megane luxe"),gearbox= c("dsg", "edc", "manual"),
                 
                 lenght = c(2010, 2500, 3000), wheelbase= c(3500:3500)) 



df2<-  data.frame(make= c("renault", "renault", "renault", "renault"),engine = c("1.6", "1.6", "1.6", "1.5"), type = c("megane luxury", "megane base", "megane priviledge", "megane base"), 
                  
                  lenght = c(2030, 5000, 2502, 4000), ID=1:4, wheelbase = c(3580:3580))

When I aply the below script I get an error:

fuzzyjoin::fuzzy_inner_join(df1, df2,

                        by=c("make", "engine", "type", "wheelbase", "lenght"),
                        
                        list(make=`==`,
                             
                             engine=`==`,
                             
                             type=`==`,
                             
                             wheelbase= function(x1,x2){abs(x1-x2) <= 60}),
                        
                        lenght = function(x1,x2){abs(x1-x2) <= 60}))

Created on 2020-09-29 by the reprex package (v0.3.0)

Ok Now I have seen the mistake in the parenthesis.
All is right :slight_smile: Thanks a lot

fuzzyjoin::fuzzy_inner_join(df1, df2,

                        by=c("make", "engine", "type", "wheelbase", "lenght"),
                        
                        list(make=`==`,
                             
                             engine=`==`,
                             
                             type=`==`,
                             
                             wheelbase= function(x1,x2){abs(x1-x2) <= 80},
                        
                        lenght = function(x1,x2){abs(x1-x2) <= 60}))

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.