Associate info between two data.frames

PalomaLlorente · October 29, 2019, 12:29pm

I have a data.frame (called "segments" in the example) with information attached to a name (the column "segments$seg_ID"). Each name has several pieces of information stored in other columns. In another data.frame (called "trayectorias"), I have those same names, now in a column called "trayectorias$seg_ID_trayectorias". What I would like to do is associate the info from the 1st data.frame ("segments") with the 2nd one, depending on the name. Important to note: in the 2nd data.frame ("trayectorias"), the "segments$seg_ID" names are repeated most of the time. I have around 24000 rows in the "segments" data.frame and about 95000 in the "trayectorias" one.

To present my issue I have created a short reprex:

segments<-data.frame(stringsAsFactors=FALSE,
seg_ID = c("%%EDH_WSN", "%DIPA_PITES", "%DIPI_LADAT",
"%DRSI_SITET", "*%200_BAKER", "**EG1_MID",
"**TNT_IBUGO"),
Rumbo_circular = c(53, 297, 299, 335, 137, 321, 336),
)

trayectorias<-data.frame(stringsAsFactors=FALSE,
seg_ID_trayectorias = c("%%EDH_WSN", "%DIPA_PITES", "%%EDH_WSN", "%DRSI_SITET",
"%200_BAKER", "**EG1_MID", "%200_BAKER"),
)

The desired result would be:

solution<-data.frame(stringsAsFactors=FALSE,
seg_ID_trayectorias = c("%%EDH_WSN", "%DIPA_PITES", "%%EDH_WSN", "%DRSI_SITET",
"%200_BAKER", "**EG1_MID", "%200_BAKER"),
Rumbo_circular = c(53, 297, 53, 335, 137, 321, 137)
)

As can be seen, the "segments$seg_ID" names were searched for in the "trayectorias$seg_ID_trayectorias" column, and the info from the "segments" data.frame ("segments$Rumbo_circular") was associated with each name.

Yarnabrina · October 29, 2019, 1:10pm

If I understand correctly, this is a wrong description of the problem. It's actually the reverse.

Also, this is not a reprex. A reprex is supposed to be reproducible, whereas your code produces these errors:

Error in data.frame(stringsAsFactors = FALSE, seg_ID = c("%%EDH_WSN", :
argument is missing, with no default

Error in data.frame(stringsAsFactors = FALSE, seg_ID_trayectorias = c("%%EDH_WSN", :
argument is missing, with no default

Finally, do you want to match "%200_BAKER" with "*%200_BAKER", or is the missing * a typo? I assumed that this match is not required, and hence I get NA for those cases.
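A quick way to see which names have no counterpart is to compare the two key columns directly. This is a minimal sketch in base R using the data as originally posted (with the trailing commas removed); the variable name `unmatched` is only illustrative.

```r
# Data as in the original post, without the trailing commas
segments <- data.frame(stringsAsFactors = FALSE,
                       seg_ID = c("%%EDH_WSN", "%DIPA_PITES", "%DIPI_LADAT",
                                  "%DRSI_SITET", "*%200_BAKER", "**EG1_MID",
                                  "**TNT_IBUGO"),
                       Rumbo_circular = c(53, 297, 299, 335, 137, 321, 336))

trayectorias <- data.frame(stringsAsFactors = FALSE,
                           seg_ID_trayectorias = c("%%EDH_WSN", "%DIPA_PITES",
                                                   "%%EDH_WSN", "%DRSI_SITET",
                                                   "%200_BAKER", "**EG1_MID",
                                                   "%200_BAKER"))

# Unique names in trayectorias that do not appear in segments
unmatched <- setdiff(trayectorias$seg_ID_trayectorias, segments$seg_ID)
unmatched
#> [1] "%200_BAKER"
```

Any name listed here will get NA after a join, so it is worth running this check on the full data before joining.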

You can use dplyr::left_join as below:

segments <- data.frame(stringsAsFactors = FALSE,
                       seg_ID = c("%%EDH_WSN", "%DIPA_PITES", "%DIPI_LADAT", "%DRSI_SITET", "*%200_BAKER", "**EG1_MID", "**TNT_IBUGO"),
                       Rumbo_circular = c(53, 297, 299, 335, 137, 321, 336))

trayectorias<-data.frame(stringsAsFactors = FALSE,
                         seg_ID_trayectorias = c("%%EDH_WSN", "%DIPA_PITES", "%%EDH_WSN", "%DRSI_SITET", "%200_BAKER", "**EG1_MID", "%200_BAKER"))

dplyr::left_join(x = trayectorias,
                 y = segments,
                 by = c("seg_ID_trayectorias" = "seg_ID"))
#>   seg_ID_trayectorias Rumbo_circular
#> 1           %%EDH_WSN             53
#> 2         %DIPA_PITES            297
#> 3           %%EDH_WSN             53
#> 4         %DRSI_SITET            335
#> 5          %200_BAKER             NA
#> 6           **EG1_MID            321
#> 7          %200_BAKER             NA

Created on 2019-10-29 by the reprex package (v0.3.0)
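If adding a dplyr dependency is not desirable, the same lookup can be sketched in base R with match(). This is an alternative to the join above, not something proposed in the thread. It assumes each name appears only once in "segments", because match() returns the position of the first match only.

```r
segments <- data.frame(stringsAsFactors = FALSE,
                       seg_ID = c("%%EDH_WSN", "%DIPA_PITES", "%DIPI_LADAT",
                                  "%DRSI_SITET", "*%200_BAKER", "**EG1_MID",
                                  "**TNT_IBUGO"),
                       Rumbo_circular = c(53, 297, 299, 335, 137, 321, 336))

trayectorias <- data.frame(stringsAsFactors = FALSE,
                           seg_ID_trayectorias = c("%%EDH_WSN", "%DIPA_PITES",
                                                   "%%EDH_WSN", "%DRSI_SITET",
                                                   "%200_BAKER", "**EG1_MID",
                                                   "%200_BAKER"))

# For every row of trayectorias, find the row of segments with the same name
# (NA when the name is not found in segments)
idx <- match(trayectorias$seg_ID_trayectorias, segments$seg_ID)

# Use those positions to copy the matching values across
trayectorias$Rumbo_circular <- segments$Rumbo_circular[idx]
trayectorias$Rumbo_circular
#> [1]  53 297  53 335  NA 321  NA
```

The row order of "trayectorias" is preserved, and unmatched names get NA, as in the left join.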

PalomaLlorente · October 30, 2019, 8:36am

Sorry, there was a mistake in the reprex: I unintentionally added a comma. Now I think it's correct:

segments<-data.frame(stringsAsFactors=FALSE,
      seg_ID = c("%%EDH_WSN", "%DIPA_PITES", "%DIPI_LADAT",
                 "%DRSI_SITET", "*%200_BAKER", "**EG1_MID","**TNT_IBUGO"),
      
      Rumbo_circular = c(53, 297, 299, 335, 137, 321, 336)
  )

trayectorias<-data.frame(stringsAsFactors=FALSE,
   seg_ID_trayectorias = c("%%EDH_WSN", "%DIPA_PITES", "%%EDH_WSN", "%DRSI_SITET",
                    "*%200_BAKER", "**EG1_MID", "*%200_BAKER")
  )

solution<-data.frame(stringsAsFactors=FALSE,
     segmentos_ID = c("%%EDH_WSN", "%DIPA_PITES", "%%EDH_WSN", "%DRSI_SITET",
                      "*%200_BAKER", "**EG1_MID", "*%200_BAKER"),
   Rumbo_circular = c(53, 297, 53, 335, 137, 321, 137)
)

I also added the "*" that was missing in the "solution" data.frame ("*%200_BAKER"), as that was also a mistake.
Let me also describe my problem again: I have a list of names (called seg_ID) in a data.frame called "segments". This data.frame has info that describes each name, namely the "Rumbo" column. In another data.frame, called "trayectorias", I have lots of names (now called seg_ID_trayectorias to avoid having 2 columns with the same name). All of those names are included in the data.frame "segments", but they are sometimes repeated several times. What I would like to do is write code that associates the "Rumbo" column with each name in the "trayectorias" data.frame.

I hope I have now explained my issue better. Sorry for the mistakes, and thank you so much for the help.
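With the corrected data, the dplyr::left_join call suggested earlier in the thread appears to give exactly the requested result. This is a minimal sketch, assuming dplyr is installed; the variable name `result` is only illustrative.

```r
segments <- data.frame(stringsAsFactors = FALSE,
                       seg_ID = c("%%EDH_WSN", "%DIPA_PITES", "%DIPI_LADAT",
                                  "%DRSI_SITET", "*%200_BAKER", "**EG1_MID",
                                  "**TNT_IBUGO"),
                       Rumbo_circular = c(53, 297, 299, 335, 137, 321, 336))

trayectorias <- data.frame(stringsAsFactors = FALSE,
                           seg_ID_trayectorias = c("%%EDH_WSN", "%DIPA_PITES",
                                                   "%%EDH_WSN", "%DRSI_SITET",
                                                   "*%200_BAKER", "**EG1_MID",
                                                   "*%200_BAKER"))

# Keep every row of trayectorias and attach the matching row of segments
result <- dplyr::left_join(x = trayectorias,
                           y = segments,
                           by = c("seg_ID_trayectorias" = "seg_ID"))

# Compare with the expected values from the "solution" data.frame
identical(result$Rumbo_circular, c(53, 297, 53, 335, 137, 321, 137))
#> [1] TRUE
```

Because a left join keeps every row of its first argument, repeated names in "trayectorias" each receive their own copy of the matching value, and the result has the same number of rows as "trayectorias" as long as seg_ID is unique in "segments".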

system · November 6, 2019, 8:36am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.