Assigning new columns to partial data frames

So say I had my dataframes as outlined below:

proteinid<-c('Replication Factor','Ferredoxin','Stabilin','DNA-binding')
untreated<-c(0.1,-3,2.1,3)
treated<-c(2.1,-1,0.3,2)
proteinid2<-c('Replication Factor','Ferredoxin','Stabilin')
protein_function<-c('Transcription','ETC','Protein Folding')

numeric_data<-data.frame(proteinid,untreated,treated)
protein_roles<-data.frame(proteinid2,protein_function)

And I wanted to merge the two data frames, but I know one data.frame has three rows and the other has four. So I know this would leave some of the columns blanks, which is fine and what I want. When I try to just combine them I get an error due to the difference in the number of rows. Ideally I will be doing this on a much larger dataset imported from excel, so I wanted to see if there was a quick way to do this.

Is this what you have in mind?

library(dplyr)

out = left_join(numeric_data, protein_roles,
                by = c('proteinid' = 'proteinid2'))

out
#>            proteinid untreated treated protein_function
#> 1 Replication Factor       0.1     2.1    Transcription
#> 2         Ferredoxin      -3.0    -1.0              ETC
#> 3           Stabilin       2.1     0.3  Protein Folding
#> 4        DNA-binding       3.0     2.0             <NA>

Created on 2022-12-08 with reprex v2.0.2.9000

That should work, thanks. But now let's say it's a little different. And this should have probably been in my original note, I apologize for not including it. So what if I want the assignments to match to rows that have names containing phrases in the proteinid2, but may not be a complete match, similar to the following:

proteinid<-c('Replication Factor','Ferredoxin','Stabilin','DNA-binding','Replication Factor alpha')
untreated<-c(0.1,-3,2.1,3,2)
treated<-c(2.1,-1,0.3,2,1.4)
proteinid2<-c('Replication Factor','Ferredoxin','Stabilin')
protein_function<-c('Transcription','ETC','Protein Folding')

numeric_data<-data.frame(proteinid,untreated,treated)
protein_roles<-data.frame(proteinid2,protein_function)

So for example, I want all rows that contain the phrase "Replication Factor" in proteinid to have the correct protein_function assignment, even if it is not an exact match. So both "Replication Factor" and "Replication Factor alpha" have the assignment "Transcription"

inexact string matching is an art not a science, and what best to do will be context dependent; it will be up to you to define a cutoff criteria for similarity. but you can use use the tools in stringdist package to solve this.

I will define my own cut offs for the matching, but I just wanted to know how I would do it. I've never used the stringdist package, how would you set it up for this particular example?

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.