Assigning new columns to partial data frames

ctheisen · December 8, 2022, 4:37pm

So say I had my dataframes as outlined below:

proteinid<-c('Replication Factor','Ferredoxin','Stabilin','DNA-binding')
untreated<-c(0.1,-3,2.1,3)
treated<-c(2.1,-1,0.3,2)
proteinid2<-c('Replication Factor','Ferredoxin','Stabilin')
protein_function<-c('Transcription','ETC','Protein Folding')

numeric_data<-data.frame(proteinid,untreated,treated)
protein_roles<-data.frame(proteinid2,protein_function)

And I wanted to merge the two data frames, but I know one data.frame has three rows and the other has four. So I know this would leave some of the columns blanks, which is fine and what I want. When I try to just combine them I get an error due to the difference in the number of rows. Ideally I will be doing this on a much larger dataset imported from excel, so I wanted to see if there was a quick way to do this.

scottyd22 · December 8, 2022, 4:45pm

Is this what you have in mind?

library(dplyr)

out = left_join(numeric_data, protein_roles,
                by = c('proteinid' = 'proteinid2'))

out
#>            proteinid untreated treated protein_function
#> 1 Replication Factor       0.1     2.1    Transcription
#> 2         Ferredoxin      -3.0    -1.0              ETC
#> 3           Stabilin       2.1     0.3  Protein Folding
#> 4        DNA-binding       3.0     2.0             <NA>

Created on 2022-12-08 with reprex v2.0.2.9000

ctheisen · December 8, 2022, 4:54pm

That should work, thanks. But now let's say it's a little different. And this should have probably been in my original note, I apologize for not including it. So what if I want the assignments to match to rows that have names containing phrases in the proteinid2, but may not be a complete match, similar to the following:

proteinid<-c('Replication Factor','Ferredoxin','Stabilin','DNA-binding','Replication Factor alpha')
untreated<-c(0.1,-3,2.1,3,2)
treated<-c(2.1,-1,0.3,2,1.4)
proteinid2<-c('Replication Factor','Ferredoxin','Stabilin')
protein_function<-c('Transcription','ETC','Protein Folding')

numeric_data<-data.frame(proteinid,untreated,treated)
protein_roles<-data.frame(proteinid2,protein_function)

So for example, I want all rows that contain the phrase "Replication Factor" in proteinid to have the correct protein_function assignment, even if it is not an exact match. So both "Replication Factor" and "Replication Factor alpha" have the assignment "Transcription"

nirgrahamuk · December 8, 2022, 5:04pm

inexact string matching is an art not a science, and what best to do will be context dependent; it will be up to you to define a cutoff criteria for similarity. but you can use use the tools in stringdist package to solve this.

ctheisen · December 8, 2022, 7:08pm

I will define my own cut offs for the matching, but I just wanted to know how I would do it. I've never used the stringdist package, how would you set it up for this particular example?

system · December 29, 2022, 7:08pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.