Create loop to generate rows from newly generated rows

Hello everyone,

I'm new to R and I am trying to write a loop! (:

I have a data frame, read from a CSV file, that contains about 11 columns and hundreds of rows. One of the columns ('Modifications') has text like the following:

"1xTMT6plex [K18]; 1xTMT6plex [N-Term]; 1xPhospho [S1(99.9); S20(100)]"

I want to extract "1xPhospho [S1(99.9); S20(100)]" from the "modification" column into a new column (which I will call "Phospho"), keep only the "1xPhospho [S1(99.9)]" part there, and then create a new row ("Phospho_2") with "1xPhospho [S20(100)]", with the values of all the other columns copied from the original row.

In some cases more than 2 rows will be needed ("Phospho_3" and so on), because the Phospho entry can list several sites of the "S1(99.9)" type.

When these new rows are created, their ID (in the Accession column) should be the same as the original but with _1 ... _n appended. Finally, the original row containing "1xPhospho [S1(99.9); S20(100)]" should be deleted.
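
To make the goal concrete, here is a minimal sketch of those steps applied to a single value, using base R only. The accession "P12345" is a made-up ID for illustration, and the sketch assumes the sites inside the brackets are always separated by "; ". It copies the "1xPhospho" prefix unchanged onto every new value; whether a "2xPhospho" entry should be relabelled after splitting is left open.

```r
# One example value from the modification column, with a made-up accession ID.
modification <- "1xTMT6plex [K18]; 1xTMT6plex [N-Term]; 1xPhospho [S1(99.9); S20(100)]"
accession <- "P12345"  # hypothetical ID, for illustration only

# 1. Pull out the Phospho block, e.g. "1xPhospho [S1(99.9); S20(100)]".
phospho <- regmatches(modification,
                      regexpr("[0-9]xPhospho \\[[^]]*\\]", modification))

# 2. Separate the prefix ("1xPhospho") from the sites inside the brackets.
prefix <- sub(" \\[.*", "", phospho)
sites <- strsplit(sub(".*\\[(.*)\\]", "\\1", phospho), "; ", fixed = TRUE)[[1]]

# 3. Build one Phospho value per site and one suffixed ID per new row.
new_phospho <- paste0(prefix, " [", sites, "]")
new_accession <- paste0(accession, "_", seq_along(sites))

print(data.frame(accession = new_accession, Phospho = new_phospho))
```

The two resulting rows would replace the original one; the same idea has to be repeated for every row of the data frame that has more than one site.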

I have the code for one specific case, but I would like to have it working for all the cases that are in the dataframe.

Here is the code I have now:

pmap[,"Phospho"] <- sub(".+([0-9]xPhospho [[:punct:]][A-Z0-9()/.; ]+[[:punct:]]).*", "\\1", pmap$modification)
tmp_df <- pmap[grep("[;]", pmap$Phospho),]
tmp_df$Phospho <- sub("([0-9]xPhospho [[:punct:]])[A-Z0-9()/.]+[;] (.+)", "\\1\\2", tmp_df$Phospho)
tmp_df$accession <- paste(tmp_df$accession, 1, sep = "_")
pmap <- rbind(pmap, tmp_df)
pmap[grep("[;]", pmap$Phospho),]
pmap[19, "Phospho"] <- sub("([0-9]xPhospho [[:punct:]])([A-Z0-9()/.]+)[;] .+", "\\1\\2]", pmap$Phospho[19])

Help!

Question: does the column 'Modifications' always have the same format? And is the 1xPhospho... always followed by exactly 2 values, or can it differ?

Hi, Pieter, thanks for your comment. The modifications column contains text and numbers that may vary slightly. Here are two examples: "1xTMT6plex [K15]; 1xTMT6plex [N-Term]; 1xPhospho [S5(100)]" and "1xOxidation [M14]; 1xTMT6plex [K16]; 1xTMT6plex [N-Term]; 1xPhospho [S11(99.5)]".

The "1xPhospho [S1(99.9)]" part will always have the form "[letter-number(", but the value between the parentheses can vary from 20 to 100, with up to one digit after the decimal point, so it could be, for example, "[T9(100)]". And "1xPhospho" can also be "2xPhospho".
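
As a rough illustration of that format, the pattern below is one possible way to match a single site and break it into its parts with base R. The names site_pattern and parsed are made up for this sketch, and the pattern does not check that the number in parentheses actually lies between 20 and 100.

```r
# Pattern for one site as described above: a capital letter, a position,
# and a value in parentheses with an optional single decimal digit.
site_pattern <- "([A-Z])([0-9]+)\\(([0-9]+(\\.[0-9])?)\\)"

phospho <- c("1xPhospho [S1(99.9); S20(100)]", "2xPhospho [T9(100)]")

# Find every site in every string (one list element per input string).
hits <- regmatches(phospho, gregexpr(site_pattern, phospho))

# Break each site into residue, position and the value in parentheses.
all_sites <- unlist(hits)
parsed <- data.frame(
  site     = all_sites,
  residue  = sub(site_pattern, "\\1", all_sites),
  position = as.integer(sub(site_pattern, "\\2", all_sites)),
  value    = as.numeric(sub(site_pattern, "\\3", all_sites)),
  stringsAsFactors = FALSE
)
print(parsed)
```

Counting the matches per string (lengths(hits)) also tells you how many new rows each original row would need.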

Hi,

Try and run this code and see if it produces what you want:

library("stringr")
options(stringsAsFactors = FALSE) #important for frame merging (already the default since R 4.0.0)

#Create fake data (replace the name of the data frame with yours)
myData = data.frame(V1 = 1:3, V2 = LETTERS[1:3], Modifications = c(
  "1xTMT6plex [K15]; 1xTMT6plex [N-Term]; 1xPhospho [S5(100)]",
  "1xOxidation [M14]; 1xTMT6plex [K16]; 1xTMT6plex [N-Term]; 1xPhospho [S11(99.5)]",
  "1xTMT6plex [K18]; 1xTMT6plex [N-Term]; 1xPhospho [S1(99.9); S20(100)]"
), stringsAsFactors = FALSE)

#Extract Phospho part from Modifications
phospho = str_extract(myData$Modifications, "\\dxPhospho.*")

#Remove the Modifications column
myData = myData[,-which(colnames(myData) == "Modifications")]

#Build the new rows 
# do.call(rbind, lapply(...)) does the same job as a for-loop, but binds all the pieces in one go
myData = do.call(rbind, lapply(seq_along(phospho), function(i){
  #See if there are more than 1 phospho in each row
  myList = unlist(strsplit(phospho[i], "; "))
  
  #If so, create the new values separately
  if(length(myList)>1){
    newList = list()
    myPrefix = str_extract(myList[1], "\\dxPhospho \\[")
    newList[1] = paste(myList[1], "]", sep = "")
    newList[2:length(myList)] = paste(myPrefix, myList[2:length(myList)], sep = "")
    myList = newList
  }
  
  #For each phospho, create row with rest of columns from original data
  cbind(myData[i,], data.frame(Modifications = unlist(myList)))
  
}))

Grtz,
PJ

Hi Pieter! The result I get is a data frame in which the Modifications column contains only the phospho information. I guess the cbind is not working? Here is the result I have:

       V1 V2         Modifications
    1   1  A   1xPhospho [S5(100)]
    2   2  B 1xPhospho [S11(99.5)]
    11  3  C  1xPhospho [S1(99.9)]
    21  3  C  1xPhospho [S20(100)]

Hi,

I understood that you only wanted to keep that info and disregard the rest of the value in that column.

For example: "1xTMT6plex [K18]; 1xTMT6plex [N-Term]; 1xPhospho [S1(99.9); S20(100)]"
Only keep 1xPhospho [S1(99.9); S20(100)] and split it.
If you would like to keep the original, it's an easy fix.

PJ
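
For reference, one possible way to keep the original Modifications column is sketched below. It is not the code from the answer above: it uses base R only, adds the split values as a separate Phospho column, and repeats each row once per site. The accession values are made up, and two assumptions are built in: every row contains a Phospho block, and the _1 ... _n suffix is added only to rows that were actually split.

```r
# Same example strings as above, plus a hypothetical accession column.
myData <- data.frame(
  accession = c("A1", "B2", "C3"),
  Modifications = c(
    "1xTMT6plex [K15]; 1xTMT6plex [N-Term]; 1xPhospho [S5(100)]",
    "1xOxidation [M14]; 1xTMT6plex [K16]; 1xTMT6plex [N-Term]; 1xPhospho [S11(99.5)]",
    "1xTMT6plex [K18]; 1xTMT6plex [N-Term]; 1xPhospho [S1(99.9); S20(100)]"
  ),
  stringsAsFactors = FALSE
)

# Extract the Phospho block from each row.
phospho <- regmatches(myData$Modifications,
                      regexpr("[0-9]xPhospho \\[[^]]*\\]", myData$Modifications))
stopifnot(length(phospho) == nrow(myData))  # assumption: every row has one

# Prefix ("1xPhospho") and the individual sites of each block.
prefix <- sub(" \\[.*", "", phospho)
sites <- regmatches(phospho, gregexpr("[A-Z][0-9]+\\([0-9.]+\\)", phospho))

# Repeat each original row once per site; Modifications stays untouched.
n_sites <- lengths(sites)
result <- myData[rep(seq_len(nrow(myData)), n_sites), ]

# New Phospho column with one site per row.
result$Phospho <- paste0(rep(prefix, n_sites), " [", unlist(sites), "]")

# Suffix the accession only for rows that came from a split (assumption).
suffix <- sequence(n_sites)
was_split <- rep(n_sites > 1, n_sites)
result$accession[was_split] <- paste0(result$accession[was_split], "_",
                                      suffix[was_split])
rownames(result) <- NULL
print(result)
```

Because the original row is never copied as-is when it has several sites, there is no separate step for deleting it afterwards.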