Avoid duplicated information

I have a large set of line names that I would like to reduce by grouping. The beginning of each name stands for the starting point of the line and the end for its last point (start and end are separated by "_"). Some line names are duplicated because they are written twice or more, only in the opposite direction, and I would like to group those duplicates under the same category. Here is an example:
Let's imagine these 4 words:
word1 <- "AAAA_AAAB"
word2 <- "AAAB_AAAA"
word3 <- "AAAA_AAAC"
word4 <- "AAAC_AAAA"
For me, word1 and word2 are the same line: word1 is one direction and word2 the opposite one, but both are line AAAA_AAAB. The same happens with word3 and word4: they are the same line, just written the opposite way.
Therefore, I would like to create a column where each line is named only once, that is, only one name is kept and its opposite is treated as the same line. In the example above, word1 and word2 are the same line, now called "AAAA_AAAB", because "AAAA_AAAB" comes alphabetically before "AAAB_AAAA". (I also have lines starting with symbols, such as "%" or "%%", which should sort before the letters.)
The solution should look something like this:
data.frame(stringsAsFactors = FALSE,
           ID = c("%%EE_AAAA", "AAAA_AAAB", "AAAC_AAAE", "AAAA %%EE",
                  "AAAB_AAAA", "AAAE_AAAC"),
           ID_Filtered = c("%%EE_AAAA", "AAAA_AAAB", "AAAC_AAAE", "%%EE_AAAA",
                           "AAAA_AAAB", "AAAC_AAAE")
)
ID is the name before grouping, and ID_Filtered is the line name once only one direction is kept (alphanumeric order).

Hi,

Here is one possible way to get this:

library(stringr)
library(dplyr)

myData = data.frame(stringsAsFactors=FALSE,
           ID = c("%%EE_AAAA", "AAAA_AAAB", "AAAC_AAAE", "AAAA_%%EE",
                  "AAAB_AAAA", "AAAE_AAAC"))

myData = myData %>% mutate(newID = sapply(ID, function(x){
  x = unlist(str_split(x, "_")) #split the ID by _
  paste(sort(x), collapse = "_") #sort the words alphabetically and paste again
}))

myData
         ID     newID
1 %%EE_AAAA %%EE_AAAA
2 AAAA_AAAB AAAA_AAAB
3 AAAC_AAAE AAAC_AAAE
4 AAAA_%%EE %%EE_AAAA
5 AAAB_AAAA AAAA_AAAB
6 AAAE_AAAC AAAC_AAAE

This code does assume, however, that all IDs are split by an underscore (_).
I did change the example "AAAA %%EE" in the original list, as I assumed it was meant to be "AAAA_%%EE". If this is not the case, my method would not work and would need to be updated.
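
One more caveat: sort() on character vectors follows the collation rules of the current locale, so whether "%%EE" lands before "AAAA" can differ between machines. Below is a minimal base-R sketch that avoids this by using sort(method = "radix"), which sorts in C-locale (byte) order, where "%" comes before all letters. The function name canonical_line and its sep argument are made up for illustration and are not part of any package. Note that in C-locale order uppercase letters also sort before lowercase ones, which may or may not be what you want.

```r
# Hypothetical helper: map both directions of a line ("A_B" and "B_A")
# to a single canonical name.
canonical_line <- function(id, sep = "_") {
  # Split every ID on the literal separator (fixed = TRUE: no regex).
  parts <- strsplit(id, sep, fixed = TRUE)
  vapply(parts, function(p) {
    # method = "radix" sorts in C-locale order, independent of the
    # locale of the machine, so "%" sorts before letters.
    paste(sort(p, method = "radix"), collapse = sep)
  }, character(1))
}

ids <- c("%%EE_AAAA", "AAAA_AAAB", "AAAC_AAAE", "AAAA_%%EE",
         "AAAB_AAAA", "AAAE_AAAC")

canonical_line(ids)
# "%%EE_AAAA" "AAAA_AAAB" "AAAC_AAAE" "%%EE_AAAA" "AAAA_AAAB" "AAAC_AAAE"
```

Calling unique(canonical_line(ids)) would then give each line exactly once, and the function can be used inside mutate() in place of the sapply() call above.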

Hope this helps,
PJ
