I have a large set of line names that I would like to reduce by grouping. The first part of each name is the starting point of the line and the last part is its end point (start and end are separated by "_"). Some lines appear two or more times, written once in each direction, and I would like to group those duplicates under the same category. Here is an example:
Let's imagine these 4 words:
word1 <- "AAAA_AAAB"
word2 <- "AAAB_AAAA"
word3 <- "AAAA_AAAC"
word4 <- "AAAC_AAAA"
For me, word1 and word2 are the same line: word1 is written one way and word2 the opposite way, but both are line AAAA_AAAB. The same is true of word3 and word4; they are the same line, just written in opposite directions.
Therefore, I would like to create a column in which every line has exactly one name, so that a name and its reverse count as the same line. In the example above, word1 and word2 are the same line, named "AAAA_AAAB" because "AAAA_AAAB" comes alphabetically before "AAAB_AAAA". (I also have line names starting with symbols, such as "%" or "%%", which should sort before letters.)
The solution should look something like this:
data.frame(stringsAsFactors = FALSE,
           ID = c("%%EE_AAAA", "AAAA_AAAB", "AAAC_AAAE", "AAAA_%%EE",
                  "AAAB_AAAA", "AAAE_AAAC"),
           ID_Filtered = c("%%EE_AAAA", "AAAA_AAAB", "AAAC_AAAE", "%%EE_AAAA",
                           "AAAA_AAAB", "AAAC_AAAE")
)
ID is the name before grouping, and ID_Filtered is the line name once only one direction is kept (the one that comes first in alphanumeric order).
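To make the intended rule concrete, here is a minimal base R sketch of one way it could be computed: for each name, build the reversed name and keep whichever of the two sorts first. `canonical_line` is a hypothetical helper name, and the sketch assumes every name has the form start_end, with "_" as the only separator and no missing values. It uses `sort(method = "radix")` because that method orders character vectors in the C locale (byte order), so "%" sorts before letters whatever the session locale is.

```r
# Hypothetical helper: map a line name and its reverse to one canonical name.
# Assumes names look like "START_END" with `sep` as the only separator.
canonical_line <- function(id, sep = "_") {
  parts <- strsplit(id, sep, fixed = TRUE)
  vapply(parts, function(p) {
    # The name as written and the name written in the opposite direction.
    candidates <- c(paste(p, collapse = sep), paste(rev(p), collapse = sep))
    # Radix sort uses C-locale order, so symbols such as "%" precede letters.
    sort(candidates, method = "radix")[1]
  }, character(1))
}

df <- data.frame(
  ID = c("%%EE_AAAA", "AAAA_AAAB", "AAAC_AAAE",
         "AAAA_%%EE", "AAAB_AAAA", "AAAE_AAAC"),
  stringsAsFactors = FALSE
)
df$ID_Filtered <- canonical_line(df$ID)
print(df)
```

Once the column exists, `unique(df$ID_Filtered)` would give one name per line. Note that C-locale order also places digits before letters and uppercase before lowercase, which may or may not match what is wanted for mixed-case names.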