How to create new variables using "add_column"?

Jack19 · November 22, 2020, 5:02pm

Hello, I have a data frame as following (I use "," to separate columns):
Marker, Alleles, Line1, Line2, Line3, Line4, Line5, Line6, Line7
1, C/G, C, Y, C, C, G, Y, N
2, A/T, A, T, T, N, K, T, A
3, G/A, A, N, G, G, G, A, X,

I would like to use "add_column" to add new variables: Allele_A = the first letter in Alleles column, Allele_B = the second letter in the Alleles column, Allele_missing = "N", Allele_H = all other letters (Y, K, or X...) that appears among Lines 1-7.
I know to how to assign the first three variables: Newfile<-add_column(file, Allele_A=substring(file$Alleles,1,1), Allele_B=substring(file$Alleles,3,3), Allele_missing="N", ), but I am not sure how to assign the fourth one (Allele_H)
Can anyone help on it? Thanks a lot!

Jack

AlexisW · November 23, 2020, 3:40am

If you use the tidyverse, I don't think add_column() is the best way, since you need to always reference the original data frame (file in your example), it's more natural to use mutate():

library(tidyverse)

read_csv("Marker, Alleles, Line1, Line2, Line3, Line4, Line5, Line6, Line7
1, C/G, C, Y, C, C, G, Y, N
2, A/T, A, T, T, N, K, T, A
3, G/A, A, N, G, G, G, A, X") %>%
  mutate(Allele_A = substring(Alleles,1,1),
         Allele_B = substring(Alleles,3,3),
         Allele_missing = "N")

Then, for Allele_H, it's a bit more complicated. If I understand correctly, you want to take all values in Line1 to Line7, select the unique() ones, and concatenate them together with paste0(.x, collapse=""). Since you don't want to name each of the columns manually, you will want to select them using e.g. starts_with("Line"), that means you have to use a function that accepts selecting, such as across(). The additional difficulty is that you need to work on rows, since a standard mutate() would collapse each column. So that gives this somewhat big expression:

library(tidyverse)

read_csv("Marker, Alleles, Line1, Line2, Line3, Line4, Line5, Line6, Line7
1, C/G, C, Y, C, C, G, Y, N
2, A/T, A, T, T, N, K, T, A
3, G/A, A, N, G, G, G, A, X") %>%
  mutate(Allele_A=substring(Alleles,1,1),
         Allele_B=substring(Alleles,3,3),
         Allele_missing="N") %>%
  rowwise() %>%
  mutate(Allele_H = paste0(unique(c_across(starts_with("Line"))),collapse=""))

Where the last column is:

  ... %>%
  pull(Allele_H)
#> [1] "CYGN" "ATNK" "ANGX"

Of course, you can separate the function that generate Allele_H, especially if you want to add some sorting or some complex operation:

get_H <- function(vec){
  c_across(all_of(vec)) %>%
    unique() %>%
    sort() %>%
    paste0(collapse = "")
}

read_csv("Marker, Alleles, Line1, Line2, Line3, Line4, Line5, Line6, Line7
1, C/G, C, Y, C, C, G, Y, N
2, A/T, A, T, T, N, K, T, A
3, G/A, A, N, G, G, G, A, X") %>%
  mutate(Allele_A=substring(Alleles,1,1),
         Allele_B=substring(Alleles,3,3),
         Allele_missing="N") %>%
  rowwise() %>%
  mutate(Allele_H = get_H(starts_with("Line"))) %>%
  pull(Allele_H)
#> [1] "CGNY" "AKNT" "AGNX"

Jack19 · November 23, 2020, 4:12pm

Thank you very much, AlexisW!
I might haven't explained my case very well. In my example, I would like to get "Y", "K", and "X" for Allele_H in row 1, 2, and 3. In my case, each row has a letter that is different from Allele_A, Allele_B, and Allele_missing. I want to identify that letter from each row and then assign it to variable Allele_H. I am not sure how to do it.

AlexisW · November 23, 2020, 6:11pm

You can try using setdiff() or %in%. This should work:

library(tidyverse)


get_H <- function(){
  res <- c_across(starts_with("Line")) %>%
    setdiff(c_across(starts_with("Allele")))
  
  stopifnot(length(res) == 1)
  
  res
}

read_csv("Marker, Alleles, Line1, Line2, Line3, Line4, Line5, Line6, Line7
1, C/G, C, Y, C, C, G, Y, N
2, A/T, A, T, T, N, K, T, A
3, G/A, A, N, G, G, G, A, X") %>%
  mutate(Allele_A=substring(Alleles,1,1),
         Allele_B=substring(Alleles,3,3),
         Allele_missing="N") %>%
  rowwise() %>%
  mutate(Allele_H = get_H())

AlexisW · November 23, 2020, 6:16pm

Or with a cleaner call (actually I'm not sure why the previous one worked):

get_H <- function(lines, backgd){
  res <- setdiff(lines, backgd)
  
  stopifnot(length(res) == 1)
  
  res
}

read_csv("Marker, Alleles, Line1, Line2, Line3, Line4, Line5, Line6, Line7
1, C/G, C, Y, C, C, G, Y, N
2, A/T, A, T, T, N, K, T, A
3, G/A, A, N, G, G, G, A, X") %>%
  mutate(Allele_A=substring(Alleles,1,1),
         Allele_B=substring(Alleles,3,3),
         Allele_missing="N") %>%
  rowwise() %>%
  mutate(Allele_H = get_H(c_across(starts_with("Line")),
                          c_across(starts_with("Allele"))))

Jack19 · November 23, 2020, 8:50pm

Thank you, AlexisW! I tried the last one you sent and it worked. I have one more question about "starts_with ("Line")". In this case, all the columns that I want to search across have the common name "Line". But if all those columns' name don't have a common feature, how should I modify this code if I want to use column numbers (i.e. start from column 3, or from column 3 to 9)?

Jack19 · November 23, 2020, 9:12pm

AlexisW,
What should I add in the mutate code if I want put to these new columns before the third column?

Thanks a lot!

AlexisW · November 23, 2020, 9:41pm

Take a look here, you can just use the column numbers, e.g. 3:9.

You probably want the .before argument to mutate: mutate(Allele_A=substring(Alleles,1,1), .before=1). Alternatively, it may be a good idea to apply select() at the end of the pipeline to both select the columns to keep and order them:

data %>%
  mutate(...) %>%
  select(colC, colA, colB)

Jack19 · November 23, 2020, 10:46pm

AlexisW,

I tried different ways to use 3:9 to replace c_across(starts_with("Line")), but all failed. Could you be more detailed about this?

Thanks!

Jack

AlexisW · November 23, 2020, 10:55pm

c_across(3:9) since you need to use it inside a selecting function such as c_across(). The downside of selecting by column number is that it changes if the columns are reordered:

read_csv("Marker, Alleles, Line1, Line2, Line3, Line4, Line5, Line6, Line7
1, C/G, C, Y, C, C, G, Y, N
2, A/T, A, T, T, N, K, T, A
3, G/A, A, N, G, G, G, A, X") %>%
  mutate(Allele_A=substring(Alleles,1,1),
         Allele_B=substring(Alleles,3,3),
         Allele_missing="N",
         .before= "Marker") %>%
  rowwise() %>%
  mutate(Allele_H = get_H(c_across(6:12),
                          c_across(1:3)),
         .before = "Marker")

Jack19 · November 23, 2020, 11:15pm

AlexisW,
I got it now. Thank you for your patience! I learned a lot from you! It is a great help!

Jack

AlexisW · November 23, 2020, 11:54pm

Great!

To be honest the across() and selection aspects are among the hardest ones in the dplyr/tidyverse approach.

johnwilldonicely · November 26, 2020, 4:04pm

You don't need tidyverse, mutate, or anything other than Rbase commands to do this. The command is paste(), as follows:

 file$Allele_all= paste(file$Line1,file$Line2,file$Line3,file$Line4,file$Line5,file$Line6,file$Line7)

system · December 3, 2020, 4:04pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.