Reshaping a complex dataset

Lilip · February 4, 2022, 8:34pm

I have the following dataframe:

SNPs<-c("SNP_A", "SNP_B","SNP_C")
pop1_allele_count<-c(1,3,5)
pop1_allele_number<-c(100,100,100)
pop2_allele_count<-c(5, 7, 8)
pop2_allele_number<-c(100, 100, 100)
pop3_allele_count<-c(2, 3, 4)
pop3_allele_number<-c(100, 100, 100)
main<-data.frame(SNPs,pop1_allele_count,
pop1_allele_number,
pop2_allele_count,
pop2_allele_number,
pop3_allele_count,
pop3_allele_number)

Any ideas on how can I:

transpose "pop1_allele_count", "pop2_allele_count", "pop3_allele_count" as "count" variable and in another column, "pop1_allele_number", "pop2_allele_number", "pop3_allele_number" as "number" variable
SNPs variable is replicated by the number of "count" and "number"

The result I'd like to produce:

I have tried reshape, apply, gather, stack... but had no sucess

Thanks for your help

AlexisW · February 5, 2022, 2:08am

I find the functions pivot_wider() and pivot_longer() in the package tidyr to be the most explicit for these operations. There is a simple introduction in r4ds, and more complex usecases in that article.

In your case, you want to do 2 things: bring pop from the names to a column, and bring SNPs from a column to the names. So you can apply these two transformations successively:

library(tidyr)

SNPs<-c("SNP_A", "SNP_B","SNP_C")
pop1_allele_count<-c(1,3,5)
pop1_allele_number<-c(100,100,100)
pop2_allele_count<-c(5, 7, 8)
pop2_allele_number<-c(100, 100, 100)
pop3_allele_count<-c(2, 3, 4)
pop3_allele_number<-c(100, 100, 100)
main<-data.frame(SNPs,pop1_allele_count,
                 pop1_allele_number,
                 pop2_allele_count,
                 pop2_allele_number,
                 pop3_allele_count,
                 pop3_allele_number)



main
#>    SNPs pop1_allele_count pop1_allele_number pop2_allele_count
#> 1 SNP_A                 1                100                 5
#> 2 SNP_B                 3                100                 7
#> 3 SNP_C                 5                100                 8
#>   pop2_allele_number pop3_allele_count pop3_allele_number
#> 1                100                 2                100
#> 2                100                 3                100
#> 3                100                 4                100

pivot_longer(main,
             -SNPs,
             names_to = c("pop", "type"),
             names_pattern = "(pop[\\d])_([a-z_]+)",
             values_to = "vals") |>
  pivot_wider(id_cols = pop,
              names_from = c("type", "SNPs"),
              names_sep = "_",
              values_from = vals)
#> # A tibble: 3 x 7
#>   pop   allele_count_SNP_A allele_number_SNP_A allele_count_SN~ allele_number_S~
#>   <chr>              <dbl>               <dbl>            <dbl>            <dbl>
#> 1 pop1                   1                 100                3              100
#> 2 pop2                   5                 100                7              100
#> 3 pop3                   2                 100                3              100
#> # ... with 2 more variables: allele_count_SNP_C <dbl>,
#> #   allele_number_SNP_C <dbl>

^{Created on 2022-02-04 by the reprex package (v2.0.1)}

Lilip · February 5, 2022, 4:06pm

AlexisW:

pivot_longer(main,
             -SNPs,
             names_to = c("pop", "type"),
             names_pattern = "(pop[\\d])_([a-z_]+)",
             values_to = "vals") |>
  pivot_wider(id_cols = pop,
              names_from = c("type", "SNPs"),
              names_sep = "_",
              values_from = vals)

Thank you , Alexis. It really helped. I wasnt aware of these tidyr functions. It was clever how you thought about using pivot_longer and then pivot_winder.

system · February 26, 2022, 4:06pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.