How to transpose data frame and retain variable data types?

Leon · November 27, 2019, 6:38pm

This question pertains to various genomics file formats, which can be generalised as follows. Say I get a file with this format:

d = data.frame(
  VELQF = c("site_B", 29, 94, 32, 22, 58, TRUE),
  WZGHL = c("site_B", NA, 95, 60, 27,  2, FALSE),
  UAPVG = c("site_A", 70, 29, 17,  8, 23, FALSE),
  THDKZ = c("site_A", 63, 26, 73, 12, 47, TRUE),
  BEDFM = c("site_B", 25, 67, 25, 81, 16, FALSE),
  YUDHX = c("NA", 66, 83, 97, 72, 10, NA),
  WZHQT = c("site_A", 16, 37, 30, 45, 55, TRUE),
  KLNPS = c("site_B",  5, 72, NA, 25, 67, FALSE),
  MDPGC = c("site_B", 64, 68, 12, 85, 83, TRUE),
  SAPHN = c("site_A", 93, 79, 51, 32, 29, TRUE),
  row.names = c("s_id", "x1", "x2", "x3", "x4", "x5", "flag")
)

Which looks like so:

> d
      VELQF  WZGHL  UAPVG  THDKZ  BEDFM YUDHX  WZHQT  KLNPS  MDPGC  SAPHN
s_id site_B site_B site_A site_A site_B    NA site_A site_B site_B site_A
x1       29   <NA>     70     63     25    66     16      5     64     93
x2       94     95     29     26     67    83     37     72     68     79
x3       32     60     17     73     25    97     30   <NA>     12     51
x4       22     27      8     12     81    72     45     25     85     32
x5       58      2     23     47     16    10     55     67     83     29
flag   TRUE  FALSE  FALSE   TRUE  FALSE  <NA>   TRUE  FALSE   TRUE   TRUE

That is, rows are variables and columns are observations, which is wrong: it should be the other way around.

Using only base R, how do I transpose d while retaining variable data type?

I can do this:

d_t = as.data.frame(x = t(d), stringsAsFactors = FALSE)

Yielding:

> d_t
        s_id   x1 x2   x3 x4 x5  flag
VELQF site_B   29 94   32 22 58  TRUE
WZGHL site_B <NA> 95   60 27  2 FALSE
UAPVG site_A   70 29   17  8 23 FALSE
THDKZ site_A   63 26   73 12 47  TRUE
BEDFM site_B   25 67   25 81 16 FALSE
YUDHX     NA   66 83   97 72 10  <NA>
WZHQT site_A   16 37   30 45 55  TRUE
KLNPS site_B    5 72 <NA> 25 67 FALSE
MDPGC site_B   64 68   12 85 83  TRUE
SAPHN site_A   93 79   51 32 29  TRUE

Which looks good until you do

> str(d_t)
'data.frame':	10 obs. of  7 variables:
 $ s_id: chr  "site_B" "site_B" "site_A" "site_A" ...
 $ x1  : chr  "29" NA "70" "63" ...
 $ x2  : chr  "94" "95" "29" "26" ...
 $ x3  : chr  "32" "60" "17" "73" ...
 $ x4  : chr  "22" "27" "8" "12" ...
 $ x5  : chr  "58" "2" "23" "47" ...
 $ flag: chr  "TRUE" "FALSE" "FALSE" "TRUE" ...

I can then try to save this by doing:

> apply(d_t, 2, as.numeric)
      s_id x1 x2 x3 x4 x5 flag
 [1,]   NA 29 94 32 22 58   NA
 [2,]   NA NA 95 60 27  2   NA
 [3,]   NA 70 29 17  8 23   NA
 [4,]   NA 63 26 73 12 47   NA
 [5,]   NA 25 67 25 81 16   NA
 [6,]   NA 66 83 97 72 10   NA
 [7,]   NA 16 37 30 45 55   NA
 [8,]   NA  5 72 NA 25 67   NA
 [9,]   NA 64 68 12 85 83   NA
[10,]   NA 93 79 51 32 29   NA
Warning messages:
1: In apply(d_t, 2, as.numeric) : NAs introduced by coercion
2: In apply(d_t, 2, as.numeric) : NAs introduced by coercion

Which saves the numerics, but I lose the row names, and the s_id and flag variables are now all NA.
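
A possible explanation, sketched on a small stand-in data frame rather than the full d_t: apply() first coerces the data frame to a matrix and then returns a matrix, so the data frame structure and its row names are not carried over. Assigning converted columns back one at a time keeps both. The column list num_cols is hardcoded here purely for illustration.

```r
# Small stand-in for the transposed data: every column is character after t()
d_t <- data.frame(
  s_id = c("site_B", "site_A"),
  x1   = c("29", NA),
  flag = c("TRUE", "FALSE"),
  row.names = c("VELQF", "UAPVG"),
  stringsAsFactors = FALSE
)

# apply() works on as.matrix(d_t) and hands back a plain matrix
m <- suppressWarnings(apply(d_t, 2, as.numeric))
is.matrix(m)
rownames(m)

# Converting column by column keeps the data frame and its row names
num_cols <- c("x1")  # assumption: the numeric columns are known in advance
d_t[num_cols] <- lapply(d_t[num_cols], as.numeric)
d_t$flag <- as.logical(d_t$flag)
str(d_t)
```

This only works when the type of each column is known up front, which is the limitation the question is about.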

Any nice and relatively clean base solutions?

technocrat · November 27, 2019, 6:52pm

Couple of complications in the example. d$row.names is null and d's columns are all factors. I'm sure you can squirrel each away in vectors for use in re-setting row and column names. One way to pivot in base:

d = data.frame(
  VELQF = c("site_B", 29, 94, 32, 22, 58, TRUE),
  WZGHL = c("site_B", NA, 95, 60, 27,  2, FALSE),
  UAPVG = c("site_A", 70, 29, 17,  8, 23, FALSE),
  THDKZ = c("site_A", 63, 26, 73, 12, 47, TRUE),
  BEDFM = c("site_B", 25, 67, 25, 81, 16, FALSE),
  YUDHX = c("NA", 66, 83, 97, 72, 10, NA),
  WZHQT = c("site_A", 16, 37, 30, 45, 55, TRUE),
  KLNPS = c("site_B",  5, 72, NA, 25, 67, FALSE),
  MDPGC = c("site_B", 64, 68, 12, 85, 83, TRUE),
  SAPHN = c("site_A", 93, 79, 51, 32, 29, TRUE),
  row.names = c("s_id", "x1", "x2", "x3", "x4", "x5", "flag")
)

d_new <- t(matrix(d))

d_new
#>      [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]    
#> [1,] factor,7 factor,7 factor,7 factor,7 factor,7 factor,7 factor,7 factor,7
#>      [,9]     [,10]   
#> [1,] factor,7 factor,7

d_new[1,1]
#> [[1]]
#> [1] site_B 29     94     32     22     58     TRUE  
#> Levels: 22 29 32 58 94 site_B TRUE

Created on 2019-11-27 by the reprex package (v0.3.0)

Leon · November 29, 2019, 8:09am

No and yes

In creating d, I set:

row.names = c("s_id", "x1", "x2", "x3", "x4", "x5", "flag")

The code I state is doable if I hardcode, but creating a dynamic function for transposing a data.frame() is surprisingly cumbersome...

technocrat · November 29, 2019, 8:06pm

OK, this has been instructive for me and given me more insight for why base needed tidying.

Every column in d is an atomic vector. ∴ each element must be of the same class

v <- c(42,"towel", TRUE)
class(v)
#> [1] "character"

Created on 2019-11-29 by the reprex package (v0.3.0)

So no single column of d can be an atomic vector that contains the desired mix of integers, characters and booleans; c() coerces them all to one class.
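
The contrast between an atomic vector and a list can be sketched as follows. This is a minimal illustration reusing the values from the snippet above; the name l is introduced only for this example.

```r
# Atomic vector: c() coerces every element to a common class
v <- c(42, "towel", TRUE)
class(v)

# List: each element keeps its own class
l <- list(42, "towel", TRUE)
sapply(l, class)
```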

A base function to transpose d would have to go through an intermediate step of creating lists as column values.
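
One possible base-R sketch of such a function is shown below. It does not build list columns; instead it transposes through a character matrix and then lets type.convert() re-infer each column's type from its character values. The helper name transpose_df is hypothetical, and the data is a shortened version of d from the question.

```r
# Hypothetical helper: transpose a data frame whose rows are variables,
# then re-infer the type of each new column from its character values.
transpose_df <- function(df) {
  # t() goes through as.matrix(), so everything is character at this point
  out <- as.data.frame(t(df), stringsAsFactors = FALSE)
  # out[] <- ... replaces the columns but keeps row and column names
  out[] <- lapply(out, type.convert, as.is = TRUE)
  out
}

# Shortened version of the example data
d <- data.frame(
  VELQF = c("site_B", 29, 94, TRUE),
  WZGHL = c("site_B", NA, 95, FALSE),
  UAPVG = c("site_A", 70, 29, NA),
  row.names = c("s_id", "x1", "x2", "flag"),
  stringsAsFactors = FALSE
)

d_t <- transpose_df(d)
str(d_t)
```

Note that type.convert() guesses types from the text, so a column such as an identifier that merely looks numeric would also be converted. If that matters, the column classes would need to be supplied explicitly instead.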
