How to transpose a data frame and retain variable data types?

This question pertains to various genomics file formats, which can be generalised as follows. Say I get a file with this format:

d = data.frame(
  VELQF = c("site_B", 29, 94, 32, 22, 58, TRUE),
  WZGHL = c("site_B", NA, 95, 60, 27,  2, FALSE),
  UAPVG = c("site_A", 70, 29, 17,  8, 23, FALSE),
  THDKZ = c("site_A", 63, 26, 73, 12, 47, TRUE),
  BEDFM = c("site_B", 25, 67, 25, 81, 16, FALSE),
  YUDHX = c("NA", 66, 83, 97, 72, 10, NA),
  WZHQT = c("site_A", 16, 37, 30, 45, 55, TRUE),
  KLNPS = c("site_B",  5, 72, NA, 25, 67, FALSE),
  MDPGC = c("site_B", 64, 68, 12, 85, 83, TRUE),
  SAPHN = c("site_A", 93, 79, 51, 32, 29, TRUE),
  row.names = c("s_id", "x1", "x2", "x3", "x4", "x5", "flag")
)

Which looks like so:

> d
      VELQF  WZGHL  UAPVG  THDKZ  BEDFM YUDHX  WZHQT  KLNPS  MDPGC  SAPHN
s_id site_B site_B site_A site_A site_B    NA site_A site_B site_B site_A
x1       29   <NA>     70     63     25    66     16      5     64     93
x2       94     95     29     26     67    83     37     72     68     79
x3       32     60     17     73     25    97     30   <NA>     12     51
x4       22     27      8     12     81    72     45     25     85     32
x5       58      2     23     47     16    10     55     67     83     29
flag   TRUE  FALSE  FALSE   TRUE  FALSE  <NA>   TRUE  FALSE   TRUE   TRUE

That is, rows are variables and columns are observations, which is the wrong way around.

Using only base R, how do I transpose d while retaining variable data type?

I can do this:

d_t = as.data.frame(x = t(d), stringsAsFactors = FALSE)

Yielding:

> d_t
        s_id   x1 x2   x3 x4 x5  flag
VELQF site_B   29 94   32 22 58  TRUE
WZGHL site_B <NA> 95   60 27  2 FALSE
UAPVG site_A   70 29   17  8 23 FALSE
THDKZ site_A   63 26   73 12 47  TRUE
BEDFM site_B   25 67   25 81 16 FALSE
YUDHX     NA   66 83   97 72 10  <NA>
WZHQT site_A   16 37   30 45 55  TRUE
KLNPS site_B    5 72 <NA> 25 67 FALSE
MDPGC site_B   64 68   12 85 83  TRUE
SAPHN site_A   93 79   51 32 29  TRUE

Which looks good until you do

> str(d_t)
'data.frame':	10 obs. of  7 variables:
 $ s_id: chr  "site_B" "site_B" "site_A" "site_A" ...
 $ x1  : chr  "29" NA "70" "63" ...
 $ x2  : chr  "94" "95" "29" "26" ...
 $ x3  : chr  "32" "60" "17" "73" ...
 $ x4  : chr  "22" "27" "8" "12" ...
 $ x5  : chr  "58" "2" "23" "47" ...
 $ flag: chr  "TRUE" "FALSE" "FALSE" "TRUE" ...

I can then try to save this by doing:

> apply(d_t, 2, as.numeric)
      s_id x1 x2 x3 x4 x5 flag
 [1,]   NA 29 94 32 22 58   NA
 [2,]   NA NA 95 60 27  2   NA
 [3,]   NA 70 29 17  8 23   NA
 [4,]   NA 63 26 73 12 47   NA
 [5,]   NA 25 67 25 81 16   NA
 [6,]   NA 66 83 97 72 10   NA
 [7,]   NA 16 37 30 45 55   NA
 [8,]   NA  5 72 NA 25 67   NA
 [9,]   NA 64 68 12 85 83   NA
[10,]   NA 93 79 51 32 29   NA
Warning messages:
1: In apply(d_t, 2, as.numeric) : NAs introduced by coercion
2: In apply(d_t, 2, as.numeric) : NAs introduced by coercion

This recovers the numeric columns, but I lose the row names, and the s_id and flag variables are now all NA.

Any nice and relatively clean base solutions?

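A couple of complications in the example: d$row.names is NULL (row names are an attribute, not a column, so they are reached with rownames(d)), and d's columns are all factors (under the stringsAsFactors = TRUE default of R before 4.0). I'm sure you can squirrel each away in vectors for use in re-setting row and column names. One way to pivot in base: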

d = data.frame(
  VELQF = c("site_B", 29, 94, 32, 22, 58, TRUE),
  WZGHL = c("site_B", NA, 95, 60, 27,  2, FALSE),
  UAPVG = c("site_A", 70, 29, 17,  8, 23, FALSE),
  THDKZ = c("site_A", 63, 26, 73, 12, 47, TRUE),
  BEDFM = c("site_B", 25, 67, 25, 81, 16, FALSE),
  YUDHX = c("NA", 66, 83, 97, 72, 10, NA),
  WZHQT = c("site_A", 16, 37, 30, 45, 55, TRUE),
  KLNPS = c("site_B",  5, 72, NA, 25, 67, FALSE),
  MDPGC = c("site_B", 64, 68, 12, 85, 83, TRUE),
  SAPHN = c("site_A", 93, 79, 51, 32, 29, TRUE),
  row.names = c("s_id", "x1", "x2", "x3", "x4", "x5", "flag")
)

d_new <- t(matrix(d))

d_new
#>      [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]    
#> [1,] factor,7 factor,7 factor,7 factor,7 factor,7 factor,7 factor,7 factor,7
#>      [,9]     [,10]   
#> [1,] factor,7 factor,7

d_new[1,1]
#> [[1]]
#> [1] site_B 29     94     32     22     58     TRUE  
#> Levels: 22 29 32 58 94 site_B TRUE

Created on 2019-11-27 by the reprex package (v0.3.0)

No and yes :)

In creating d, I set:

row.names = c("s_id", "x1", "x2", "x3", "x4", "x5", "flag")

The approach I described is doable if I hardcode the column types, but writing a general function for transposing a data.frame() is surprisingly cumbersome...
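To illustrate what hardcoding means here, a minimal sketch on a three-row stand-in for d_t (a subset of the example above, built directly as character columns). Each column is converted by name, which works but has to be rewritten for every new file layout:

```r
# Three-row stand-in for the transposed, all-character d_t from the question
d_t <- data.frame(
  s_id = c("site_B", "site_A", "NA"),
  x1   = c("29", "70", "66"),
  flag = c("TRUE", "FALSE", NA),
  row.names = c("VELQF", "UAPVG", "YUDHX"),
  stringsAsFactors = FALSE
)

# Hardcoded conversion: every column and its target type is spelled out
d_t$x1   <- as.numeric(d_t$x1)
d_t$flag <- as.logical(d_t$flag)

# Row names survive because d_t is modified column by column,
# instead of being passed through apply(), which returns a matrix
str(d_t)
```

The drawback is that the column names and their intended types live in the code, not in the data.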


OK, this has been instructive for me and given me more insight into why base needed tidying.

Every column in d is an atomic vector, ∴ every element of a column must be of the same class:

v <- c(42,"towel", TRUE)
class(v)
#> [1] "character"

Created on 2019-11-29 by the reprex package (v0.3.0)
So no column of d can be a vector that contains the desired mix of integers, characters and booleans.

A base function to transpose d would have to go through an intermediate step of creating lists as column values.
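One possible sketch of such a function, not necessarily what was meant above: it relies on the fact that a data frame is itself a list of columns, so after t() has flattened everything to character, each new column can be re-typed on its own with type.convert(). The name transpose_df is hypothetical, and the three-column d below is a cut-down stand-in for the example data.

```r
# Hypothetical helper: transpose a data frame and re-infer column types
transpose_df <- function(df) {
  # t() goes through as.matrix(), so all cells become character here;
  # row names and column names are swapped and kept
  m <- t(df)
  out <- as.data.frame(m, stringsAsFactors = FALSE)
  # A data frame is a list of columns, so each one can be converted
  # separately; assigning into out[] keeps the row names
  out[] <- lapply(out, type.convert, as.is = TRUE)
  out
}

# Cut-down stand-in for d: variables in rows, observations in columns
d <- data.frame(
  VELQF = c("site_B", 29, 94, TRUE),
  WZGHL = c("site_B", NA, 95, FALSE),
  UAPVG = c("site_A", 70, 29, FALSE),
  row.names = c("s_id", "x1", "x2", "flag"),
  stringsAsFactors = FALSE
)

d_t <- transpose_df(d)
str(d_t)
```

With this input, x1 and x2 should come back as integer, flag as logical, and s_id as character. Note that type.convert() treats the string "NA" as missing by default (na.strings = "NA"), so a literal "NA" in s_id would become a real NA; whether that is wanted depends on the file format.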
