This question applies to various genomics file formats, which can be generalised as follows. Say I get a file with this format:
d = data.frame(
VELQF = c("site_B", 29, 94, 32, 22, 58, TRUE),
WZGHL = c("site_B", NA, 95, 60, 27, 2, FALSE),
UAPVG = c("site_A", 70, 29, 17, 8, 23, FALSE),
THDKZ = c("site_A", 63, 26, 73, 12, 47, TRUE),
BEDFM = c("site_B", 25, 67, 25, 81, 16, FALSE),
YUDHX = c("NA", 66, 83, 97, 72, 10, NA),
WZHQT = c("site_A", 16, 37, 30, 45, 55, TRUE),
KLNPS = c("site_B", 5, 72, NA, 25, 67, FALSE),
MDPGC = c("site_B", 64, 68, 12, 85, 83, TRUE),
SAPHN = c("site_A", 93, 79, 51, 32, 29, TRUE),
row.names = c("s_id", "x1", "x2", "x3", "x4", "x5", "flag")
)
This looks as follows:
> d
      VELQF  WZGHL  UAPVG  THDKZ  BEDFM YUDHX  WZHQT  KLNPS  MDPGC  SAPHN
s_id site_B site_B site_A site_A site_B    NA site_A site_B site_B site_A
x1       29   <NA>     70     63     25    66     16      5     64     93
x2       94     95     29     26     67    83     37     72     68     79
x3       32     60     17     73     25    97     30   <NA>     12     51
x4       22     27      8     12     81    72     45     25     85     32
x5       58      2     23     47     16    10     55     67     83     29
flag   TRUE  FALSE  FALSE   TRUE  FALSE  <NA>   TRUE  FALSE   TRUE   TRUE
I.e. rows are variables and columns are observations, which is the wrong way around.
Using only base R, how do I transpose d while retaining each variable's data type?
I can do this:
d_t = as.data.frame(x = t(d), stringsAsFactors = FALSE)
Yielding:
> d_t
        s_id   x1 x2   x3 x4 x5  flag
VELQF site_B   29 94   32 22 58  TRUE
WZGHL site_B <NA> 95   60 27  2 FALSE
UAPVG site_A   70 29   17  8 23 FALSE
THDKZ site_A   63 26   73 12 47  TRUE
BEDFM site_B   25 67   25 81 16 FALSE
YUDHX     NA   66 83   97 72 10  <NA>
WZHQT site_A   16 37   30 45 55  TRUE
KLNPS site_B    5 72 <NA> 25 67 FALSE
MDPGC site_B   64 68   12 85 83  TRUE
SAPHN site_A   93 79   51 32 29  TRUE
This looks good until you run:
> str(d_t)
'data.frame': 10 obs. of 7 variables:
$ s_id: chr "site_B" "site_B" "site_A" "site_A" ...
$ x1 : chr "29" NA "70" "63" ...
$ x2 : chr "94" "95" "29" "26" ...
$ x3 : chr "32" "60" "17" "73" ...
$ x4 : chr "22" "27" "8" "12" ...
$ x5 : chr "58" "2" "23" "47" ...
$ flag: chr "TRUE" "FALSE" "FALSE" "TRUE" ...
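As far as I can tell, the all-chr result is expected rather than a bug: each column of d was built with c() from mixed values, so it was already character, and t() on a data frame goes through a matrix, which can hold only one type. A minimal sketch of that behaviour (v and small are made-up names for illustration, not part of my data):

```r
# c() coerces mixed inputs to one common type; with a string present, that is character
v <- c("site_B", 29, TRUE)
is.character(v)

# Hypothetical two-variable stand-in for d
small <- data.frame(
  a = c("site_B", "29"),
  b = c("site_A", "70"),
  row.names = c("s_id", "x1"),
  stringsAsFactors = FALSE
)

# t() converts the data frame to a matrix first, so one type applies to every cell
m <- t(small)
is.matrix(m)
typeof(m)
```

So the type information has to be re-inferred after transposing; it cannot be carried through t() itself.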
I can then try to rescue this with:
> apply(d_t, 2, as.numeric)
      s_id x1 x2 x3 x4 x5 flag
 [1,]   NA 29 94 32 22 58   NA
 [2,]   NA NA 95 60 27  2   NA
 [3,]   NA 70 29 17  8 23   NA
 [4,]   NA 63 26 73 12 47   NA
 [5,]   NA 25 67 25 81 16   NA
 [6,]   NA 66 83 97 72 10   NA
 [7,]   NA 16 37 30 45 55   NA
 [8,]   NA  5 72 NA 25 67   NA
 [9,]   NA 64 68 12 85 83   NA
[10,]   NA 93 79 51 32 29   NA
Warning messages:
1: In apply(d_t, 2, as.numeric) : NAs introduced by coercion
2: In apply(d_t, 2, as.numeric) : NAs introduced by coercion
This recovers the numeric columns, but I lose the row names, and the s_id and flag variables are now all NA.
Are there any nice and relatively clean base R solutions?
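One direction that might work, offered as a sketch rather than a definitive answer: instead of forcing every column through as.numeric, let base R's type.convert() re-infer the type of each column separately. The names d_small and d_small_t below are hypothetical, a cut-down stand-in for d with the same layout.

```r
# Cut-down stand-in for d: rows are variables, columns are observations
d_small <- data.frame(
  VELQF = c("site_B", 29, TRUE),
  WZGHL = c("site_B", NA, FALSE),
  UAPVG = c("site_A", 70, FALSE),
  row.names = c("s_id", "x1", "flag"),
  stringsAsFactors = FALSE
)

# Transpose (this yields a character matrix) and rebuild a data frame
d_small_t <- as.data.frame(t(d_small), stringsAsFactors = FALSE)

# Re-infer each column's type; assigning into d_small_t[] keeps the
# data frame structure, including the row names
d_small_t[] <- lapply(d_small_t, type.convert, as.is = TRUE)

str(d_small_t)
```

Two caveats: whole-number columns come back as integer rather than double, and by default type.convert() treats the string "NA" as missing (na.strings = "NA"), so a literal "NA" label such as the s_id of YUDHX would become a real NA.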