What's the best way to convert vectors to one-row data frames for use with unnest?

I frequently find myself wanting to unnest list-columns that contain vectors, because the vector elements should really be their own columns. When we iterate with lapply or map, we often end up with a function that returns a vector, as quantile does below. We could imagine iterating over many vectors drawn from distributions with different parameters and computing quantiles for each. However, to get multiple columns out of unnest, each list element needs to be a one-row data frame. The most "obvious" way of doing it with tidyverse functions that I could see was enframe followed by spread, since enframe is supposed to be the standard function for creating a tibble from a vector. However, spread is not fast, and the cost of calling it once per row adds up quickly.
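To make the pattern concrete, here is a minimal sketch of the workflow described above: build a list-column of one-row data frames, then unnest it. The helper name vec_to_row and the column names sigma and q are made up for illustration, and the matrix route is just one of the conversions benchmarked below, not a recommendation.

```r
library(tibble)
library(tidyr)

probs <- c(0.05, 0.5, 0.95)

# Hypothetical helper: named vector -> one-row data frame via a 1 x n matrix.
vec_to_row <- function(v) {
  as.data.frame(matrix(v, nrow = 1, dimnames = list(NULL, names(v))))
}

set.seed(1)
sims <- tibble(
  sigma = c(1, 2, 3),
  # One one-row data frame of quantiles per value of sigma.
  q = lapply(sigma, function(s) vec_to_row(quantile(rnorm(1000, sd = s), probs)))
)

# Each one-row data frame becomes a set of columns next to sigma.
wide <- unnest(sims, q)
names(wide)
```

The result should have one row per value of sigma and one column per requested quantile.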

Here I benchmarked a few different alternatives that I could think of, mostly running through matrix. I'm not the best at profiling and am not too sure why the saving of one names<- call gets such a boost, but all of these options are much, much faster than the seemingly "neat" method using enframe.

The question is: Am I missing some other method that would be faster?
The discussion part is: Should this operation be made easier, or approached in some other manner?

set.seed(1)
named_vec <- quantile(rnorm(1000), c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95))
named_vec
#>          5%         10%         25%         50%         75%         90% 
#> -1.72695999 -1.33933368 -0.69737322 -0.03532423  0.68842795  1.32402975 
#>         95% 
#>  1.74398317

library(tidyverse)
bench::mark(
  enframe(named_vec) %>% spread(name, value),
  as_tibble(matrix(named_vec, nrow = 1, dimnames = list(NULL, names(named_vec)))),
  data.frame(matrix(named_vec, nrow = 1)) %>% `names<-`(names(named_vec)),
  as.data.frame(matrix(named_vec, nrow = 1)) %>% `names<-`(names(named_vec)),
  as.data.frame(matrix(named_vec, nrow = 1, dimnames = list(NULL, names(named_vec))))
)
#> # A tibble: 5 x 10
#>   expression      min     mean   median      max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
#> 1 enframe(n…   1.36ms   1.56ms   1.52ms   2.15ms      640.     634KB    10
#> 2 as_tibble… 262.67µs  300.5µs 287.38µs  663.3µs     3328.        0B    13
#> 3 data.fram…  138.9µs 166.06µs  162.3µs 402.71µs     6022.      280B    11
#> 4 as.data.f…  73.77µs  86.93µs  84.42µs 313.09µs    11503.      280B    14
#> 5 as.data.f…  16.22µs  19.55µs  18.92µs 120.47µs    51151.        0B     8
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>

Created on 2019-04-25 by the reprex package (v0.2.1)

Welcome to the community!

I don't know the answer to your question, but I'd like to add another option, as.data.frame(t(named_vec)), which seems the most obvious one to me, so that people who do know the answer can consider it too. As you can see below, it's not the fastest, but it's close enough to the fastest option to be a reasonable alternative, and IMHO it's "neat" and "intuitive".

set.seed(1)

named_vec <- quantile(x = rnorm(1000), probs = c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95))

library(tidyverse)

bench::mark(Enframe = enframe(x = named_vec) %>% spread(key = name, value = value),
            Tibble = as_tibble(x = matrix(data = named_vec, nrow = 1, dimnames = list(NULL, names(x = named_vec)))),
            DataFrame = data.frame(matrix(data = named_vec, nrow = 1)) %>% `names<-`(names(x = named_vec)),
            AsDataFrameNames = as.data.frame(matrix(data = named_vec, nrow = 1)) %>% `names<-`(names(x = named_vec)),
            AsDataFrameDimnames = as.data.frame(matrix(data = named_vec, nrow = 1, dimnames = list(NULL, names(x = named_vec)))),
            AsDataFrameTranspose = as.data.frame(t(x = named_vec)))
#> # A tibble: 6 x 10
#>   expression      min     mean   median      max
#>   <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>
#> 1 Enframe      1.43ms   1.51ms    1.5ms   5.49ms
#> 2 Tibble     252.05µs 291.99µs  285.7µs   8.32ms
#> 3 DataFrame  119.48µs 134.28µs  130.9µs   4.32ms
#> 4 AsDataFra…  61.78µs  69.17µs   67.3µs   4.09ms
#> 5 AsDataFra…  14.58µs  16.75µs   16.4µs  60.23µs
#> 6 AsDataFra…  14.64µs  17.38µs     16µs   4.04ms
#> # … with 5 more variables: `itr/sec` <dbl>, mem_alloc <bch:byt>,
#> #   n_gc <dbl>, n_itr <int>, total_time <bch:tm>

(I generated this using RStudio Cloud.)
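For anyone wondering why the transpose version works at all, here is a small base R sketch. t() treats a plain vector as a column and returns a 1-row matrix, carrying the vector's names over as column names, so as.data.frame() has nothing left to do but wrap it. The values in named_vec here are made up for illustration.

```r
# Illustrative named vector (made-up values, same shape as quantile() output).
named_vec <- c(`5%` = -1.7, `50%` = 0, `95%` = 1.7)

m <- t(named_vec)
dim(m)       # 1 3: a single row, one column per element
colnames(m)  # the original names, unchanged

# as.data.frame() on a matrix keeps the column names as they are.
row_df <- as.data.frame(m)
names(row_df)
```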


I'm not sure about the speed, but the development version of tidyr has functions unnest_longer()/unnest_wider() that appear to do what you're after.

From the NEWS:

New unnest_longer() and unnest_wider() make it easier to unnest list-columns of vectors into either rows or columns (#418)

Given a list column of vectors like you described, you could use unnest_wider() to unnest them into a wide format.

quants <- tibble(x = list(named_vec, named_vec))
unnest_wider(quants, x)
# A tibble: 2 x 7
   `5%` `10%`  `25%`   `50%` `75%` `90%` `95%`
  <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl>
1 -1.73 -1.34 -0.697 -0.0353 0.688  1.32  1.74
2 -1.73 -1.34 -0.697 -0.0353 0.688  1.32  1.74
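Here is a slightly fuller sketch of how that might look in the scenario from the original question, with different distribution parameters per row. It assumes a tidyr version that includes unnest_wider() (1.0.0 or later); the column names mean, sd and q are made up for illustration.

```r
library(tibble)
library(tidyr)

probs <- c(0.05, 0.5, 0.95)
params <- tibble(mean = c(0, 5), sd = c(1, 2))

set.seed(1)
# List-column of named numeric vectors, one per parameter combination.
params$q <- Map(
  function(m, s) quantile(rnorm(1000, mean = m, sd = s), probs),
  params$mean, params$sd
)

# No conversion to one-row data frames needed: the vector names become columns.
wide <- unnest_wider(params, q)
names(wide)
```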

I'm not familiar with purrr, but in base R, vapply can combine the looping and combining steps:

samples <- replicate(5, runif(1000), simplify = FALSE)
mydata <- data.frame(s = I(samples))
mydata
#              s
# 1 0.414275....
# 2 0.156483....
# 3 0.992391....
# 4 0.432951....
# 5 0.302785....

quants <- vapply(
  X = mydata[["s"]],
  FUN = quantile,
  FUN.VALUE = numeric(7),
  probs = c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95)
)
quants
#           [,1]       [,2]       [,3]       [,4]       [,5]
# 5%  0.04082008 0.04507124 0.05049143 0.04776495 0.04590253
# 10% 0.09333178 0.10545829 0.10458423 0.09645673 0.08284631
# 25% 0.23145141 0.24900141 0.25052316 0.24397599 0.22676617
# 50% 0.50762321 0.52104472 0.49992605 0.51110439 0.48699883
# 75% 0.75176868 0.76793539 0.75177977 0.76025279 0.75084742
# 90% 0.91539669 0.90923986 0.90348624 0.90909336 0.89700135
# 95% 0.96388973 0.95451164 0.95004418 0.95091292 0.94022864

mydata[rownames(quants)] <- as.data.frame(t(quants))
mydata
#              s         5%        10%       25%       50%       75%       90%       95%
# 1 0.414275.... 0.04082008 0.09333178 0.2314514 0.5076232 0.7517687 0.9153967 0.9638897
# 2 0.156483.... 0.04507124 0.10545829 0.2490014 0.5210447 0.7679354 0.9092399 0.9545116
# 3 0.992391.... 0.05049143 0.10458423 0.2505232 0.4999260 0.7517798 0.9034862 0.9500442
# 4 0.432951.... 0.04776495 0.09645673 0.2439760 0.5111044 0.7602528 0.9090934 0.9509129
# 5 0.302785.... 0.04590253 0.08284631 0.2267662 0.4869988 0.7508474 0.8970013 0.9402286
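If this comes up often, the steps above could be wrapped in a small function. This is only a sketch: add_quantiles is a hypothetical helper name, and it assumes at least two probabilities, because with a single one vapply returns a plain vector rather than a matrix.

```r
# Hypothetical helper: append one quantile column per entry of `probs`.
# Assumes length(probs) >= 2 so that vapply() returns a matrix.
add_quantiles <- function(df, col, probs) {
  q <- vapply(
    X = df[[col]],
    FUN = quantile,
    FUN.VALUE = numeric(length(probs)),
    probs = probs
  )
  # q is length(probs) x nrow(df); transpose so its rows line up with df.
  df[rownames(q)] <- as.data.frame(t(q))
  df
}

set.seed(1)
d <- data.frame(s = I(replicate(3, runif(100), simplify = FALSE)))
out <- add_quantiles(d, "s", c(0.25, 0.5, 0.75))
names(out)
```

FUN.VALUE doubles as a check here: vapply stops with an error if quantile ever returns something that is not a numeric vector of the expected length.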
