Plyr::mdply equivalent

baptiste · September 14, 2017, 8:04pm

Consider the following problem: I have a function with multiple scalar parameters, and I wish to visualise their influence on the result (a data.frame). My favourite construct for this is plyr::mdply, as illustrated below:

weibull <- function(alpha, lambda, time){
  data.frame(time = time, value = alpha*lambda*(time^(alpha-1)))
}

library(plyr)
library(ggplot2)

params <- expand.grid(lambda = c(1, 2, 4, 8, 16), alpha = c(0.5, 1))

all <- mdply(params, weibull, time = seq(0, 2, length=100))

ggplot(all, aes(time, value, colour=factor(lambda)))+
  facet_wrap(~alpha,scales="free", ncol=2) + geom_line()

It's very concise and expressive. Now, have not found the equivalent one-step solution in the tidyverse.

My current strategy is the following,

weibull <- function(alpha, lambda, time){
  data.frame(time = time, value = alpha*lambda*(time^(alpha-1)))
}

library(ggplot2)
library(tidyverse)

params <- tidyr::crossing(lambda = c(1, 2, 4, 8, 16), alpha = c(0.5, 1))


params %>% 
  dplyr::mutate(purrr::pmap(., .f = weibull,  time = seq(0, 2, length=100))) %>% 
  tidyr::unnest() %>% 
  ggplot(aes(time, value, colour=factor(lambda)))+
    facet_wrap(~alpha,scales="free", ncol=2) + geom_line()

I was hoping initially something like params %>% purrr::pmap_df(.f = weibull, time = seq(0, 2, length=100)) would do the job, but it does not keep track of the original parameters.

Another option might be to use the .id variable to merge the original params with the results, but that's a non-trivial extra step.

Emilhvitfeldt · September 15, 2017, 12:25am

The package purrrlyr have a function that gives a pmap equivalent to mdply. Thus giving you the following code

library(tidyverse)
library(purrrlyr)

weibull <- function(alpha, lambda, time){
  data.frame(time = time, value = alpha*lambda*(time^(alpha-1)))
}

params <- tidyr::crossing(lambda = c(1, 2, 4, 8, 16), alpha = c(0.5, 1))

params %>% purrrlyr::invoke_rows(.f = weibull, time = seq(0, 2, length=100), 
                                 .collate = "rows") %>% 
  ggplot(aes(time, value, colour=factor(lambda)))+
    facet_wrap(~alpha,scales="free", ncol=2) + geom_line()

alistaire · September 15, 2017, 12:59am

If you just want one column like this, it's easy enough to subset inside mutate:

library(tidyverse)

weibull <- function(alpha, lambda, time){
    data.frame(time = time, 
               value = alpha * lambda * (time ^ (alpha - 1)))
}

crossing(lambda = c(1, 2, 4, 8, 16), 
         alpha = c(0.5, 1), 
         time = seq(0, 2, length = 100)) %>% 
    mutate(value = pmap_df(., weibull)$value) %>% 
    ggplot(aes(time, value, colour = factor(lambda))) +
    facet_wrap(~alpha, scales = "free", ncol = 2) + 
    geom_line()

...but that's not a general solution to the problem. To preserve all the data you could wrap the pmap_df in bind_cols:

crossing(lambda = c(1, 2, 4, 8, 16), 
         alpha = c(0.5, 1), 
         time = seq(0, 2, length = 100)) %>% 
    bind_cols(pmap_df(., weibull))
#> # A tibble: 1,000 x 5
#>    lambda alpha       time      time1    value
#>     <dbl> <dbl>      <dbl>      <dbl>    <dbl>
#>  1      1   0.5 0.00000000 0.00000000      Inf
#>  2      1   0.5 0.02020202 0.02020202 3.517812
#>  3      1   0.5 0.04040404 0.04040404 2.487469
#>  4      1   0.5 0.06060606 0.06060606 2.031010
#>  5      1   0.5 0.08080808 0.08080808 1.758906
#>  6      1   0.5 0.10101010 0.10101010 1.573213
#>  7      1   0.5 0.12121212 0.12121212 1.436141
#>  8      1   0.5 0.14141414 0.14141414 1.329608
#>  9      1   0.5 0.16161616 0.16161616 1.243734
#> 10      1   0.5 0.18181818 0.18181818 1.172604
#> # ... with 990 more rows

...but that seems fragile, even though logically I'm not sure how it would fail.

baptiste · September 15, 2017, 1:03am

Thanks, but it appears to be deprecated

github.com

hadley/purrrlyr/blob/eef6685db8841a246da28f0eb5e3ca1e9536cb8a/R/rows.R#L225


}
if (length(.d) == 0) {
  stop("Data frame is empty", call. = FALSE)
}
}


#' @rdname by_row
#' @export
invoke_rows <- function(.f, .d, ..., .collate = c("list", "rows", "cols"),
                      .to = ".out", .labels = TRUE) {
deprecate("`invoke_rows()` is deprecated; please use `pmap()` instead.")
check_df_consistency(.d)
.collate <- match.arg(.collate)


.unique_labels <- 0
.labels_cols <- .d
.slicing_cols <- .d


.Call(invoke_rows_impl, environment(), ".d", ".f")
}

baptiste · September 15, 2017, 1:07am

Thanks – can you explain what happens in bind_cols(pmap_df(., weibull))?

I'm not quite sure how to follow the logic when the pipe is combined with function composition like this; how do I make sense of what does bind_cols bind together?

alistaire · September 15, 2017, 1:22am

It's piping in the data.frame resulting from crossing, which is being cbinded to the results of that same data.frame being passed to weibull row-by-row and bound into a data.frame by pmap_df.

That's a little dense, but writing it without pipes may make it somewhat clearer:

crossed_params <- crossing(lambda = c(1, 2, 4, 8, 16), 
                           alpha = c(0.5, 1), 
                           time = seq(0, 2, length = 100))

bind_cols(crossed_params, 
          pmap_df(crossed_params, weibull))

where the pmap_df call returns the rbinded results of the weibull call for each row.

baptiste · September 15, 2017, 2:27am

Thanks, but it's still not transparent to me how this works (to be more precise: I see how it works, but I wonder if is it reliable, by design, and if it will still work in the future? – as opposed to a happy accident).

?bin_cols states

"When column-binding, rows are matched by position, not value so all data frames must have the same number of rows."

so I guess it's considering the results of pmap_df as a list of nested data.frames, but it's not obvious (because pmap_df by itself collapses the result to one data.frame, with many more rows than params).

alistaire · September 15, 2017, 3:10am

pmap_df returns a single unnested data.frame (pmap would return a list of data.frames, which could be added as a list column like the final option in the original post, but not passed to bind_cols). Because weibull returns one row per call, the result of pmap_df will have the same number of rows as the parameter data.frame (which must be fully expanded beforehand), and so can be cbinded to it, provided pmap_df doesn't alter row order (which it shouldn't).

So given a function that returns a one-row data.frame or a list where each elements is length 1, this approach should be reliable. That said, I'm with you as I mentioned:

Column binding anything always feels fragile, and I'd always rather have a solution which unambiguously preserves row relationships throughout. This is not that solution; the list-column approach is more robust in that way, if more verbose.

baptiste · September 15, 2017, 3:20am

Oops, I didn't realise you'd changed params to include the time variable! This is definitely not an ideal solution if weibull (in this example) is vectorised over the time variable. I didn't make this clear in the original (dummy) example, but this is what distinguishes time from lambda and alpha.

alistaire · September 15, 2017, 3:38am

Sorry, misunderstood! In that case the Hadley-approved idiom is list columns, despite my awkward pleas.

baptiste · September 15, 2017, 4:00am

Hmm, now I'm curious because I once posted a similar-looking (to me, at least) mutate(pmap(...)) %>% unnest() attempt on twitter and Hadley commented it was not tidyverse-like. Now I'm really not sure when those list-columns are considered good or bad practice.