What is the best way to pivot_wider() and summarise() at the same time?

Suppose I have this data:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

d <- data.frame(
  id = rep(1:5, each = 3),
  update = c(as.Date("2020-07-01") + 1:15),
  key = rep(c("a", "b", "c"), 5),
  val = 1:15
)

d
#>    id     update key val
#> 1   1 2020-07-02   a   1
#> 2   1 2020-07-03   b   2
#> 3   1 2020-07-04   c   3
#> 4   2 2020-07-05   a   4
#> 5   2 2020-07-06   b   5
#> 6   2 2020-07-07   c   6
#> 7   3 2020-07-08   a   7
#> 8   3 2020-07-09   b   8
#> 9   3 2020-07-10   c   9
#> 10  4 2020-07-11   a  10
#> 11  4 2020-07-12   b  11
#> 12  4 2020-07-13   c  12
#> 13  5 2020-07-14   a  13
#> 14  5 2020-07-15   b  14
#> 15  5 2020-07-16   c  15

and want to

  • pivot key to columns and
  • get the latest update of each id

at the same time.

I hoped pivot_wider() chops non-id_cols column into a list column, so that I can mutate() it later, but actually it doesn't (it just drops).

d %>% 
  pivot_wider(id,
    names_from = key,
    values_from = val
  ) %>%
  rowwise() %>%
  mutate(last_update = max(update), .keep = "unused")

# or simply this
d %>% 
  pivot_wider(id,
    names_from = key,
    values_from = val,
    values_fn = max
  )

Are there some nicer way to do this? Does it seem reasonable to file a feature request on tidyr's repo?

This is the expected result, and the best way I can come up with at the moment:

d %>% 
  # summarise `update` by using `mutate()`
  group_by(id) %>% 
  mutate(last_update = max(update)) %>% 
  ungroup() %>% 
  # choose `id` and `last_update` as key
  pivot_wider(c(id, last_update),
    names_from = key,
    values_from = val
  )
#> # A tibble: 5 x 5
#>      id last_update     a     b     c
#>   <int> <date>      <int> <int> <int>
#> 1     1 2020-07-04      1     2     3
#> 2     2 2020-07-07      4     5     6
#> 3     3 2020-07-10      7     8     9
#> 4     4 2020-07-13     10    11    12
#> 5     5 2020-07-16     13    14    15

Created on 2020-07-04 by the reprex package (v0.3.0)

close...

#list which columns to get the max of
d %>% 
  pivot_wider(id,
              names_from = key,
              values_from = val
  ) %>% summarise_at(c("a","b","c")
                     ,max)

# or 
# list which column not to get the max of
d %>% 
  pivot_wider(id,
              names_from = key,
              values_from = val
  ) %>% summarise_at(setdiff(names(.),"id")
                     ,max)

That doesn't do what the OP wants. He is asking for a way to arrive at the 5 x 5 tibble in his post with a single call to pivot_wider() or at least in a more elegant manner than having to first calculate the max value before calling pivot_wider().

@yutannihilation I couldn't come up with a better way of doing this besides the grouped mutate. You could try filing a feature request on the tidyr repository.

Thanks for pointing it out.

d %>% 
  pivot_wider(id,
              names_from = key,
              values_from = c(val,update),
              values_fn = max
  ) %>% mutate(last_update=pmax(!!!syms(names(.)[startsWith(names(.) ,
                                                            "update")]))) %>%
  select(-names(.)[startsWith(names(.) ,
                                      "update")])

     id val_a val_b val_c last_update
  <int> <int> <int> <int> <date>     
1     1     1     2     3 2020-07-04 
2     2     4     5     6 2020-07-07 
3     3     7     8     9 2020-07-10 
4     4    10    11    12 2020-07-13 
5     5    13    14    15 2020-07-16 

gnarly :laughing:

not general, but succinct:

d %>% 
  pivot_wider(id,
              names_from = key,
              values_from = c(val,update),
              values_fn = max
  ) %>% select(1:4,7)

Sorry, my post was not clear about which is the expected result. Thanks @siddharthprabhu for helping!

I ended up filing a feature request: https://github.com/tidyverse/tidyr/issues/990

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.