pivot longer with multiple observations per row

njtierney · July 22, 2020, 5:06am

Hi folks!

I have a question - I have toy some data I collected on whether people could accurately identify skittle colours below, and I think I've got an answer to tidying this, but it involves an extra step using rename.

I notes that it was quite similar to https://tidyr.tidyverse.org/articles/pivot.html#multiple-observations-per-row-1, but I couldn't quite get there.

I was wondering if anyone has some suggestions on how to improve this so I can do it all in one with pivot_longer ? I feel like I'm missing something.

Below is the reprex:

skittles_raw <- tibble::tribble(
  ~skittle_type, ~person_a, ~order_a, ~person_b, ~order_b, ~person_c, ~order_c,
              1,  "purple",        4,     "red",        9,  "purple",        6,
              1,     "red",       10,   "green",       10,     "red",       10,
              2,  "yellow",        2,  "orange",        4,  "orange",        1,
              2,  "orange",        5,  "orange",        8,   "green",        9,
              3,  "orange",        3,  "orange",        2,  "orange",        2,
              3,     "red",        7,  "yellow",        7,  "yellow",        5,
              4,  "purple",        8,   "green",        1,     "red",        3,
              4,  "yellow",        9,     "red",        3,  "yellow",        4,
              5,  "purple",        1,  "purple",        5,  "purple",        7,
              5,   "green",        6,  "purple",        6,  "purple",        8
  )

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

skittles_raw_not_quite <- skittles_raw %>% 
  pivot_longer(cols = -skittle_type,
               names_to = c(".value", "person"),
               names_sep = "_",
               values_to = "choice",
               values_drop_na = TRUE,
               names_repair = "unique")
#> New names:
#> * person -> person...2
#> * person -> person...3

skittles_raw_not_quite
#> # A tibble: 30 x 4
#>    skittle_type person...2 person...3 order
#>           <dbl> <chr>      <chr>      <dbl>
#>  1            1 a          purple         4
#>  2            1 b          red            9
#>  3            1 c          purple         6
#>  4            1 a          red           10
#>  5            1 b          green         10
#>  6            1 c          red           10
#>  7            2 a          yellow         2
#>  8            2 b          orange         4
#>  9            2 c          orange         1
#> 10            2 a          orange         5
#> # … with 20 more rows

# I get what I want when I do this:
skittles_raw_not_quite %>% 
  rename(person = person...2,
         choice = person...3)
#> # A tibble: 30 x 4
#>    skittle_type person choice order
#>           <dbl> <chr>  <chr>  <dbl>
#>  1            1 a      purple     4
#>  2            1 b      red        9
#>  3            1 c      purple     6
#>  4            1 a      red       10
#>  5            1 b      green     10
#>  6            1 c      red       10
#>  7            2 a      yellow     2
#>  8            2 b      orange     4
#>  9            2 c      orange     1
#> 10            2 a      orange     5
#> # … with 20 more rows

# but how do I get there without this extra step?

^{Created on 2020-07-22 by the reprex package (v0.3.0)}

Also available at https://gist.github.com/njtierney/b40724d6a00a7a857d9fd641f9a9b568/edit

siddharthprabhu · July 22, 2020, 6:15am

I think the rename() is unavoidable since pivot_longer() can only construct variable names from those already present in the data (and there is nothing named choice). Your code would work perfectly if the variable names were choice_a, choice_b instead of person_a, person_b etc.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

ideal_skittles_raw <- tribble(
  ~skittle_type, ~choice_a, ~order_a, ~choice_b, ~order_b, ~choice_c, ~order_c,
  1, "purple", 4, "red", 9, "purple", 6,
  1, "red", 10, "green", 10, "red", 10,
  2, "yellow", 2, "orange", 4, "orange", 1,
  2, "orange", 5, "orange", 8, "green", 9,
  3, "orange", 3, "orange", 2, "orange", 2,
  3, "red", 7, "yellow", 7, "yellow", 5,
  4, "purple", 8, "green", 1, "red", 3,
  4, "yellow", 9, "red", 3, "yellow", 4,
  5, "purple", 1, "purple", 5, "purple", 7,
  5, "green", 6, "purple", 6, "purple", 8
)

ideal_skittles_raw %>%
  pivot_longer(
    cols = -skittle_type,
    names_to = c(".value", "person"),
    names_sep = "_",
    values_to = "choice",
    values_drop_na = TRUE,
    names_repair = "unique"
  )
#> # A tibble: 30 x 4
#>    skittle_type person choice order
#>           <dbl> <chr>  <chr>  <dbl>
#>  1            1 a      purple     4
#>  2            1 b      red        9
#>  3            1 c      purple     6
#>  4            1 a      red       10
#>  5            1 b      green     10
#>  6            1 c      red       10
#>  7            2 a      yellow     2
#>  8            2 b      orange     4
#>  9            2 c      orange     1
#> 10            2 a      orange     5
#> # ... with 20 more rows

^{Created on 2020-07-22 by the reprex package (v0.3.0)}

I could be wrong of course and would love to see a way of achieving this in a single step.

enixam · July 22, 2020, 6:20am

As an aside, I believe values_to is reduntdant when you have .values in names_to

njtierney · July 22, 2020, 6:36am

You're totally right! values_to is ignored when a sentinal value ".value" is included.

mgirlich · July 22, 2020, 6:37am

The alternative would be to directly create the spec resp. modify it. With the spec one can also easily see that the column names person_x are not really correct and should be choice_x

skittles_raw <- tibble::tribble(
  ~skittle_type, ~person_a, ~order_a, ~person_b, ~order_b, ~person_c, ~order_c,
              1,  "purple",        4,     "red",        9,  "purple",        6,
              1,     "red",       10,   "green",       10,     "red",       10,
              2,  "yellow",        2,  "orange",        4,  "orange",        1,
              2,  "orange",        5,  "orange",        8,   "green",        9,
              3,  "orange",        3,  "orange",        2,  "orange",        2,
              3,     "red",        7,  "yellow",        7,  "yellow",        5,
              4,  "purple",        8,   "green",        1,     "red",        3,
              4,  "yellow",        9,     "red",        3,  "yellow",        4,
              5,  "purple",        1,  "purple",        5,  "purple",        7,
              5,   "green",        6,  "purple",        6,  "purple",        8
  )

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

spec <- tribble(
  ~ .name,    ~ .value, ~ person,
  "person_a", "choice", "a",
  "order_a", "order", "a",
  "person_b", "choice", "b",
  "order_b", "order", "b",
  "person_c", "choice", "c",
  "order_c", "order", "c"
)

skittles_raw %>% 
  pivot_longer_spec(spec, names_repair = "unique")
#> # A tibble: 30 x 4
#>    skittle_type person choice order
#>           <dbl> <chr>  <chr>  <dbl>
#>  1            1 a      purple     4
#>  2            1 b      red        9
#>  3            1 c      purple     6
#>  4            1 a      red       10
#>  5            1 b      green     10
#>  6            1 c      red       10
#>  7            2 a      yellow     2
#>  8            2 b      orange     4
#>  9            2 c      orange     1
#> 10            2 a      orange     5
#> # … with 20 more rows

^{Created on 2020-07-22 by the reprex package (v0.3.0)}

njtierney · July 22, 2020, 6:42am

Oh nice! I haven't really used spec before, this feels like a nice example usecase, thanks for taking the time to write it up

With the spec one can also easily see that the column names person_x are not really correct and should be choice_x

Interesting point - at the time I recorded the data I decided this "would do for now", and for sure it wasn't ideal, but I'm actually not sure I follow when you say that they should instead be choice_x over person_x, functionally, what is the difference?

siddharthprabhu · July 22, 2020, 7:04am

I think the point is that the variable name person_x doesn't accurately describe its contents; it doesn't contain the names of people (which are a, b and c in this case). It could be thought of as "the choice of person x".

njtierney · July 22, 2020, 8:16am

Indeed, but I guess what I mean is that functionally, there isn't a reason why the code would/wouldn't work if it were named "choice" over "person", right?

siddharthprabhu · July 22, 2020, 8:41am

It does make a difference. You can see it in the spec itself. The contents of .value are derived from the part in .name that precedes _. Since the variable is named person_x, its corresponding .value is person and you need to manually change it to choice. This would happen automatically if the variables were named choice_x. Hope that explains.

njtierney · July 23, 2020, 8:10am

Right you are! I totally get it now. Thanks for explaining

The spec is pretty amazing, that is a nice insight to get from it.

To go full round trip here - changing person_x to choice_x solves the problem, and allows me to write pivot_longer code that is perhaps a bit easier to follow than using spec.

skittles_raw <- tibble::tribble(
  ~skittle_type, ~choice_a, ~order_a, ~choice_b, ~order_b, ~choice_c, ~order_c,
              1,  "purple",        4,     "red",        9,  "purple",        6,
              1,     "red",       10,   "green",       10,     "red",       10,
              2,  "yellow",        2,  "orange",        4,  "orange",        1,
              2,  "orange",        5,  "orange",        8,   "green",        9,
              3,  "orange",        3,  "orange",        2,  "orange",        2,
              3,     "red",        7,  "yellow",        7,  "yellow",        5,
              4,  "purple",        8,   "green",        1,     "red",        3,
              4,  "yellow",        9,     "red",        3,  "yellow",        4,
              5,  "purple",        1,  "purple",        5,  "purple",        7,
              5,   "green",        6,  "purple",        6,  "purple",        8
  )

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

skittles_raw %>% 
pivot_longer(cols = -skittle_type,
             names_to = c(".value", "person"),
             names_sep = "_")
#> # A tibble: 30 x 4
#>    skittle_type person choice order
#>           <dbl> <chr>  <chr>  <dbl>
#>  1            1 a      purple     4
#>  2            1 b      red        9
#>  3            1 c      purple     6
#>  4            1 a      red       10
#>  5            1 b      green     10
#>  6            1 c      red       10
#>  7            2 a      yellow     2
#>  8            2 b      orange     4
#>  9            2 c      orange     1
#> 10            2 a      orange     5
#> # … with 20 more rows

^{Created on 2020-07-23 by the reprex package (v0.3.0)}

system · July 30, 2020, 8:10am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.