Spread- why errors


#1

Many times I get the following error message when I use the spread function: “Error: Duplicate identifiers for rows” and inevitably have to use dcast. Why does dcast work and spread kick back errors when I use the same data set?


Recoding multiple columns using "long" data
#2

There’s a really nice explanation by @alistaire here (just a quick excerpt below)

spread doesn’t apply a function to combine multiple values (à la dcast), so rows must be indexed so there’s one or zero values for a location

https://stackoverflow.com/a/45898919/785190


#3

This would be a minimal example of code that raises this error.

library(dplyr)
library(tidyr)
wide <- data_frame(
  number = c(1:3, 1:3),
  value = letters[1:6])
wide
#> # A tibble: 6 x 2
#>   number value
#>    <int> <chr>
#> 1      1     a
#> 2      2     b
#> 3      3     c
#> 4      1     d
#> 5      2     e
#> 6      3     f

long <- wide %>% 
  gather(variable, value, -number)
long
#> # A tibble: 6 x 3
#>   number variable value
#>    <int>    <chr> <chr>
#> 1      1    value     a
#> 2      2    value     b
#> 3      3    value     c
#> 4      1    value     d
#> 5      2    value     e
#> 6      3    value     f

long %>% 
  spread(variable, value)
#> Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6)

One way to think about the problem is that there is more than one way to spread the data when there are repeated identifiers. For example, which of these is correct?

#> # A tibble: 6 x 2
#>   number value
#>    <int> <chr>
#> 1      1     a
#> 2      2     b
#> 3      3     c
#> 4      1     d
#> 5      2     e
#> 6      3     f

#> # A tibble: 6 x 2
#>   number value
#>    <int> <chr>
#> 1      1     d
#> 2      2     e
#> 3      3     f
#> 4      1     a
#> 5      2     b
#> 6      3     c

I’m not sure if this is the exact reason why rows must be uniquely identifiable, but it makes sense. tidyr is a little stricter in some ways than reshape2.

Fortunately, we can make rows uniquely identifiable by just numbering them with something like tibble::rowid_to_column().

long <- wide %>% 
  tibble::rowid_to_column() %>% 
  gather(variable, value, -number, -rowid)

long %>% 
  spread(variable, value)

#4

It may be worth putting in a pull request to tidyr to have a more friendly error message that states:

long %>% 
  spread(variable, value)
#> Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6). 
#> Consider adding a unique ID with tibble::rowid_to_column()

At it’s core, tidyr is assuming that your dataset is in 3NF, or third normal form where there is a unique identifier, or at least a combination of values that could form a unique identifier for each row(which could easily be created via a simple mutate call), but when it cannot find an identifier, it simply tells you it is not possible.


#5

…except that’s rarely the identifier column actually needed, as spreading afterwards just shifts values sideways and inserts a lot of NAs. What actually is needed seems to vary a lot depending on the data in question.

As a side note, I’d really like tibble::rownames_to_column to be able to run type.convert in case there are no names or row names are otherwise numeric. It wouldn’t be type-safe, but it could be put in as a convert = TRUE parameter like that of spread and gather. I’d write the PR if the maintainers would accept it, so I guess I should open an issue, but that’s a lot of typing and I’m tired.


#6

On its own, tibble::rowid_to_column() assigns a unique id to every single row, which actually cancels out most of the spreading. New columns are created, but there are as many rows as before, and there are more NAs than I want. Instead, I use group_by(...) %>% do(tibble::rowid_to_column(.)) %>% spread(...). That way, id’s are only unique where they disambiguate rows, and spread() seems to do what I want.

Also, in my situation, the following are equivalent and row order does not matter.

#> # A tibble: 6 x 2
#>   number value
#>    <int> <chr>
#> 1      1     a
#> 2      2     b
#> 3      3     c
#> 4      1     d
#> 5      2     e
#> 6      3     f

#> # A tibble: 6 x 2
#>   number value
#>    <int> <chr>
#> 1      1     d
#> 2      2     e
#> 3      3     f
#> 4      1     a
#> 5      2     b
#> 6      3     c

#7

I am not convinced that the current solution is acceptable. I find it very difficult to explain to newcomers why spread has difficulties spreading what gather has just gathered.

Lets take example from SO question @mara quoted above:

 library(tidyverse)
 df <- structure(list(age = c("21", "17", "32", "29", "15"), 
                        gender = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor")), 
                   row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("age", "gender"))
 df
#> # A tibble: 5 x 2
#>   age   gender
#>   <chr> <fct> 
#> 1 21    Male  
#> 2 17    Female
#> 3 32    Female
#> 4 29    Male  
#> 5 15    Male  

df %>% 
  spread(key=gender, value=age)
#> Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)

Ok, so for some strange reason tidyr needs another index to do the spread. Fine! Why can’t it intercept the call and do something along the lines of:

df %>% 
  group_by_at(vars(-age)) %>%  # group by everything other than the value column. 
  mutate(row_id=1:n()) %>% ungroup() %>%  # build group index
  spread(key=gender, value=age) %>%    # spread
  select(-row_id)  # drop the index

#> # A tibble: 3 x 2
#>   Female Male 
#>   <chr>  <chr>
#> 1 17     21   
#> 2 32     29   
#> 3 NA     15   

Am I missing something?


#8

Thanks so much dmi3k! Your solution for my tidyr “spread” problem worked like a champ on the data set I had in mind in my original October 2017 posting. Absolute keeper!!