Spread- why errors

Many times I get the following error message when I use the spread function: "Error: Duplicate identifiers for rows" and inevitably have to use dcast. Why does dcast work and spread kick back errors when I use the same data set?

2 Likes

There's a really nice explanation by @alistaire here (just a quick excerpt below)

spread doesn't apply a function to combine multiple values (à la dcast), so rows must be indexed so there's one or zero values for a location

2 Likes

This would be a minimal example of code that raises this error.

library(dplyr)
library(tidyr)
wide <- data_frame(
  number = c(1:3, 1:3),
  value = letters[1:6])
wide
#> # A tibble: 6 x 2
#>   number value
#>    <int> <chr>
#> 1      1     a
#> 2      2     b
#> 3      3     c
#> 4      1     d
#> 5      2     e
#> 6      3     f

long <- wide %>% 
  gather(variable, value, -number)
long
#> # A tibble: 6 x 3
#>   number variable value
#>    <int>    <chr> <chr>
#> 1      1    value     a
#> 2      2    value     b
#> 3      3    value     c
#> 4      1    value     d
#> 5      2    value     e
#> 6      3    value     f

long %>% 
  spread(variable, value)
#> Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6)

One way to think about the problem is that there is more than one way to spread the data when there are repeated identifiers. For example, which of these is correct?

#> # A tibble: 6 x 2
#>   number value
#>    <int> <chr>
#> 1      1     a
#> 2      2     b
#> 3      3     c
#> 4      1     d
#> 5      2     e
#> 6      3     f

#> # A tibble: 6 x 2
#>   number value
#>    <int> <chr>
#> 1      1     d
#> 2      2     e
#> 3      3     f
#> 4      1     a
#> 5      2     b
#> 6      3     c

I'm not sure if this is the exact reason why rows must be uniquely identifiable, but it makes sense. tidyr is a little stricter in some ways than reshape2.

Fortunately, we can make rows uniquely identifiable by just numbering them with something like tibble::rowid_to_column().

long <- wide %>% 
  tibble::rowid_to_column() %>% 
  gather(variable, value, -number, -rowid)

long %>% 
  spread(variable, value)
2 Likes

It may be worth putting in a pull request to tidyr to have a more friendly error message that states:

long %>% 
  spread(variable, value)
#> Error: Duplicate identifiers for rows (1, 4), (2, 5), (3, 6). 
#> Consider adding a unique ID with tibble::rowid_to_column()

At it's core, tidyr is assuming that your dataset is in 3NF, or third normal form where there is a unique identifier, or at least a combination of values that could form a unique identifier for each row(which could easily be created via a simple mutate call), but when it cannot find an identifier, it simply tells you it is not possible.

2 Likes

...except that's rarely the identifier column actually needed, as spreading afterwards just shifts values sideways and inserts a lot of NAs. What actually is needed seems to vary a lot depending on the data in question.

As a side note, I'd really like tibble::rownames_to_column to be able to run type.convert in case there are no names or row names are otherwise numeric. It wouldn't be type-safe, but it could be put in as a convert = TRUE parameter like that of spread and gather. I'd write the PR if the maintainers would accept it, so I guess I should open an issue, but that's a lot of typing and I'm tired.

On its own, tibble::rowid_to_column() assigns a unique id to every single row, which actually cancels out most of the spreading. New columns are created, but there are as many rows as before, and there are more NAs than I want. Instead, I use group_by(...) %>% do(tibble::rowid_to_column(.)) %>% spread(...). That way, id's are only unique where they disambiguate rows, and spread() seems to do what I want.

Also, in my situation, the following are equivalent and row order does not matter.

#> # A tibble: 6 x 2
#>   number value
#>    <int> <chr>
#> 1      1     a
#> 2      2     b
#> 3      3     c
#> 4      1     d
#> 5      2     e
#> 6      3     f

#> # A tibble: 6 x 2
#>   number value
#>    <int> <chr>
#> 1      1     d
#> 2      2     e
#> 3      3     f
#> 4      1     a
#> 5      2     b
#> 6      3     c
3 Likes

I am not convinced that the current solution is acceptable. I find it very difficult to explain to newcomers why spread has difficulties spreading what gather has just gathered.

Lets take example from SO question @mara quoted above:

 library(tidyverse)
 df <- structure(list(age = c("21", "17", "32", "29", "15"), 
                        gender = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor")), 
                   row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("age", "gender"))
 df
#> # A tibble: 5 x 2
#>   age   gender
#>   <chr> <fct> 
#> 1 21    Male  
#> 2 17    Female
#> 3 32    Female
#> 4 29    Male  
#> 5 15    Male  

df %>% 
  spread(key=gender, value=age)
#> Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)

Ok, so for some strange reason tidyr needs another index to do the spread. Fine! Why can't it intercept the call and do something along the lines of:

df %>% 
  group_by_at(vars(-age)) %>%  # group by everything other than the value column. 
  mutate(row_id=1:n()) %>% ungroup() %>%  # build group index
  spread(key=gender, value=age) %>%    # spread
  select(-row_id)  # drop the index

#> # A tibble: 3 x 2
#>   Female Male 
#>   <chr>  <chr>
#> 1 17     21   
#> 2 32     29   
#> 3 NA     15   

Am I missing something?

6 Likes

Thanks so much dmi3k! Your solution for my tidyr "spread" problem worked like a champ on the data set I had in mind in my original October 2017 posting. Absolute keeper!!

1 Like

@dmi3k

Made an account just to say thank you for providing this work around!

1 Like

@dmi3k I also just made an account to thank you!

Yup, thank you. Came here from an error in scanstatistics::df_to_matrix which was using spread like this and both a) the super helpful minimal example and b) the solution were useful in identifying and fixing the problem. Somebody should really get that pull request to TidyR though

Very glad that you were able to get squared away. @dmi3k's solutuion was perfect. Have never had to worry about those nasty "Duplicate identifiers" since.

This isn't the default behavior because it assumes a significance to row order within groups that is not explicit in the data itself. (That's the row_id index it creates.) In this case it's pretty dubious there's a connection between 17 and 21 and thus that the rows in wide form mean anything at all. This data is better left in long form.

There are cases where the assumption may be good, but that means the data is not tidy, as there is a variable which is not explicit.