tidyr 1.2.0

This is a companion discussion topic for the original entry at https://www.tidyverse.org/blog/2022/02/tidyr-1-2-0


We’re chuffed to announce the release of tidyr 1.2.0. tidyr provides a set of tools for transforming data frames to and from tidy data, where each variable is a column and each observation is a row. Tidy data is a convention for matching the semantics and structure of your data that makes using the rest of the tidyverse (and many other R packages) much easier.

You can install it from CRAN with:

install.packages("tidyr")

This blog post will go over the main new features, which include four new arguments to pivot_wider(), the ability to unnest multiple columns at once in unnest_wider() and unnest_longer(), an enhanced complete() function, and some updates to our tools for handling missing values.

You can see a full list of changes in the release notes, where you’ll also find details on the ~50 bugs that were fixed in this release!

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

New author

First off, we are very excited to welcome Maximilian Girlich as a new tidyr author in recognition of his significant and sustained contributions. In particular, he played a large part in speeding up a number of core functions, including: unchop(), unnest(), unnest_wider(), and unnest_longer(). Additionally, he provided proof-of-concept implementations for a few new features, like the unused_fn argument to pivot_wider() discussed below.

Pivoting

Value expansion

pivot_wider() has gained two new arguments related to the expansion of values. These arguments are similar to drop = FALSE from spread(), but are a bit more fine grained. As you’ll see, these are mostly useful when you have factors in either names_from or id_cols and want to ensure that all of the factor levels are retained.

weekdays <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

daily <- tibble(
day = factor(c("Tue", "Thu", "Fri", "Mon"), levels = weekdays),
value = c(2, 3, 1, 5)
)

daily
#> # A tibble: 4 × 2
#> day value
#>
#> 1 Tue 2
#> 2 Thu 3
#> 3 Fri 1
#> 4 Mon 5

Imagine you’d like to pivot the values from day into columns, filling the cells with value. By default, pivot_wider() only generates columns from the data that is actually there, and will retain the ordering that was present in the data.

pivot_wider(daily, names_from = day, values_from = value)
#> # A tibble: 1 × 4
#>     Tue   Thu   Fri   Mon
#>      
#> 1     2     3     1     5

When you know the full set of possible values and have encoded them as factor levels (as we have done here), you might want to retain those levels in the pivot, even if there isn’t any data. Additionally, it would probably be nice if they were sorted to match the levels found in the factor. The new names_expand argument handles both of these.

pivot_wider(daily, names_from = day, values_from = value, names_expand = TRUE)
#> # A tibble: 1 × 7
#>     Mon   Tue   Wed   Thu   Fri   Sat   Sun
#>         
#> 1     5     2    NA     3     1    NA    NA

A related problem can occur when there are implicit missing factor levels in the id_cols. When this happens, there are missing rows (rather than columns) that you’d like to explicitly represent. To demonstrate, we’ll modify daily with a type column, and pivot on that instead, keeping day as an identifier column.

daily <- daily %>%
  mutate(type = c("A", "B", "B", "A"))

daily
#> # A tibble: 4 × 3
#> day value type
#>
#> 1 Tue 2 A
#> 2 Thu 3 B
#> 3 Fri 1 B
#> 4 Mon 5 A

In the pivot below, we are missing some rows corresponding to the missing factor levels of day. Again, by default pivot_wider() will only use data that already exists in the id_cols.

pivot_wider(
  daily, 
  names_from = type, 
  values_from = value
)
#> # A tibble: 4 × 3
#>   day       A     B
#>     
#> 1 Tue       2    NA
#> 2 Thu      NA     3
#> 3 Fri      NA     1
#> 4 Mon       5    NA

To explicitly expand (and sort) these missing rows, we can use id_expand, which works much the same way as names_expand. We will also go ahead and fill the unrepresented values with zeros.

pivot_wider(
  daily, 
  id_expand = TRUE,
  names_from = type, 
  values_from = value,
  values_fill = 0
)
#> # A tibble: 7 × 3
#>   day       A     B
#>     
#> 1 Mon       5     0
#> 2 Tue       2     0
#> 3 Wed       0     0
#> 4 Thu       0     3
#> 5 Fri       0     1
#> 6 Sat       0     0
#> 7 Sun       0     0

Varying names

When you specify multiple values_from columns, the resulting column names that get generated from the combination of names_from values and values_from names default to varying the names_from values fastest. This means that all of the columns related to the first values_from column will be at the front, followed by the columns related to the second values_from column, and so on. For example, if we wanted to flatten daily all the way out to a single row by specifying values_from = c(value, type), then we would end up with all the columns related to value followed by those related to type.

pivot_wider(
  daily,
  names_from = day,
  values_from = c(value, type),
  names_expand = TRUE
)
#> # A tibble: 1 × 14
#>   value_Mon value_Tue value_Wed value_Thu value_Fri value_Sat value_Sun type_Mon
#>                                         
#> 1         5         2        NA         3         1        NA        NA A       
#> # … with 6 more variables: type_Tue , type_Wed , type_Thu ,
#> #   type_Fri , type_Sat , type_Sun 

Depending on your data, you might instead want to group all of the columns related to a particular names_from value together. In this example, that would mean grouping all of the columns related to Monday together, followed by Tuesday, Wednesday, etc. You can accomplish this with the new names_vary argument, which allows you to vary the names_from values slowest.

pivot_wider(
  daily,
  names_from = day,
  values_from = c(value, type),
  names_expand = TRUE,
  names_vary = "slowest"
)
#> # A tibble: 1 × 14
#>   value_Mon type_Mon value_Tue type_Tue value_Wed type_Wed value_Thu type_Thu
#>                                      
#> 1         5 A                2 A               NA NA               3 B       
#> # … with 6 more variables: value_Fri , type_Fri , value_Sat ,
#> #   type_Sat , value_Sun , type_Sun 

Unused columns

Occasionally you’ll find yourself in a situation where you have columns in your data that are unrelated to the pivoting process itself, but you’d still like to retain some information about them. Consider this data set that records values returned by various systems across multiple counties.

readouts <- tibble(
  county = c("Wake", "Wake", "Wake", "Guilford", "Guilford"),
  date = c(as.Date("2020-01-01") + 0:2, as.Date("2020-01-03") + 0:1),
  system = c("A", "B", "C", "A", "C"),
  value = c(3.2, 4, 5.5, 2, 1.2)
)

readouts
#> # A tibble: 5 × 4
#> county date system value
#>
#> 1 Wake 2020-01-01 A 3.2
#> 2 Wake 2020-01-02 B 4
#> 3 Wake 2020-01-03 C 5.5
#> 4 Guilford 2020-01-03 A 2
#> 5 Guilford 2020-01-04 C 1.2

You might want to pivot this into a view containing one row per county, with the system types across the columns. You might do something like:

pivot_wider(
  readouts,
  id_cols = county,
  names_from = system,
  values_from = value
)
#> # A tibble: 2 × 4
#>   county       A     B     C
#>         
#> 1 Wake       3.2     4   5.5
#> 2 Guilford   2      NA   1.2

This worked, but in the process we’ve lost all of the information from the date column about when the values were recorded. To fix this, we can use the new unused_fn argument to retain a summary of the unused date column. In our case, we’ll retain the most recent date a value was recorded across all systems.

pivot_wider(
  readouts,
  id_cols = county,
  names_from = system,
  values_from = value,
  unused_fn = list(date = max)
)
#> # A tibble: 2 × 5
#>   county       A     B     C date      
#>              
#> 1 Wake       3.2     4   5.5 2020-01-03
#> 2 Guilford   2      NA   1.2 2020-01-04

If you want to retain the unused columns but delay the summarization entirely, you can use list() to wrap up the value into a list column.

pivot_wider(
  readouts,
  id_cols = county,
  names_from = system,
  values_from = value,
  unused_fn = list
)
#> # A tibble: 2 × 5
#>   county       A     B     C date      
#>              
#> 1 Wake       3.2     4   5.5 
#> 2 Guilford   2      NA   1.2 

Note that for unused_fn to work, you must supply id_cols explicitly, as otherwise all of the remaining columns are assumed to be id_cols.

More informative errors

We’ve improved on a number of the error messages throughout tidyr, but the error you get from pivot_wider() when you encounter values that aren’t uniquely identified is now especially nice. Let’s “accidentally” add a duplicate row to readouts.

readouts2 <- readouts %>%
  slice(seq_len(n()), n()) 

readouts2
#> # A tibble: 6 × 4
#> county date system value
#>
#> 1 Wake 2020-01-01 A 3.2
#> 2 Wake 2020-01-02 B 4
#> 3 Wake 2020-01-03 C 5.5
#> 4 Guilford 2020-01-03 A 2
#> 5 Guilford 2020-01-04 C 1.2
#> 6 Guilford 2020-01-04 C 1.2

Pivoting on system warns us that the values from value are not uniquely identified.

pivot_wider(
  readouts2,
  id_cols = county,
  names_from = system,
  values_from = value
)
#> Warning: Values from `value` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = {summary_fun}` to summarise duplicates.
#> * Use the following dplyr code to identify duplicates.
#>   {data} %>%
#>     dplyr::group_by(county, system) %>%
#>     dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
#>     dplyr::filter(n > 1L)
#> # A tibble: 2 × 4
#>   county   A         B         C        
#>                  
#> 1 Wake       
#> 2 Guilford      

This provides us with a number of options, but the last one is particularly useful if we weren’t expecting duplicates. This prints out a block of dplyr code that you can use to quickly identify duplication issues. Replacing {data} with readouts2, we get:

readouts2 %>%
  dplyr::group_by(county, system) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L) 
#> # A tibble: 1 × 3
#>   county   system     n
#>         
#> 1 Guilford C          2

(Un)nesting

unnest_longer() and unnest_wider() have both gained the ability to unnest multiple columns at once. This is particularly useful with unnest_longer(), where sequential unnesting would instead result in a Cartesian product, which isn’t typically desired.

df <- tibble(x = list(1, 1:2), y = list(1, 1:2))
df
#> # A tibble: 2 × 2
#>   x         y        
#>          
#> 1  
#> 2  
# Sequential unnesting
df %>%
  unnest_longer(x) %>%
  unnest_longer(y)
#> # A tibble: 5 × 2
#>       x     y
#>    
#> 1     1     1
#> 2     1     1
#> 3     1     2
#> 4     2     1
#> 5     2     2

# Joint unnesting
df %>%
unnest_longer(c(x, y))
#> # A tibble: 3 × 2
#> x y
#>
#> 1 1 1
#> 2 1 1
#> 3 2 2

Grids

When complete()-ing a data frame, it’s often useful to immediately fill the newly generated missing values with a value that better represents their intention. For example, with the daily data we could complete on the day factor column and insert zeros for value in any row that wasn’t previously represented.

daily
#> # A tibble: 4 × 3
#>   day   value type 
#>     
#> 1 Tue       2 A    
#> 2 Thu       3 B    
#> 3 Fri       1 B    
#> 4 Mon       5 A

daily %>%
complete(day, fill = list(value = 0))
#> # A tibble: 7 × 3
#> day value type
#>
#> 1 Mon 5 A
#> 2 Tue 2 A
#> 3 Wed 0 NA
#> 4 Thu 3 B
#> 5 Fri 1 B
#> 6 Sat 0 NA
#> 7 Sun 0 NA

But what if there were already missing values before completing? By default, complete() will still fill those explicit missing values too.

daily2 <- daily
daily2$value[nrow(daily2)] <- NA
daily2
#> # A tibble: 4 × 3
#>   day   value type 
#>     
#> 1 Tue       2 A    
#> 2 Thu       3 B    
#> 3 Fri       1 B    
#> 4 Mon      NA A

daily2 %>%
complete(day, fill = list(value = 0))
#> # A tibble: 7 × 3
#> day value type
#>
#> 1 Mon 0 A
#> 2 Tue 2 A
#> 3 Wed 0 NA
#> 4 Thu 3 B
#> 5 Fri 1 B
#> 6 Sat 0 NA
#> 7 Sun 0 NA

To avoid this, you can now retain pre-existing explicit missing values with the new explicit argument:

daily2 %>%
  complete(day, fill = list(value = 0), explicit = FALSE)
#> # A tibble: 7 × 3
#>   day   value type 
#>     
#> 1 Mon      NA A    
#> 2 Tue       2 A    
#> 3 Wed       0 NA   
#> 4 Thu       3 B    
#> 5 Fri       1 B    
#> 6 Sat       0 NA   
#> 7 Sun       0 NA

Missing values

The three core missing values functions, drop_na(), replace_na(), and fill(), have all been updated to utilize vctrs. This allows them to work properly with a wider variety of types, and makes them safer to use with some of the existing types that they already supported.

As an example, fill() now works properly with the Period types from lubridate:

library(lubridate, warn.conflicts = FALSE)

df <- tibble(x = seconds(c(1, 2, NA, 4, NA)))

df %>%
fill(x, .direction = "down")
#> # A tibble: 5 × 1
#> x
#>
#> 1 1S
#> 2 2S
#> 3 2S
#> 4 4S
#> 5 4S

And it now treats NaN like any other missing value:

df <- tibble(x = c(NaN, 2, NA, 3))

df %>%
fill(x, .direction = "up")
#> # A tibble: 4 × 1
#> x
#>
#> 1 2
#> 2 2
#> 3 3
#> 4 3

The most drastic improvement in safety comes to replace_na(). Previously, this relied on [<- to replace missing values with a replacement value, which is much laxer than vctrs in terms of what the replacement value can be. This resulted in the possibility for your column type to change depending on what your replacement value was.

# Notice that this is an integer column
df <- tibble(x = c(1L, NA, 3L))
df
#> # A tibble: 3 × 1
#>       x
#>   
#> 1     1
#> 2    NA
#> 3     3
# Previous behavior without vctrs:

# Integer column changed to character column
df %>% replace_na(list(x = "missing"))
#> # A tibble: 3 × 1
#> x
#>
#> 1 1
#> 2 missing
#> 3 3

# Integer column changed to double column
df %>% replace_na(list(x = 1))
#> # A tibble: 3 × 1
#> x
#>
#> 1 1
#> 2 1
#> 3 3

With vctrs, we now ensure that the replacement value is always cast to the type of the column you are replacing in. This ensures that the column types remain the same before and after you replace any missing values.

# New behavior with vctrs:

# Error, because "missing" can't be converted to an integer
df %>% replace_na(list(x = "missing"))
#> Error: Can't convert replace$x to match type of data$x .

# Integer column type is retained, and the double value of 1 is
# converted to an integer replacement value of 1L
df %>% replace_na(list(x = 1))
#> # A tibble: 3 × 1
#> x
#>
#> 1 1
#> 2 1
#> 3 3

Acknowledgements

Thanks to the 25 people who contributed to this version of tidyr by discussing ideas and suggesting new features! @aliaamiri, @allenbaron, @bersbersbers, @cjburgess, @DanChaltiel, @edzer, @eshom, @gaborcsardi, @gergness, @ggrothendieck, @iago-pssjd, @issactoast, @joiharalds, @LuiNov, @LukasWallrich, @mgirlich, @MichaelChirico, @NFA, @olehost, @psads-git, @psychelzh, @ramiromagno, @romainfrancois, @TimTaylor, and @xiangpin.