Right way to do grouped `[` with dplyr?

jalsalam · September 30, 2017, 2:03am

I've tried lots of different things, but keep feeling like I must be missing the obvious solution.

What is the tidyverse way to do safe indexing by-group? For example, say I wanted to calculate the change in population from 2000 to 2012 from the populations dataset (provided in tidyr)

The basic approach with [[ or [ errors if some countries don't have a value for the given year (e.g., Montenegro)

suppressPackageStartupMessages(library("tidyverse"))

pop_chg <- population %>%
  group_by(country) %>%
  summarize(chg_2000_to_2012 = (population[year == 2012]) / (population[year ==2000]))

What I want is for these situations to return NA_real_

I played around with %||% but [ is returning numeric(0), which is different from NULL, so it doesn't help

pop_chg <- population %>%
  group_by(country) %>%
  summarize(chg_2000_to_2012 = (population[year == 2012] %||% NA_real_) / (population[year ==2000] %||% NA_real_))

I have been able to get a combination of dplyr::nth and purrr::detect_index to work safely

pop_chg <- population %>%
  group_by(country) %>%
  summarize(chg_2000_to_2012 = (nth(population, detect_index(year, ~.x == 2012))) /
                                  nth(population, detect_index(year, ~.x == 2000)))

But this is so verbose for something relatively simple, there must be a better way, right? Also, when I did this on
a large-ish dataset, it was a lot slower than simple [

A second way that works is to make a variation on %||% that uses rlang::is_empty instead of is.null
suppressPackageStartupMessages(library("tidyverse"))


`%|||%` <- function (x, y) 
{
  if (rlang::is_empty(x)) {
    y
  }
  else {
    x
  }
}

population %>%
  group_by(country) %>%
  summarize(chg = 
              (population[year == 2012] %|||% NA_real_) / 
              (population[year == 2000] %|||% NA_real_))

This is what I am using at the moment, which works, but it feels strange because this seems like such a common operation I was surprised that I would need to define something new, so I assume I'm missing the idiomatic approach.

nick · September 30, 2017, 2:39am

While there's definitely multiple ways to go about this, the most straightforward off the top of my head is with tidyr::spread:

suppressPackageStartupMessages(library(tidyverse))

pop_chg <- population %>%
  filter(year %in% c(2000, 2012)) %>%
  spread(year, population) %>%
  mutate(chg_2000_to_2012 = `2012`/`2000`)

pop_chg
#> # A tibble: 219 x 4
#>                country   `2000`   `2012` chg_2000_to_2012
#>                  <chr>    <int>    <int>            <dbl>
#>  1         Afghanistan 20595360 29824536        1.4481192
#>  2             Albania  3304948  3162083        0.9567724
#>  3             Algeria 31719449 38481705        1.2131896
#>  4      American Samoa    57522    55128        0.9583811
#>  5             Andorra    65399    78360        1.1981835
#>  6              Angola 13924930 20820525        1.4951978
#>  7            Anguilla    11071    14132        1.2764881
#>  8 Antigua and Barbuda    77648    89069        1.1470869
#>  9           Argentina 36903067 41086927        1.1133743
#> 10             Armenia  3076098  2969081        0.9652101
#> # ... with 209 more rows
pop_chg %>% filter(is.na(chg_2000_to_2012))
#> # A tibble: 9 x 4
#>                             country   `2000`   `2012` chg_2000_to_2012
#>                               <chr>    <int>    <int>            <dbl>
#> 1 Bonaire, Saint Eustatius and Saba       NA    18694               NA
#> 2                           Curaçao       NA   155293               NA
#> 3                        Montenegro       NA   621081               NA
#> 4              Netherlands Antilles   177838       NA               NA
#> 5                            Serbia       NA  9552553               NA
#> 6               Serbia & Montenegro 10883641       NA               NA
#> 7         Sint Maarten (Dutch part)       NA    44355               NA
#> 8                       South Sudan       NA 10837527               NA
#> 9                       Timor-Leste       NA  1114106               NA

It's possible that won't translate as easily to your specific need. Getting the variables side-by-side could help, though.

cderv · September 30, 2017, 8:02am

I think tidyr::complete can help you. In your code, just complete your table with missing value before calculating the grouping summary indicator.

suppressPackageStartupMessages(library("tidyverse"))

population %>%
  filter(year %in% c(2000, 2012)) %>%
  complete(country, year) %>%
  group_by(country) %>%
  summarize(chg_2000_to_2012 = (population[year == 2012]) / (population[year ==2000])) %>%
  filter(is.na(chg_2000_to_2012))
#> # A tibble: 9 x 2
#>                             country chg_2000_to_2012
#>                               <chr>            <dbl>
#> 1 Bonaire, Saint Eustatius and Saba               NA
#> 2                           Curaçao               NA
#> 3                        Montenegro               NA
#> 4              Netherlands Antilles               NA
#> 5                            Serbia               NA
#> 6               Serbia & Montenegro               NA
#> 7         Sint Maarten (Dutch part)               NA
#> 8                       South Sudan               NA
#> 9                       Timor-Leste               NA

complete(country, year) will make sure there is one line per country and per year in the table. Every country with missing year 2000 will be added. Then your division operation will succeed. By default, it will complete with NA value but you can use fill= argument to fill with the value you want - it can be different by column. see help("complete", package = "tidyr")

Tazinho · September 30, 2017, 8:21am

I like the solution with complete, but one could also go for an if else construct. Note that this can be dangerous, so I will point out several solutions involving base::if()...else(), dplyr::if_else() and base::ifelse().

The best one is base::if()... else()

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

population %>% 
  group_by(country) %>% 
  summarise(chg_2000_to_2012 = if(all(c(2000, 2012) %in% year)){
    population[year == 2012]/population[year == 2000]
  } else {
    NA_real_
  })
#> # A tibble: 219 x 2
#>                country chg_2000_to_2012
#>                  <chr>            <dbl>
#>  1         Afghanistan        1.4481192
#>  2             Albania        0.9567724
#>  3             Algeria        1.2131896
#>  4      American Samoa        0.9583811
#>  5             Andorra        1.1981835
#>  6              Angola        1.4951978
#>  7            Anguilla        1.2764881
#>  8 Antigua and Barbuda        1.1470869
#>  9           Argentina        1.1133743
#> 10             Armenia        0.9652101
#> # ... with 209 more rows

Usually one should use dplyr::if_else(). However, in this case it will result in an error (which is intended), since some of the countries don't contain a year 2000 or 2012 observation, as also mentioned within the question.

population %>% 
  group_by(country) %>% 
  summarise(chg_2000_to_2012 = if_else(all(c(2000, 2012) %in% year),
                                       population[year == 2012]/population[year == 2000],
                                       NA_real_))
#> Error in summarise_impl(.data, dots): Evaluation error: `true` must be length 1 (length of `condition`), not 0.

So it is really better to use the if() else() from above.

One might also like to come up with base::ifelse(), which works in this case (because the data looks good and there is maximum one observation per combination of country and year)

population %>% 
  group_by(country) %>% 
  summarise(chg_2000_to_2012 = ifelse(all(c(2000, 2012) %in% year),
                                      population[year == 2012]/population[year == 2000],
                                      NA_real_))
#> # A tibble: 219 x 2
#>                country chg_2000_to_2012
#>                  <chr>            <dbl>
#>  1         Afghanistan        1.4481192
#>  2             Albania        0.9567724
#>  3             Algeria        1.2131896
#>  4      American Samoa        0.9583811
#>  5             Andorra        1.1981835
#>  6              Angola        1.4951978
#>  7            Anguilla        1.2764881
#>  8 Antigua and Barbuda        1.1470869
#>  9           Argentina        1.1133743
#> 10             Armenia        0.9652101
#> # ... with 209 more rows

If we add a second observation for Afghanistan in 2012, we will end up with results that might not have been intended.

population_help <- tibble::tibble(country = "Afghanistan", year = 2012L, population = 139213L)
population2 <- bind_rows(population, population_help)
population2 %>% filter(country == "Afghanistan", year %in% c(2000, 2012))
#> # A tibble: 3 x 3
#>       country  year population
#>         <chr> <int>      <int>
#> 1 Afghanistan  2000   20595360
#> 2 Afghanistan  2012   29824536
#> 3 Afghanistan  2012     139213

If we run our code with base::ifelse(), it will work, just taking the first observation with year 2012 and forget about the second, which is not safe (at least it is almost arbitrary).

population2 %>% 
  group_by(country) %>% 
  summarise(chg_2000_to_2012 = ifelse(all(c(2000, 2012) %in% year),
                                      population[year == 2012]/population[year == 2000],
                                      NA_real_)) %>% 
  filter(country == "Afghanistan")
#> # A tibble: 1 x 2
#>       country chg_2000_to_2012
#>         <chr>            <dbl>
#> 1 Afghanistan         1.448119

Whereas our initial if() else() construct will work perfectly. Both values are returned, and we will get an error within dplyr::summary(), since a summary should contain only one value.

population2 %>% 
group_by(country) %>% 
summarise(chg_2000_to_2012 = if(all(c(2000, 2012) %in% year)){
population[year == 2012]/population[year == 2000]
} else {
NA_real_
})
#> Error in summarise_impl(.data, dots): Column `chg_2000_to_2012` must be length 1 (a summary value), not 2

cderv · September 30, 2017, 9:33am

It occurs to me that purrr has function to handle errors properly in the tidyverse. You can then also use possibly to catch error and return a value instead.

It can be used like this here:

suppressPackageStartupMessages(library("tidyverse"))

population %>%
  group_by(country) %>%
  summarize(chg_2000_to_2012 = possibly((population[year == 2012])/(population[year ==2000]), otherwise = NA_real_)()) %>%
  filter(is.na(chg_2000_to_2012))
#> # A tibble: 219 x 2
#>                country chg_2000_to_2012
#>                  <chr>            <dbl>
#>  1         Afghanistan               NA
#>  2             Albania               NA
#>  3             Algeria               NA
#>  4      American Samoa               NA
#>  5             Andorra               NA
#>  6              Angola               NA
#>  7            Anguilla               NA
#>  8 Antigua and Barbuda               NA
#>  9           Argentina               NA
#> 10             Armenia               NA
#> # ... with 209 more rows

jalsalam · September 30, 2017, 11:41am

This seems like a clean solution. For this type of operation, it seems like explicit missing values are mostly needed.

jalsalam · September 30, 2017, 11:45am

Thank you for going into so much detail! It was helpful that you addressed the multiple observation as well as missing observation scenarios in your if() variations -- I know that multiple might sometimes act strangely with [ as well. Its interesting that the non-vectorized if()... else() works better than the vectorized if_else's that normally would be used inside dplyr verbs.

nick · September 30, 2017, 12:15pm

Note that this possibly method produces NA for every country (as shown in the output). There's two issues. The first is that possibly requires a function as input, so in this case, every group is throwing an error and falling back to otherwise. For example, try:

possibly(sin(1), otherwise = NA)()

This can be solved by throwing a ~ in front of the calculation. However, this leads to the second issue, ending up with the original error, as 10/c() is a valid calculation that possibly passes without falling back to otherwise.

jalsalam · September 30, 2017, 12:51pm

Oops, good point. I was lulled in by the nicely phrased code, not noticing that the results weren't turning out as expected. I switched the "solved" label to the "complete" based solution, though I think that for different situations some people might prefer spread or if() type solutions.

I am still sort of left wondering if this is a common enough problem to deserve a better solution.

I find I run into this type of problem frequently. A lot of my data is output from economic models and I am calculating things like "change from base year" or "change in policy scenario relative to the baseline scenario", etc, which are most naturally expressed with different grouped [ operations.

Both dplyr::nth and purrr::pluck strike me as nearly [[ replacements, but neither is really designed to index one vector by a logical question on a second vector.

hadley · September 30, 2017, 10:18pm

It seems like we need something like nth() that supports indexing via a different column.

nick · September 30, 2017, 10:50pm

Speaking of nth(), that brought to mind one other option:

suppressPackageStartupMessages(library(tidyverse))

population %>%
  group_by(country) %>%
  summarize(chg_2000_to_2012 = first(population[year == 2012] / population[year == 2000],
                                     default = NA_real_)) %>%
  filter(is.na(chg_2000_to_2012))
#> # A tibble: 9 x 2
#>                             country chg_2000_to_2012
#>                               <chr>            <dbl>
#> 1 Bonaire, Saint Eustatius and Saba               NA
#> 2                           Curaçao               NA
#> 3                        Montenegro               NA
#> 4              Netherlands Antilles               NA
#> 5                            Serbia               NA
#> 6               Serbia & Montenegro               NA
#> 7         Sint Maarten (Dutch part)               NA
#> 8                       South Sudan               NA
#> 9                       Timor-Leste               NA

I hadn't really thought of first() being a way to guarantee (for some definitions of "guarantee") a single-item vector, but it works pretty nicely in this case. But yes, indexing one variable by another seems like a common enough situation to warrant a tidy-er function.

thoughtfulnz · October 1, 2017, 6:58am

How about reducing it to the relevant years, making sure it is in the correct order, then using lag.

pop_chg <- population %>%
  filter(year == 2000 | year == 2012) %>% glimpse() %>%
  arrange(country, year) %>% glimpse() %>%
  group_by(country) %>%
  mutate(chg_2000_to_2012 = population / lag(population)) %>%
  ungroup() %>% filter(year == 2012)

cderv · October 1, 2017, 1:44pm

You are right, I have wanted to try this idea but it does not suits this specific problem. complete is ok for this job.
Thanks for the details.