Thought process to choose a Tidyverse approach: join, mutate + map, or other

JidduAlexander · October 18, 2017, 1:07pm

Sometimes I feel I use a Tidyverse approach, but not the right one or perhaps a non Tidyverse process altogether is better. Here is an example of such a situation:

I have some standard data, ie. mpg and cyl from mtcars. I also have some label summary statistics that says Bad, Medium or good for a level of mpg for cars with certain cyl.
Note: In the example pretend that the summary labels came from elsewhere. I'm not interested in calculating summary stats in order to label the data.

The data:

library(dplyr)
library(tibble)
library(purrr)

cars <- rownames_to_column(mtcars[1:2]) %>% as_data_frame()

mpg_label <- data_frame(
  cyl_l   = rep(c(4, 6, 8), each = 3),
  label_l = rep(c("bad", "medium", "good"), 3),
  mpg_l   = c(25, 28, Inf, 18, 20, Inf, 15, 18, Inf)
)

Now I want to apply those labels to my data, where cyl is cyl_l and is below a level and the previous. Not to difficult.

If you have a minute at this point stop reading and code how you'd do that task, and important also write your thought process!

Here is mine:

My brain goes in 'some' Tidyverse mode:

use mutate to create the new label column in cars.
map over the two variables that I need (cyl and mpg)
Use that cyl and mpg mapping to find your label.

cars %>% 
  mutate(label = map2_chr(cyl, mpg,
                      function(c, m) {
                        mpg_lbl <- mpg_label %>% 
                          filter(cyl_l == c,
                                 mpg_l > m) %>% 
                          slice(1) %>% 
                          select(label) %>% 
                          `[[`(1)
                      }))

It works fine, however I have a gut feeling there are better approaches.

What if the data gets very big? Is it still great to apply filter to each. Isn't there much faster matrix variant? Or something else? What is the thought process behind that?
Is there another tidyverse approach?
- Naturally, I believe I should be looking for a join. Join by cylinder and mpg where mpg is 'the smallest of values greater than". But I'm not sure how to do that with dplyr joins. So thought process should sound like: "I need to join two tables, join table 1 on/with/anti table 2, join by x and y where z".
- I was looking for a group_by(cyl) approach but wasn't sure how to get rid of the map2.

As I mentioned I'd be very interested into the thought process that you have before you write any code. I'd love to apply that to other examples as well.

Best,
Jiddu

edgararuiz · October 18, 2017, 2:17pm

Hi @JidduAlexander, I would try using the new case_when() function in dplyr to accomplish this. Here is a quick example I put together:


library(dplyr)
library(tibble)

mtcars %>%
  rownames_to_column() %>%
  mutate(
    label = case_when(
      cyl == 4 & mpg <= 25 ~ "bad",
      cyl == 4 & mpg <= 28 ~ "medium",
      cyl == 4 & mpg > 28 ~ "good",
      TRUE ~ "other"
    )
  )

nick · October 18, 2017, 2:28pm

Here are two map-free options that use cut (not sure if that's "tidyverse-ish" or not). The first is more chain-y, but the second reads a little cleaner and avoids a duplicate filter step.

suppressPackageStartupMessages(library(tidyverse))

cars <- rownames_to_column(mtcars[1:2]) %>% as_data_frame()

mpg_label <- data_frame(
  cyl_l   = rep(c(4, 6, 8), each = 3),
  label_l = rep(c("bad", "medium", "good"), 3),
  mpg_l   = c(25, 28, Inf, 18, 20, Inf, 15, 18, Inf)
)

# Using mutate/cut with no temporary variable
cars %>%
  group_by(cyl) %>%
  mutate(label = cut(mpg,
                     breaks = mpg_label %>%
                       filter(cyl_l == first(cyl)) %>%
                       pull(mpg_l) %>%
                       c(0, .),
                     labels = mpg_label %>%
                       filter(cyl_l == first(cyl)) %>%
                       pull(label_l)))
#> # A tibble: 32 x 4
#> # Groups:   cyl [3]
#>              rowname   mpg   cyl  label
#>                <chr> <dbl> <dbl> <fctr>
#>  1         Mazda RX4  21.0     6   good
#>  2     Mazda RX4 Wag  21.0     6   good
#>  3        Datsun 710  22.8     4    bad
#>  4    Hornet 4 Drive  21.4     6   good
#>  5 Hornet Sportabout  18.7     8   good
#>  6           Valiant  18.1     6 medium
#>  7        Duster 360  14.3     8    bad
#>  8         Merc 240D  24.4     4    bad
#>  9          Merc 230  22.8     4    bad
#> 10          Merc 280  19.2     6 medium
#> # ... with 22 more rows

# Possibly cleaner with a temp variable
cars %>%
  group_by(cyl) %>%
  mutate(label = {
    lab_df <- filter(mpg_label, cyl_l == first(cyl))
    
    cut(mpg,
        breaks = c(0, lab_df$mpg_l),
        labels = lab_df$label_l)
  })
#> # A tibble: 32 x 4
#> # Groups:   cyl [3]
#>              rowname   mpg   cyl  label
#>                <chr> <dbl> <dbl> <fctr>
#>  1         Mazda RX4  21.0     6   good
#>  2     Mazda RX4 Wag  21.0     6   good
#>  3        Datsun 710  22.8     4    bad
#>  4    Hornet 4 Drive  21.4     6   good
#>  5 Hornet Sportabout  18.7     8   good
#>  6           Valiant  18.1     6 medium
#>  7        Duster 360  14.3     8    bad
#>  8         Merc 240D  24.4     4    bad
#>  9          Merc 230  22.8     4    bad
#> 10          Merc 280  19.2     6 medium
#> # ... with 22 more rows

JidduAlexander · October 18, 2017, 3:28pm

Ah, that's nice.
It's one of those features mentioned on the dplyr cheatsheet that I ignored to learn before.

Do you have a way of implementing case_when programetically when you have many cases, like in my mpg_label dataframe?

JidduAlexander · October 18, 2017, 3:41pm

Great, that's the kind of group_by solution that I knew must exist but I couldn't think of. I like it.

edgararuiz · October 18, 2017, 3:42pm

Yes, case_when() is very flexible, there are some great examples in the reference page: http://dplyr.tidyverse.org/reference/case_when.html

alistaire · October 18, 2017, 11:18pm

Here's an approach that merges nested data.frames to avoid duplicating data, then leveraging findInterval to subset the labels. It still uses map2, but here it only iterates over the groups—the findInterval call for each is vectorized—and thus should still be reasonably efficient.

library(tidyverse)

cars <- mtcars %>% 
    rownames_to_column('car') %>%
    select(1:3) %>%
    as_data_frame()

mpg_label <- data_frame(
    cyl_l   = rep(c(4, 6, 8), each = 3),
    label_l = rep(c("bad", "medium", "good"), 3),
    mpg_l   = c(25, 28, Inf, 18, 20, Inf, 15, 18, Inf)
)

inner_join(nest(cars, -cyl), 
           nest(mpg_label, -cyl_l), 
           by = c('cyl' = 'cyl_l')) %>% 
    mutate(label = map2(data.x, data.y, 
                        ~.y$label_l[findInterval(.x$mpg, c(0, .y$mpg_l))])) %>%    # intervals need a left bound
    unnest(data.x, label)
#> # A tibble: 32 x 4
#>      cyl  label            car   mpg
#>    <dbl>  <chr>          <chr> <dbl>
#>  1     6   good      Mazda RX4  21.0
#>  2     6   good  Mazda RX4 Wag  21.0
#>  3     6   good Hornet 4 Drive  21.4
#>  4     6 medium        Valiant  18.1
#>  5     6 medium       Merc 280  19.2
#>  6     6    bad      Merc 280C  17.8
#>  7     6 medium   Ferrari Dino  19.7
#>  8     4    bad     Datsun 710  22.8
#>  9     4    bad      Merc 240D  24.4
#> 10     4    bad       Merc 230  22.8
#> # ... with 22 more rows

That sounds a lot like data.table non-equi joins, e.g.

library(data.table)

cars <- mtcars[1:2]
cars$car <- rownames(cars)
setDT(cars)

mpg_label <- data.table(
    cyl_l   = rep(c(4, 6, 8), each = 3),
    label_l = rep(c("bad", "medium", "good"), 3),
    mpg_l   = c(25, 28, Inf, 18, 20, Inf, 15, 18, Inf)
)

mpg_label[cars, on = .(cyl_l = cyl, mpg_l >= mpg), mult = 'first']
#>     cyl_l label_l mpg_l                 car
#>  1:     6    good  21.0           Mazda RX4
#>  2:     6    good  21.0       Mazda RX4 Wag
#>  3:     4     bad  22.8          Datsun 710
#>  4:     6    good  21.4      Hornet 4 Drive
#>  5:     8    good  18.7   Hornet Sportabout
#>  6:     6  medium  18.1             Valiant
#>  7:     8     bad  14.3          Duster 360
#>  8:     4     bad  24.4           Merc 240D
#>  9:     4     bad  22.8            Merc 230
#> 10:     6  medium  19.2            Merc 280
#> 11:     6     bad  17.8           Merc 280C
#> 12:     8  medium  16.4          Merc 450SE
#> 13:     8  medium  17.3          Merc 450SL
#> 14:     8  medium  15.2         Merc 450SLC
#> 15:     8     bad  10.4  Cadillac Fleetwood
#> 16:     8     bad  10.4 Lincoln Continental
#> 17:     8     bad  14.7   Chrysler Imperial
#> 18:     4    good  32.4            Fiat 128
#> 19:     4    good  30.4         Honda Civic
#> 20:     4    good  33.9      Toyota Corolla
#> 21:     4     bad  21.5       Toyota Corona
#> 22:     8  medium  15.5    Dodge Challenger
#> 23:     8  medium  15.2         AMC Javelin
#> 24:     8     bad  13.3          Camaro Z28
#> 25:     8    good  19.2    Pontiac Firebird
#> 26:     4  medium  27.3           Fiat X1-9
#> 27:     4  medium  26.0       Porsche 914-2
#> 28:     4    good  30.4        Lotus Europa
#> 29:     8  medium  15.8      Ford Pantera L
#> 30:     6  medium  19.7        Ferrari Dino
#> 31:     8     bad  15.0       Maserati Bora
#> 32:     4     bad  21.4          Volvo 142E
#>     cyl_l label_l mpg_l                 car

Support for non-equi joins in dplyr has been teased, but hasn't been mentioned in a while.

thoughtfulnz · October 18, 2017, 11:47pm

My first thought would be to use consecutive ifelse()s

mutate(label_l = "unknown",
label_l = ifelse(cyl == 4 & mpg <= 25, "bad", label_l),
label_l = ifelse(cyl == 4 & mpg <= 28, "medium", label_l),
etc

which follows an approach of picking out individual entries and modifying entries matching criteria.
For a fully programmatic approach that is based on the entries contained in the second data set, I would put the thresholds into wide format using spread

cyl bad medium good
4 24 28 Inf
6 18 20 Inf
8 15 18 Inf

and then merge with the original data. This means that each individual entry would be combined with the relevant threshold information and you could then assign the value with a mutate comparing the entry with the thresholds.

jim89 · October 19, 2017, 7:43am

Support for (some) non-equi joins is present in the fuzzyjoin package by @drob

Might be worth looking at, as the original question shouts "join" at me. You might even be able to get around the non-equi join condition by doing an equi-join (i.e a "normal" join!), and then using filter() to chop out the rows that you don't want (but that might not be idea if the data are large).

mmahoney · October 19, 2017, 8:27am

Case_when all the way! Learning how to use this approach honestly was one of those "(cue the music) and then the clouds parted" moments.

JidduAlexander · October 19, 2017, 12:27pm

Cool, I like spreading mpg_label wide and joining by cyl. It allows you to have a limited amount (number of labels/columns) of vectorised filters.

JidduAlexander · October 19, 2017, 1:14pm

I must say I'm quite impressed with finding the interval of the inner_join of nested dataframes @alistaire.
I was recently introduced to the rolling joins of data.table and I'm glad you showed how to do this example. I find reading the code not at all obvious to read (ie. the word join is not even used), but in the case of these non-equi joins I am inclined to say that for me the data.table approach is so far unmatched in dplyr (despite its readibilty issues). Untill I looked into fuzzyjoin.

@jim89 I thought fuzzyjoin was mostly for joining similar character strings. But, huray, fuzzy_left_join allows match functions. In our example it requires both the lower and upper bound of the label, but that is easily done.

library(dplyr)
library(tibble)
library(purrr)
library(fuzzyjoin)

cars <- rownames_to_column(mtcars[1:2]) %>% as_data_frame()

mpg_label <- data_frame(
  cyl_l     = rep(c(4, 6, 8), each = 3),
  label_l   = rep(c("bad", "medium", "good"), 3),
  mpg_l_up  = c(25, 28, Inf, 18, 20, Inf, 15, 18, Inf),
  mpg_l_low = c(0, 25, 28, 0, 18, 20, 0, 15, 18)
)


fuzzy_left_join(cars, mpg_label, 
                by = c("cyl" = "cyl_l", "mpg" = "mpg_l_up", "mpg" = "mpg_l_low"), 
                match_fun = list(`==`, `<=`, `>`))

# # A tibble: 32 x 7
#             rowname   mpg   cyl cyl_l label_l mpg_l_up mpg_l_low
#               <chr> <dbl> <dbl> <dbl>   <chr>    <dbl>     <dbl>
# 1         Mazda RX4  21.0     6     6    good      Inf        20
# 2     Mazda RX4 Wag  21.0     6     6    good      Inf        20
# 3        Datsun 710  22.8     4     4     bad       25         0
# 4    Hornet 4 Drive  21.4     6     6    good      Inf        20
# 5 Hornet Sportabout  18.7     8     8    good      Inf        18
# 6           Valiant  18.1     6     6  medium       20        18
# 7        Duster 360  14.3     8     8     bad       15         0
# 8         Merc 240D  24.4     4     4     bad       25         0
# 9          Merc 230  22.8     4     4     bad       25         0
#10          Merc 280  19.2     6     6  medium       20        18
# # ... with 22 more rows

I love it. The description of fuzzy_join states:
"The match_fun argument is called once on a vector with all pairs of unique comparisons: thus, it should be efficient and vectorized."
I'm not a data.table expert, but I expect data.table non-equi joins to be optimised for speed ('by reference' is a term that seems to stand out).

Regarding the non-equi-joins in dplyr I was reading through various pages. Hadley mentioned (if I understood it correctly) that a lazy evaluation of cross joins followed that allow a filter to applied during the process would be the way to go. But that cross joins are evaluated immediately such that it wouldn't be a good idea when the data are large.