Help with Vectorizing a Loop

djslp · July 10, 2018, 6:55pm

I have the following tibble x.

Capture

I want to have the column new_col populated with a number signifying what to divide the sum of columns 2017.2, 2017.3, 2017.4, 2018.1 by. Columns 2 through 5 are quarterly visits to stores. A customer that has just started visiting in 2018 should not have their total transaction count divided by 4 quarters, as they are a new customer and have only come to us one quarter.

The following code in a for loop solves my question, but is taking too long to go through all the data (what appears to be multiple days).

x$new_col[i] <- 5 - min(which(x[i, 2:5] > 0))

How can I apply the 5 - min(which(x[i, 2:5] > 0)) to every row simultaneously as apposed to looping through each element? Open to suggestions within the tidyverse or in separate packages.

Leon · July 10, 2018, 7:20pm

You could take a look at rowwise()

tbradley · July 10, 2018, 7:31pm

Not sure if this is the best approach but you can use a gather/mutate/spread combo.

Here is a reprex demonstrating on the first couple rows of your data:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

x <- tribble(
  ~row, ~`2017.2`, ~`2017.3`, ~`2017.4`, ~`2018.1`,
  1, 0, 2, 3, 4, 
  2, 0, 0, 1, 0, 
  3, 0, 0, 0, 9, 
  4, 0, 2, 3, 4, 
  5, 0, 0, 1, 0
)


x %>% 
  gather(key = "key", value = "value", -row) %>% 
  group_by(row) %>% 
  mutate(key = factor(key, levels = unique(key))) %>% 
  arrange(row, key) %>% 
  mutate(new_col = 5 - min(as.numeric(.$key[value > 0]))) %>% 
  ungroup() %>% 
  spread(key = key, value = value)
#> # A tibble: 5 x 6
#>     row new_col `2017.2` `2017.3` `2017.4` `2018.1`
#>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#> 1     1       3        0        2        3        4
#> 2     2       2        0        0        1        0
#> 3     3       1        0        0        0        9
#> 4     4       3        0        2        3        4
#> 5     5       2        0        0        1        0

Created on 2018-07-10 by the reprex package (v0.2.0).

djslp · July 10, 2018, 7:51pm

The data was originally in a gathered format. I thought that column-wise operations would work better.

Let me try your method. My other tidyverse computations have worked quickly so far.

jcblum · July 10, 2018, 11:43pm

Here's a purrr::pmap() approach:

library(tidyverse)

x <- tribble(
  ~row, ~`2017.2`, ~`2017.3`, ~`2017.4`, ~`2018.1`,
  1, 0, 2, 3, 4, 
  2, 0, 0, 1, 0, 
  3, 0, 0, 0, 9, 
  4, 0, 2, 3, 4, 
  5, 0, 0, 1, 0
)

# This could just as easily be x$new_col <- pmap_dbl(...)
x %>% mutate(
  new_col = pmap_dbl(
    x, 
    ~ 5 - min(which(c(..2, ..3, ..4, ..5) > 0))
  )
)
#> # A tibble: 5 x 6
#>     row `2017.2` `2017.3` `2017.4` `2018.1` new_col
#>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
#> 1     1        0        2        3        4       3
#> 2     2        0        0        1        0       2
#> 3     3        0        0        0        9       1
#> 4     4        0        2        3        4       3
#> 5     5        0        0        1        0       2

Created on 2018-07-10 by the reprex package (v0.2.0).

Note that I used pmap_dbl() because tribble() made the sample data columns doubles and I didn't bother changing them to integers. If your actual data are integers, you'll probably want to use pmap_int() instead. purrr is extremely picky about types on purpose.

I am curious to hear how it performs on your actual dataset!

rensa · July 11, 2018, 12:10am

You could also do it this way:

library(tidyverse)

df = tribble(
  ~row, ~`2017.2`, ~`2017.3`, ~`2017.4`, ~`2018.1`,
  1, 0, 2, 3, 4, 
  2, 0, 0, 1, 0, 
  3, 0, 0, 0, 9, 
  4, 0, 2, 3, 4, 
  5, 0, 0, 1, 0)

df %>% mutate(visited_quarters = case_when(
`2017.2` > 0 ~ 4,
`2017.3` > 0 ~ 3,
`2017.4` > 0 ~ 2,
`2018.1` > 0 ~ 1,
TRUE ~ NA_real_))
#> # A tibble: 5 x 6
#>     row `2017.2` `2017.3` `2017.4` `2018.1` visited_quarters
#>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>            <dbl>
#> 1     1        0        2        3        4                3
#> 2     2        0        0        1        0                2
#> 3     3        0        0        0        9                1
#> 4     4        0        2        3        4                3
#> 5     5        0        0        1        0                2

This code might start to look messy with more quarters, though, and you might not want to update the code when you have to change the quarters.

On the plus side, though, it doesn't matter if your quarter columns accidentally fall out of order. If you need to automate this, I'd tend to go for @jcblum's solution; I feel like you could find a way to modify it to work on an arbitrary selection of columns that'd been ordered correctly.

djslp · July 11, 2018, 1:33pm

I can replicate your results when I copy and paste.

However, I receive the following error when I attempt to run it on my data with my variable names:

Error in mutate_impl(.data, dots) : Column 'new_col' must be length 1 (the group size), not n #n is the actual row count of my data

Did I forget to group_by something?

jcblum · July 11, 2018, 9:21pm

More likely the other way around! The error message implies that your tibble is already grouped, most likely by the row variable, since the group size is 1. Try inserting an ungroup() in the pipeline:

library(tidyverse)

x <- tribble(
  ~row, ~`2017.2`, ~`2017.3`, ~`2017.4`, ~`2018.1`,
  1, 0, 2, 3, 4, 
  2, 0, 0, 1, 0, 
  3, 0, 0, 0, 9, 
  4, 0, 2, 3, 4, 
  5, 0, 0, 1, 0
)

# Tibble is now grouped by row
x <- group_by(x, row)

# Won't work on grouped tibble
x %>% mutate(
  new_col = pmap_dbl(
    x, 
    ~ 5 - min(which(c(..2, ..3, ..4, ..5) > 0))
  )
)
#> Error in mutate_impl(.data, dots): Column `new_col` must be length 1 (the group size), not 5

# Add `ungroup()` to the pipeline
x %>% 
  ungroup() %>% 
  mutate(
    new_col = pmap_dbl(
      x, 
      ~ 5 - min(which(c(..2, ..3, ..4, ..5) > 0))
    )
  )
#> # A tibble: 5 x 6
#>     row `2017.2` `2017.3` `2017.4` `2018.1` new_col
#>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
#> 1     1        0        2        3        4       3
#> 2     2        0        0        1        0       2
#> 3     3        0        0        0        9       1
#> 4     4        0        2        3        4       3
#> 5     5        0        0        1        0       2

Created on 2018-07-11 by the reprex package (v0.2.0).

(You might also go back through your code to figure out when the tibble got grouped and removing the grouping code — grouping by row sounds like the kind of thing you might have done while trying different ways to solve this problem)