Calculating a new variable - best practice?

SteveXD · June 7, 2018, 6:05pm

Hi there,

There's an issue I've come across a couple of times, know how to solve, but don't know if it's best practice.

Say I have the following data frame, and want to create a new variable, AB, which is the sum of variables A and B for each person:

df <- data.frame(person = rep(c("One","Two"), each = 4),
                 variable = rep(c("A","B","C","D"), 2),
                 amount = 1:8)

Normally what I would do is spread the variable column, mutate to create the new variable AB, and then gather them back together. So, using the following code:

df %>% spread(key = variable, value = amount) %>% 
  mutate(AB = A + B) %>%
  gather(A:AB, key = variable, value = amount)

My question is: is this the best approach to creating the new variable AB? Or is there a more efficient way to calculate it directly from the original data frame, e.g. perhaps somehow using group_by()?

I'd be grateful for your views.

cderv · June 7, 2018, 9:30pm

Not reshaping with tidyr but using the tidyverse you would calculate the summary table this way

library(dplyr, warn.conflicts = FALSE)
#> Warning: le package 'dplyr' a été compilé avec la version R 3.4.4
df <- data_frame(person = rep(c("One","Two"), each = 4),
                 variable = rep(c("A","B","C","D"), 2),
                 amount = 1:8)
df
#> # A tibble: 8 x 3
#>   person variable amount
#>   <chr>  <chr>     <int>
#> 1 One    A             1
#> 2 One    B             2
#> 3 One    C             3
#> 4 One    D             4
#> 5 Two    A             5
#> 6 Two    B             6
#> 7 Two    C             7
#> 8 Two    D             8
AB_df <- df %>% 
  filter(variable %in% c("A", "B")) %>%
  mutate(variable = "AB") %>%
  group_by(person, variable) %>%
  summarise(amount = sum(amount))
#> Warning: le package 'bindrcpp' a été compilé avec la version R 3.4.4
AB_df
#> # A tibble: 2 x 3
#> # Groups:   person [?]
#>   person variable amount
#>   <chr>  <chr>     <int>
#> 1 One    AB            3
#> 2 Two    AB           11
bind_rows(df, AB_df)
#> # A tibble: 10 x 3
#>    person variable amount
#>    <chr>  <chr>     <int>
#>  1 One    A             1
#>  2 One    B             2
#>  3 One    C             3
#>  4 One    D             4
#>  5 Two    A             5
#>  6 Two    B             6
#>  7 Two    C             7
#>  8 Two    D             8
#>  9 One    AB            3
#> 10 Two    AB           11

Created on 2018-06-07 by the reprex package (v0.2.0).

EconKid · June 8, 2018, 2:06am

Hi, @SteveXD ,

Here is another answer and I explain it step by step.

df %>% 
    group_by(person) %>% 
    filter(variable %in% c("A","B")) %>% 
    summarise(sum(amount))

group_by(person): do something divided by person
filter(variable %in% c("A","B")): only consider A and B
summarise(sum(amount)): get the aggregate result

cderv · June 8, 2018, 6:12am

How is this different ? It is just another order of steps in the pipeline but it is the same lacking some (adding new AB variable and row-bind to other tab).
Your explanation is a nice addition though. thanks!

EconKid · June 8, 2018, 12:41pm

Hi, @cderv,

I change a little bit on my answer, ordering the filter at first.
Because I maintain the two answer in the similar data process.
(filter first is faster).

Maybe I avoid one step, so the solution is a little faster.

> df <- data_frame(person = rep(c("One","Two"), each = 40000000),
+                  variable = rep(c("A","B","C","D"), 20000000),
+                  amount = 1:80000000)
> Start <- Sys.time()
> df %>% 
+     filter(variable %in% c("A","B")) %>% 
+     group_by(person) %>% 
+     summarise(sum(amount))
integer overflow - use sum(as.numeric(.))integer overflow - use sum(as.numeric(.))
> 
> diff <- Sys.time() - Start
> diff
Time difference of 6.808033 secs
> Start <- Sys.time()
> df %>% 
+   filter(variable %in% c("A", "B")) %>%
+   mutate(variable = "AB") %>%
+   group_by(person, variable) %>%
+   summarise(amount = sum(amount))
integer overflow - use sum(as.numeric(.))integer overflow - use sum(as.numeric(.))
> 
> diff <- Sys.time() - Start
> diff
Time difference of 9.0658 secs

jcblum · June 8, 2018, 2:44pm

It would be interesting to do a performance shootout of all of these, but for my money @SteveXD’s original solution is the most elegant to read, if maybe a little opaque if you haven’t internalized what spread and gather do.

I was thinking about the conversation in this thread which touched on the idea that it’s better to take advantage of R’s column-oriented-ness than to swim upstream with rowwise operations. I’m not completely sure how well this situation connects to those ideas, but it strikes me that maybe converting the problem to a colwise operation with a minimum of steps is a major point in favor of @SteveXD’s original solution?

SteveXD · June 8, 2018, 3:43pm

@cderv, @EconKid, @jcblum thanks for your suggestions and comments!

My personal preference is my original solution, using spread/gather, but it's useful to see at least one other option available.

cderv · June 9, 2018, 9:14am

Totally agree ! The solution with spread and gather is the most elegant.
Spreading and gathering can be costly operation on bigger data, and when the problem hits this point, keeping a rowwise problem could become the new solution.

I nice to have the two approach in one thread : colwise and rowise!

Thanks for sharing the link.