Summary fields from a custom function returning a one-level list or a one-row data.frame

Dear community,

I am misunderstanding something or approaching my problem from the wrong perspective, and I need your help to point me in the right direction, both in terms of syntax and performance optimisation.

Here is my problem: I need to calculate and derive several values from aggregates such as mean and sd.

My first approach, in pseudo-code, was:
data %>% group_by(g) %>% summarise(fl1 = mean(x), fl2 = mean(x) / sd(x))
The issue is that mean(x) is recalculated in each field. That is not very efficient with thousands of groups and a dozen references to mean, sd, min, max, etc.
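
For reference, here is that first approach spelled out as runnable code with made-up data (the column names g and x, the group sizes, and the values are all hypothetical; the later sketches reuse this data):

library(dplyr)

# Hypothetical example data; the real data has thousands of groups.
data <- tibble(
  g = rep(c("a", "b", "c"), each = 100),
  x = rnorm(300)
)

# First approach: mean(x) is evaluated twice per group.
data %>%
  group_by(g) %>%
  summarise(fl1 = mean(x), fl2 = mean(x) / sd(x))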

My second approach was to create a function that takes the vector x, calculates all my fields, and returns a one-row data.frame. The pseudo-code becomes:
data %>% group_by(g) %>% summarise(fl = list(f(x))) %>% unnest(c(fl))
Here the code appears slow in terms of performance. I am also not comfortable with the syntax: it looks a bit off, so I am not sure it is the proper and elegant way to do it.
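
For illustration, a minimal sketch of this second approach (f and its fields are placeholders; each aggregate is computed once inside the function):

library(dplyr)
library(tidyr)

# Placeholder helper: computes each aggregate once and returns one row.
f <- function(x) {
  m <- mean(x)
  tibble(fl1 = m, fl2 = m / sd(x), fl3 = m - min(x))
}

data %>%
  group_by(g) %>%
  summarise(fl = list(f(x))) %>%
  unnest(c(fl))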

For the third approach, I tried to return a one-level list from my function, but I was not able to unlist it properly so that its elements are turned into columns / fields.

So what would you recommend as a proper approach using tidyverse syntax?

Would you recommend using more standard functions like apply, or another package like purrr, to handle such a problem?

Thanks in advance and best regards,

jm

Is it actually slow with the first approach? You are right that mean(x) will be calculated twice, but these types of operations are extremely fast, so the overhead is usually negligible. If you want to optimize that part, you can return both mean(x) and sd(x) in your first summarize and then simply use mutate in a next step to calculate fl2. So, something like this:

... %>%
  summarize(fl1 = mean(x), sd_x = sd(x)) %>%
  mutate(fl2 = fl1/sd_x)

Not sure how much faster (if at all) this would be.
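
Fully spelled out with the example data from the question, it would look like this:

library(dplyr)

data %>%
  group_by(g) %>%
  summarize(fl1 = mean(x), sd_x = sd(x)) %>%
  mutate(fl2 = fl1 / sd_x) %>%
  select(-sd_x)  # drop the intermediate column if it is not needed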

Thank you for your feedback. I will try to precalculate in the summarise and then mutate.

I did find the performance slow. For instance, I reuse the mean 20 times, so I do get a performance improvement by injecting that value directly. Such an improvement is significant for my Shiny app's reactivity. All these small things together already improve execution by a factor of 100. And then parallelisation is an option on top of that.

After writing this post, I kept doing some tests and profiling:

  • Returning a tibble rather than a data.frame improved execution speed by about 20%. I do not understand why.
  • I found the syntax to unnest_wider a list; execution is again about 20% faster (see the sketch after this list). Profiling shows the time is split roughly 50% generating the data and 50% "rectangling" it.
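
For completeness, a sketch of that unnest_wider() variant (f_list is a hypothetical helper returning a named list, with each aggregate computed once):

library(dplyr)
library(tidyr)

# Hypothetical helper returning a plain named list.
f_list <- function(x) {
  m <- mean(x)
  list(fl1 = m, fl2 = m / sd(x))
}

data %>%
  group_by(g) %>%
  summarise(fl = list(f_list(x))) %>%
  unnest_wider(fl)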

So I guess that while the code executes, the performance still has some room for improvement. Hence my question, as I feel I am missing some "iron path" (a fast path) to a function that does this kind of operation more efficiently, like maybe apply or purrr. Somehow it is counter-intuitive to me that calling a function in a DLL n times is slower than executing n custom calculations. The same goes for the time spent rectangling n one-column rows that all share the same structure. The data I am manipulating is a couple of MB when saved as an Rds file.
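
One pattern worth noting here (a sketch, assuming dplyr >= 1.0): summarise() can take an unnamed expression that returns a data frame and will unpack its columns directly, which skips the list() wrapper and the unnest step, and therefore the rectangling cost, entirely. Reusing the hypothetical f() from above that returns a one-row tibble:

library(dplyr)

# f() returns a one-row tibble; its columns become summary fields directly.
data %>%
  group_by(g) %>%
  summarise(f(x), .groups = "drop")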

Regards,

jm
