Ease of use and performance trade-off in data wrangling

I personally think that the tidyverse ecosystem really makes a huge difference to my daily work. It is easy to use and has a complete set of tools suitable for most of my tasks. Even better, the dplyr package can easily work with backends like PostgreSQL and Spark, which partially solves the memory constraint issue of R. However, the tidyverse is not that great when it comes to performance. In my own experience, when I use dplyr with broom, the performance gets even worse.

I started learning R with data.table, but since I came across the tidyverse, I don't use data.table that frequently. I turn to data.table when dplyr is too slow.
I was wondering if there is any work being done to integrate the strengths of dplyr and data.table to solve the memory constraint and performance issues in a unified framework.

There is a package called dtplyr. It's far from complete, but under the hood it calls data.table, so you can try that.

But I assume that since data.table is an H2O project and dplyr is an RStudio project, they aren't going to merge any time soon...

Use dplyr when you are interacting with a database, working with medium-sized data, or using list columns. Use data.table for everything else, especially Shiny projects.

dtplyr is a data.table backend for dplyr: https://github.com/hadley/dtplyr - be sure to read the README on dtplyr. dtplyr makes more copies than you would if you were working with data.table directly.
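As a rough illustration of the backend idea, here is a minimal sketch. It assumes a recent dtplyr release (1.0 or later) where `lazy_dt()` is available, and the `sales` table and its columns are made up for the example. It runs the same grouped sum once through dplyr verbs on a dtplyr wrapper and once with data.table syntax directly.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Hypothetical toy data, only for illustration
sales <- data.table(
  region = c("east", "west", "east", "west"),
  amount = c(10, 20, 30, 40)
)

# dplyr verbs on a lazy data.table wrapper (assumes dtplyr >= 1.0);
# nothing is computed until the result is requested with as.data.table()
via_dtplyr <- lazy_dt(sales) %>%
  group_by(region) %>%
  summarise(total = sum(amount)) %>%
  as.data.table()

# The same aggregation written directly in data.table syntax;
# keyby sorts the result by region
via_dt <- sales[, .(total = sum(amount)), keyby = region]

print(via_dtplyr)
print(via_dt)
```

Both calls should give the same totals per region; which one is faster on real data is something you would have to measure yourself.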

Why do you recommend data.table for shiny projects?

Shiny apps are used by more than one person and you need speed there.

Directly calling data.table gives your data wrangling way more speed for such use cases.
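One thing direct data.table use gives you is modification in place. The sketch below uses toy data with hypothetical column names: it adds a column with `:=` and compares `address()` before and after to check that the table itself was not copied. It is only an illustration, not a benchmark.

```r
library(data.table)

# Hypothetical toy table, only for illustration
dt <- data.table(id = 1:5, value = c(2, 4, 6, 8, 10))

# Memory address of the table before the update
addr_before <- address(dt)

# := adds the column by reference instead of building a modified copy
dt[, doubled := value * 2]

# Memory address after the update
addr_after <- address(dt)

# TRUE when the table object was updated in place
same_object <- identical(addr_before, addr_after)
print(same_object)
```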

That's it

:grinning:

Rather than "use data.table when making shiny apps", I think better advice would be: "data.table is one option to try when you discover you need more speed from your R script".

Many shiny apps have no performance constraints due to data size, and using data.table in those situations would be of no benefit.

I guess I am a data.table guy, so that's my default go-to solution in any case whatsoever. :sunglasses:

But that's exactly what I meant: use data.table only if you need speed; otherwise you are good to go.

:grinning:

Thanks for pointing it out.