Comparing performance of dplyr and native SQL queries

dplyr
dbplyr
performance

#1

Is there a way to measure system load outside R (e.g., on the database) so as to compare complex dplyr queries to "native" SQL? We got into a debate during a Cascadia R Conf workshop about comparative speed / system load. In the back of my mind is the objective of getting more people in an organization to use the data so that dplyr is a clear entry point and every-day tool. The "other side" (so to speak) was arguing that writing SQL and submitting it as a query was better because it forces people to think about performance and to understand the structure of the database.

I'd be interested in opinions on the topic as well as suggestions regarding how to find data that would help make the tradeoffs clearer.


#2

I think on PostgreSQL at least one can ask for the query to be explained (with costs): https://www.postgresql.org/docs/9.3/static/sql-explain.html . So one can call explain through dplyr, or capture the SQL from dplyr using dbplyr::remote_query() and check it out.


#3

Thanks. MySQL has a similar functionality: https://dev.mysql.com/doc/refman/5.5/en/using-explain.html

Will investigate how to use those commands in the two different modes (dplyr vs sending the SQL query as text).


#4

do you have any results? I converted lots of SQL code to dplyr used in a shiny app, and the slowdown is visible (2s to populate all tables with dbGetQuery vs 4s with dplyr, both using pool). Any performance tips for dplyr in the context of sql databases?


#5

I'll bet not all dplyr code is the same, just like SQL query performance varies, right? I think that comparing performance of specific code and code translations is a really worthwhile project. See this post which talks about all the time it takes to come up with a query that actually answers a question: https://caitlinhudon.com/2018/11/28/git-sql-together/