Comparing performance of dplyr and native SQL queries

dplyr
dbplyr
performance

#1

Is there a way to measure system load outside R (e.g., on the database itself) so that complex dplyr queries can be compared to "native" SQL? We got into a debate during a Cascadia R Conf workshop about comparative speed / system load. In the back of my mind is the objective of getting more people in an organization to use the data, for which dplyr is a clear entry point and everyday tool. The "other side" (so to speak) argued that writing SQL and submitting it as a query is better because it forces people to think about performance and to understand the structure of the database.

I'd be interested in opinions on the topic as well as suggestions regarding how to find data that would help make the tradeoffs clearer.


#2

I think PostgreSQL, at least, lets you ask for a query to be explained (with costs): https://www.postgresql.org/docs/9.3/static/sql-explain.html . So you can either call explain() through dplyr, or capture the SQL from dplyr using dbplyr::remote_query() and inspect it yourself.
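
For example, a minimal sketch (the RPostgres driver, the `orders` table, and its columns are placeholders; adapt them to your own database):

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Hypothetical connection; substitute your own driver and credentials
con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "mydb")

# A lazy dplyr pipeline -- nothing is executed yet
q <- tbl(con, "orders") %>%    # "orders" is a placeholder table
  filter(amount > 100) %>%     # "amount" is a placeholder column
  count(customer_id)

# Capture the SQL dplyr would send, without running it
sql_text <- dbplyr::remote_query(q)

# Ask PostgreSQL for the estimated costs of that exact query
DBI::dbGetQuery(con, paste("EXPLAIN", sql_text))
```

dplyr::explain(q) does roughly the same thing in one step: it prints the generated SQL followed by the database's query plan.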


#3

Thanks. MySQL has similar functionality: https://dev.mysql.com/doc/refman/5.5/en/using-explain.html

I'll investigate how to use those commands in the two different modes (dplyr vs. sending the SQL query as text); a first sketch is below.
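
A rough sketch of both modes against MySQL (the RMariaDB driver and the `flights` table are assumptions; swap in your own connection and schema):

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Hypothetical connection; substitute your own driver and credentials
con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname = "mydb")

# Mode 1: dplyr -- explain() sends "EXPLAIN <generated SQL>" to the database
q <- tbl(con, "flights") %>%   # "flights" is a placeholder table
  filter(year == 2013) %>%
  count(carrier)
explain(q)

# Mode 2: raw SQL submitted as text
sql_text <- "
  SELECT carrier, COUNT(*) AS n
  FROM flights
  WHERE year = 2013
  GROUP BY carrier
"
DBI::dbGetQuery(con, paste("EXPLAIN", sql_text))
```

Since dplyr just generates SQL and hands it to the same driver, the two plans should match whenever the generated SQL matches the hand-written query; comparing the EXPLAIN output for both is a concrete way to check.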