dplyr function names vs SQL

I was just teaching dplyr to an audience that knows SQL and we all collectively noticed that dplyr sometimes uses the same name as SQL for the same task ... but sometimes doesn't. This inconsistency can make it hard to remember the dplyr function names if you have prior experience with SQL.

I'm wondering if anyone here knows how these names came to be, and can speak with authority about the role that SQL played in the naming of these functions.

Identical names:

  • SELECT -> ?select
  • GROUP BY -> ?group_by
  • DESC -> ?desc

Different names

  • WHERE -> ?filter
  • ORDER BY -> ?arrange

I can see why maybe it's odd to call a function where, but if you already have group_by then why not go all in with order_by?

@arilamstein = in my opinion, dplyr and SQL coming along from different backgrounds and development paths. R from BioMedical and SQL from Business administrators. dplyr does not use IN, OR, LIKE statements too as in SQL. We need str_detect to do so. I expect some gurus will explain more.

I was under the impression that the creators of dplyr are familiar with SQL [1] and did (and still do) use it as a direct inspiration [2,3]. But SQL is not very well suited for data analysis [4] so the design of dplyr is about taking the good parts but reformulating other parts with data analysis in mind [5,6].




(not speaking with authority, but I have sources!)


1:

That said, I am very familiar with SQL

see: Disagree with Hadley's comment about databases
2:

SQL is the inspiration for dplyr’s conventions, so the translation is straightforward

source: https://r4ds.had.co.nz/relational-data.html

3:

Thanks to Kirill Müller, dplyr has a new experimental family of row mutation functions inspired by SQL’s UPDATE , INSERT , UPSERT , and DELETE .

source: https://www.tidyverse.org/blog/2020/05/dplyr-1-0-0-last-minute-additions/

4: for example https://blog.exploratory.io/why-sql-is-not-for-analysis-but-dplyr-is-5e180fef6aa7

5:

If you’ve used a database before, you’ve almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis

source: https://r4ds.had.co.nz/relational-data.html

6:

[...] dplyr maybe might be better than SQL in some ways. But I think it is, because it's trying to solve a much, much smaller problem than SQL is trying to solve. [...] I think you can rethink the language and the interface, and of course, we've learned a bunch about programming and programming languages and the 40 years since SQL has been around. So I think there's some really nice things about dplyr that just make life a little bit more pleasant.

source: https://www.superdatascience.com/podcast/hadley-wickham-talks-integration-and-future-of-python-and-r

1 Like