dplyr function names vs SQL

I was just teaching dplyr to an audience that knows SQL and we all collectively noticed that dplyr sometimes uses the same name as SQL for the same task ... but sometimes doesn't. This inconsistency can make it hard to remember the dplyr function names if you have prior experience with SQL.

I'm wondering if anyone here knows how these names came to be, and can speak with authority about the role that SQL played in the naming of these functions.

Identical names:

  • SELECT -> ?select
  • GROUP BY -> ?group_by
  • DESC -> ?desc

Different names

  • WHERE -> ?filter
  • ORDER BY -> ?arrange

I can see why maybe it's odd to call a function where, but if you already have group_by then why not go all in with order_by?

@arilamstein = in my opinion, dplyr and SQL coming along from different backgrounds and development paths. R from BioMedical and SQL from Business administrators. dplyr does not use IN, OR, LIKE statements too as in SQL. We need str_detect to do so. I expect some gurus will explain more.

I was under the impression that the creators of dplyr are familiar with SQL [1] and did (and still do) use it as a direct inspiration [2,3]. But SQL is not very well suited for data analysis [4] so the design of dplyr is about taking the good parts but reformulating other parts with data analysis in mind [5,6].




(not speaking with authority, but I have sources!)


1:

That said, I am very familiar with SQL

see: Disagree with Hadley's comment about databases - #15 by hadley
2:

SQL is the inspiration for dplyr’s conventions, so the translation is straightforward

source: 13 Relational data | R for Data Science

3:

Thanks to Kirill Müller, dplyr has a new experimental family of row mutation functions inspired by SQL’s UPDATE , INSERT , UPSERT , and DELETE .

source: dplyr 1.0.0: last minute additions

4: for example Why SQL is not for Analysis, but dplyr is | by Kan Nishida | learn data science

5:

If you’ve used a database before, you’ve almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because dplyr is specialised to do data analysis

source: 13 Relational data | R for Data Science

6:

[...] dplyr maybe might be better than SQL in some ways. But I think it is, because it's trying to solve a much, much smaller problem than SQL is trying to solve. [...] I think you can rethink the language and the interface, and of course, we've learned a bunch about programming and programming languages and the 40 years since SQL has been around. So I think there's some really nice things about dplyr that just make life a little bit more pleasant.

source: SuperDataScience

2 Likes

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.