Are dplyr results guaranteed to be consistent with and without dbplyr?

This may be more of a theoretical question/issue than a practical one (i.e., it could come up in the right circumstances, but the circumstances may not exist currently. So no reprex, either)

While searching for official documentation on assumptions/guarantees/etc that dplyr makes about row order, I came across dplyr issue #3989. In that issue, someone had made assumptions about row order that were not guaranteed at the time by dplyr, and an update to dplyr had broken their (and others') code. dplyr was subsequently updated to fix this by meeting their assumption.

Having a SQL database background, I'm inherently very cautious about row order in general with data frames and tibbles because the SQL standard makes no guarantees about how the results of user queries will be ordered by default. When users are not explicit in their SQL queries, the order of their results will be implementation dependent.

This brings me to the (potential) issue at hand. At least some portions of dplyr are accommodating users by making assumptions or guaranteeing certain things about row order, which can affect their analyses (as illustrated in the issue above). The problem is that when using dbplyr, my understanding is that the same dplyr functions are being translated to SQL queries being executed by the database. Given that these databases may have different assumptions or guarantees, does this mean that dplyr is not guaranteed to return consistent results depending on how a user manages and processes their data? In other words, could a user write "fragile" code that could break if they decide to transition their data from csv files to a RDBMS?

To be honest, relying on row order is "fragile" almost by definition.
But to answer your question directly - I think you are right in assuming that dplyr will not guarantee row order when applied to RDBMS for exactly the reason you've mentioned - SQL standard doesn't guarantee anything, so why would dplyr? That's my understanding though, and not some sort of knowledge about internals at dplyr.

1 Like

Yes, changing backends can break your code.

Consider dplyr::slice(), which is a common tool for in memory tibbles, but will throw a failure for linked RDBMS tibbles.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.