tidyverse/dplyr in regular R application

Can tidyverse/dplyr (with pipes) be used in regular R application development, or is it better suited to interactive programming or scripts that carry out particular calculations? Are there any limitations to using tidyverse/dplyr (with pipes) as part of a regular R application? Any recommendations?

It depends on what you want to achieve.

In general, why not? That's how most of the services run at the company I work at, for example. tidyeval makes it quite easy to program over dplyr, so that's not a big issue either. There is, of course, some bias, but for me piping makes it much easier to understand the flow of the program. And since R is rarely, if ever, used for blazing speed, the added pipe overhead is never a bottleneck.

So, all in all, I would say that there is nothing stopping you from trying, but it depends a lot on your use-case.
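
To make the tidyeval point a bit more concrete, here is a minimal sketch of a function that programs over dplyr with the embrace operator. The helper name mean_by and the use of mtcars are purely illustrative (nothing here comes from the services mentioned above), and it assumes dplyr >= 1.0.

```r
library(dplyr)

# Hypothetical helper: average one column within the groups of another.
# {{ }} ("embrace") forwards the caller's unquoted column names to dplyr.
mean_by <- function(data, group, value) {
  data %>%
    group_by({{ group }}) %>%
    summarise(mean_value = mean({{ value }}, na.rm = TRUE), .groups = "drop")
}

# Callers use bare column names, just as they would with dplyr directly.
res <- mean_by(mtcars, cyl, mpg)
```

The same function works for any data frame and any pair of columns, which is what makes this style usable in application code and not only at the console.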

For things like R packages, keep your dependencies as minimal as possible. I'm not an R developer, nor have I made any R packages, so I can't give you advice from personal experience.

For interactive programming, however, the pipe is really useful. It lets you read the code from left to right.

Local Applications

For your own or in-house applications, use whatever packages make your life easier. I have some CLIs that make extensive use of the tidyverse packages.

Public Packages

For packages that are aimed at public distribution and use, don't depend on the tidyverse packages. Developers are rarely happy about importing a library that brings a large number of dependencies along with it.

I disagree with this completely. There's nothing about distributing a package that makes tidyverse less fit for purpose. Actually quite the opposite. If you are building a package that is designed to work with other tidyverse packages then you should certainly build your package using tidyverse. Attempting to write a tidyverse compatible package without using tidyverse is amazingly foolish.

If you mean that you shouldn't put tidyverse in Imports: then I agree completely.

But I'm not sure how putting dplyr, purrr, or forcats, for example, in Imports: is bad, especially if you have a good reason to use some of these packages.

Unless the package was intended to integrate with and depend on the tidyverse packages, importing them may add dependencies that the developer never wanted.

As a strawman example, take a package that does some interesting categorical analysis with data frames and vectors. Perhaps it was developed using tibble and forcats, but the interface doesn't require tibbles and nothing about its operation requires them. They're a convenience. Requiring this package is then going to bring in a lot of extra dependencies that may not be welcome to the application developer.

In short, adding forcats or purrr is going to pull in tibble and that's a bunch of dependencies that aren't relevant if the package isn't intended or doesn't need to interoperate with the tidyverse.
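
One way to check claims like this on your own machine is to count recursive dependencies. This is a rough sketch, assuming the packages you ask about are installed locally: it reads installed.packages() rather than CRAN, so the numbers reflect the versions in your library. The helper name count_deps is made up for illustration.

```r
# Hypothetical helper: how many packages does each entry of `pkgs` pull in,
# recursively, through Depends, Imports and LinkingTo?
count_deps <- function(pkgs, db = installed.packages()) {
  deps <- tools::package_dependencies(
    pkgs,
    db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  lengths(deps)
}

# Base packages are always installed, so this runs anywhere; swap in
# c("forcats", "purrr", "dplyr") if you have them installed.
n <- count_deps(c("stats", "utils"))
```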

Thank you for all answers!

On the one hand I agree that too many imported packages (some unneeded, but part of library) are not the best option. On the other hand I see benefits when dplyr with pipes is used (e.g. code readability).

I think I will go with a compromise: I will import only selected packages instead of the whole tidyverse, keeping in mind that some of them pull in others, such as tibble.

A new development is that purrr now only Suggests tibble. This should help.

library(tidyverse)

is a convenience interactively, because it keeps you from getting hung up on the rain forest being missing from your ecosystem.

For packages, I agree that better practice is to import only those packages from tidyverse that are needed and, for those used only a few times, to :: them.
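
As a sketch of the :: approach inside a package function: the function name combine_unique and the sample data are hypothetical, and the example assumes dplyr is listed under Imports: in DESCRIPTION. Nothing is attached with library() and nothing is added to NAMESPACE.

```r
# Hypothetical package function: stack two data frames and drop duplicate rows.
# Every dplyr call is namespace-qualified, so the dependency is explicit at
# each call site.
combine_unique <- function(x, y) {
  dplyr::distinct(dplyr::bind_rows(x, y))
}

a <- data.frame(id = c(1, 2), g = c("a", "b"))
b <- data.frame(id = c(2, 3), g = c("b", "c"))
res <- combine_unique(a, b)  # the shared row (id = 2) appears once
```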

On the other hand, a large package library is hardly expensive in terms of any kind of resource. Once a day

update.packages()

(which can even be cron'd) is hardly a burden, and where a newer source package may be problematic because you're on macOS, it's easy to skip it and fall back on the earlier binary package.

To add to/emphasize what others have stated:

If you build packages, be careful about which dependencies you choose. There are some tidyverse packages, like purrr and tibble, that are intentionally kept slim for developers and that might be OK for packages (though I rarely find it useful to add them). Try to avoid dplyr and tidyr whenever possible, though, as they are really "heavy".

Do you have some kind of reference for this statement? It seems to make assumptions. I understand why you would not want to include unnecessary packages (like importing all of tidyverse when you only use 2-3 packages within it), but I don't see the harm in using these packages. Do you know people who find a package they want to use and start to install it, only to stop the installation when it says they will have to install 10 packages as opposed to, say, 1 or 2? Personally (and this may be the wrong approach), I would just let it install all of them and then carry on with what I wanted to do with the new package.

When I am writing functions to solve some kind of problem, my most creative solutions typically use tidyverse tools. So if I restrict myself from using these packages, it will just force me to not use the best tool in my proverbial toolbox (not a statement that these packages are the best, just that they are the ones I, personally, use best)...

So, I guess my question is, what kind of data do you have to make that kind of assertion?

I would cite a few instances in this thread about importing tidyverse.

If importing many libraries is of no consequence to the developer, I would not expect to see that recommendation.

Similarly, I would not expect a recommendation of importing dplyr if you're only using setdiff?

Yes. There are production environments that have restrictions on the dependencies that can be imported or used or even built for the target system. Additionally, some shops have to whitelist the packages and versions in their dependencies.

Interesting, I was not thinking about it from a production system standpoint. And just to be clear, I wasn't advocating for importing all of tidyverse, as your first quote of mine seems to imply when read out of context. I agree that importing all of the tidyverse is likely not required (or good practice) unless the package is an extension of the tidyverse.

Also, I agree that importing dplyr for a single function may not be worth the dependency (especially when you can just use importFrom for a single function), but if you have to reinvent the wheel just to avoid the dependency, I think it is a more complicated issue.

Please read the following paragraph under the assumption that we are talking about library code, that is packages that do generic tasks and are aimed at a larger audience.

In my experience, you usually don't have to reinvent the wheel. dplyr doesn't offer very much that you cannot do in base (though it's often more awkward there). dplyr also comes with (relatively) a lot of strings attached. If you have to work a lot with tabular data (aggregating, transforming, etc.), consider importing data.table instead, which is written in plain C and only depends on base R (as opposed to dplyr, which is an ecosystem of its own).
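
As a small illustration of the "you can do it in base" point, here is a sketch of a filter / mutate / grouped-summary sequence written with base functions only. The dataset (mtcars), the derived column kpl, and the conversion factor are just an example; the comments name the dplyr verbs each step roughly corresponds to.

```r
# Keep the more powerful cars                      (roughly dplyr::filter)
cars <- subset(mtcars, hp > 100)

# Add a derived column, km per litre from mpg      (roughly dplyr::mutate)
cars <- transform(cars, kpl = mpg * 0.4251)

# Mean of the new column per number of cylinders   (roughly group_by + summarise)
agg <- aggregate(kpl ~ cyl, data = cars, FUN = mean)
```

It is a little more awkward to read than the piped dplyr version, but it adds no dependency at all.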

Look for example at sf. That's a pretty well designed package that is pretty close to the tidyverse and even it avoids dplyr dependencies (though it works well together with dplyr). Also note that tidyverse packages themselves avoid importing the "heavy" packages tidyr and dplyr.

Now, if you write analysis packages for yourself, there is nothing wrong with using whatever packages you like (but you still do yourself a favor by keeping dependencies slim).

@hoelk a NYC radio host back in the 70s, Barry Farber, used to say

That's why they make vanilla and chocolate

Your comments have provided food for thought. There are arguments for bare metal lean and there are arguments for with-all-the-options. Everyone has to find her own way, and some poor souls working in multiple contexts have to be able to thread their way through multiple ways.

Thanks for the encouragement for everyone from time to time to think about what they should have in their toolchest!

I sound a bit as if I'm spreading dplyr hate. I think dplyr is pretty awesome; I just think that if you use it inside a package that is to be reused by other people, you need good arguments for it (and a few mutate, select and group_by calls don't cut it).

If you are doing something very high level and are already consolidating various data sources, it's a different topic (I use dplyr to work with the above-mentioned sf, for example).

After doing a bunch of translations between tidyverse and base code on SO, I realised that tidyverse code is not necessarily more compact on average; it might even be longer (thanks in good part to these group_by / ungroup couples). It might be more readable, but I'm not even so sure about that (except for the formula notation for functions, which is really a blessing). It's also generally slower than base, with exceptions, and pipes are harder to debug.
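
To show what such a translation can look like, here is a sketch of one common case, group-wise centering, written both ways. The dataset and column names are only an example, and this says nothing about speed; it just compares the shape of the code.

```r
library(dplyr)

# tidyverse: a grouped mutate needs a group_by() / ungroup() pair.
tidy <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_centered = mpg - mean(mpg)) %>%
  ungroup()

# base: ave() computes the group mean for every row and keeps row order.
base <- mtcars
base$mpg_centered <- base$mpg - ave(base$mpg, base$cyl)
```

Both versions produce the same centered column; the base one is shorter here, although that will not hold for every task.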

What tidyverse code is really good for, in my opinion, is typing as you think: you can edit, insert a step into an existing pipe chain, and structure your code into steps that each do only one thing. I believe that in packages made for public consumption it should be avoided UNLESS they are built specifically to work in the tidyverse ecosystem (and then I'd still avoid pipes).

Thanks for swimming against the tide of popular sentiment here! :grin:

I think you're right in the sense that there's an ecosystem in which it shows to best advantage, one in which you are mainly doing as much as possible in a consistent grammar to borrow a favorite term. And a denizen from outside the ecosystem won't reap the same benefits until they accept Tidy as their own personal Savior (insert gentle self mocking emoticon here).

I think you're wrong about %>% which is not a | but performs the same office of passing stdin to stdout, something it's been doing for me for over 25 years. It reduces the plagues of temporary variables and the lispuses (LISP locusts consisting of nested nested ... nested parens).
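
For anyone who hasn't seen the three styles side by side, a toy sketch (the vector and the steps are made up; only magrittr is needed):

```r
library(magrittr)

x <- c(4, 9, 16, 25)

# Nested calls: has to be read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Temporary variables: readable, but leaves intermediates lying around.
roots <- sqrt(x)
avg <- mean(roots)
with_temps <- round(avg, 1)

# Pipe: reads left to right, with no intermediates.
piped <- x %>% sqrt() %>% mean() %>% round(1)
```

All three give the same result; the difference is only in how the steps read.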

I'm not going to go so far as to accuse you of being an Emacs victim (another :grin:), but you haven't pried the pipe out of my cold dead hands yet.

Cheers!

Reducing the need for temp variables is an advantage I forgot to mention, and it's an important one, at least in interactive code; in a well-programmed function this should not be an issue. Besides all of this, I love the tidyverse very much, because most of my work is interactive.