how to select among hundreds of R packages??

hi all! ) i'm just now beginning to learn R and have a couple of really stupid questions..) after googling for useful packages for advanced analysis/visualisation of relationships between 2 variables, i came up with these:

corrplot
ggpubr
psych
GGally
ggstatsplot
easystats
rstatix
correlation
stats
Hmisc
SmartEDA
DataExplorer
corrr

there seems to be no consensus out there on the interwebs regarding what to use for correlations and scatterplots... so... here are two questions for R experts:

  1. how do you go about selecting the right package in R for a particular test you need to run, given such an abundance of options and no consensus on what's best? (for any stats test, not just correlation)

  2. how do you assess the reliability/quality/integrity of a package (besides the mix of features it offers to analysts) before deciding to add it to your R environment, given there are so many packages out there? do you personally have to inspect the raw code? i can't seem to find any community-based peer-review quality-assessment systems out there...

i work for a big traditional organization and need to be able to justify my selection of analytical tools to my superiors. plus, i wanna make sure i use only high-quality tools.

thank you guys! i've spent 20 years in data analysis (SPSS in academic research, mostly). still trying to wrap my head around R.

Big question. Here's how I'd approach it at a high level of generality, without knowing the nature of the organization's data, the types of analyses it runs, the analytic sophistication applied, or the level of devotion to embellishing results to make things look pretty.

Origin story. Linux and R both trace back to software developed in-house at Bell Labs during the 1970s: UNIX and S.

The first came about because some guys found getting computer time inconvenient. They scrounged a cast-off computer that didn't have an operating system. So one guy wrote an operating system for it, another guy invented C, and others pitched in with a design for a shell, a text parser, and a language to make flat files usable as databases and to produce reports: bash, grep, awk and the dozens of other tools that were later recreated in GNU, BSD and the other derivatives. Imagine a world where all of these tools had to be withdrawn from circulation; there would literally be no internet. One of those guys, Brian Kernighan, tells the backstory in UNIX: A History and a Memoir.

I should mention that Bell Labs staff have racked up nine Nobel Prizes, five Turing Awards, and 22 IEEE Medals of Honor, for things like the transistor, information theory, the discovery of the cosmic microwave background radiation, and the fast Fourier transform. Not to mention the collection of Emmys, Grammys and Oscars.

One of the staff statisticians, John W. Tukey, pushed the practice of exploratory data analysis as a necessary step toward understanding the questions a data set could answer before trying to make the data confirm something it could not. The nature of EDA is somewhat spontaneous and unstructured compared to confirmatory statistics. While the statistics department had a robust library of Fortran routines for a large set of statistical tasks, using it could be inconvenient for problems such as running a quick linear regression on a few dozen x,y points. As Richard Becker put it:

The idea of writing a Fortran program that called library routines for something like this was unappealing. While the actual regression was done in a single subroutine call, the program had to do its own input and output, and time spent in I/O often dominated the actual computations. Even more importantly, the effort expended on programming was out of proportion to the size of the problem. An interactive facility could make such work much easier.

But realizing a tool with that ease of use wasn't trivial. Developing an interactive front end required not only a lot of under-the-hood work but also a design decision to make S (whose open-source descendant is R) a functional language based on a formal grammar with few restrictions, and with a built-in data structure of vectors of like elements: numbers, character strings, or logical values. Along the way, functions became first-class objects, with the result that an elaborate macro process for compiling new functions into the language itself was replaced with an in-process solution. Then came the ability to link to system libraries to import implementations of algorithms written in FORTRAN and other languages.
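A minimal sketch of the two design points just mentioned, atomic vectors of like elements and functions as first-class objects (summarise_with is a made-up helper for illustration; the .Fortran()/.C() interfaces for compiled code are only noted in a comment):

# atomic vectors hold elements of a single type
x <- c(1.5, 2.25, 3.75)       # doubles
y <- c("a", "b", "c")         # character strings
z <- c(TRUE, FALSE, TRUE)     # logical values

# functions are first-class objects, so they can be passed around like data
summarise_with <- function(v, f) f(v)   # hypothetical helper
summarise_with(x, mean)
summarise_with(x, sd)

# compiled FORTRAN/C routines are reached through interfaces such as
# .Fortran() and .C() (not run here; they need a compiled routine to call)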

S became a mainstay of the statistics department for ad hoc projects, and as word got out into the larger community, requests for it grew. But it was hard to port from one OS to another, so the solution hit upon was to have it hitch a ride on UNIX, which was also in the process of disseminating broadly. Eventually, it was made available commercially.

During the course of its evolution it was being used by statisticians who were only incidentally developers: the eat-your-own-dog-food model. From Richard Becker, again:

If we hadn’t considered S our primary research tool, we would probably not have kept it up-to-date with the latest algorithms and statistical methods. S has always had new methodology available quickly, at least in part because it was designed to be extended by users as well as its originators.

That is a decent pedigree for a domain-specific language designed to be used interactively by statisticians and to accommodate new algorithms developed by statisticians. It was a monumental accomplishment, also in light of the fact that it was the part-time work of a small core team.

That work has been carried on in the open-source version, R, by some of the original participants in S and a worldwide group of other statisticians and computer scientists since 1994.

R quickly overtook the installed base of S and has an ever-growing population of packages: 10^4 and counting. Over the nearly 30 years it has been in the wild, Darwin has been at work, with packages coming into view and some surviving to be co-opted into the language itself, as happened with {magrittr}'s %>%, which inspired the native |>. Unsuccessful packages recede into the dim background for lack of maintenance. Successful packages get massively stress-tested by real-world use, and issues percolate through venues such as Stack Overflow and on to bug reports.
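As an aside, the two pipes look like this side by side; a sketch, assuming {magrittr} is installed and R >= 4.1 for the native pipe:

library(magrittr)

# the {magrittr} pipe that proved the idiom
mtcars %>% subset(cyl == 4) %>% nrow()

# the native pipe later added to base R
mtcars |> subset(cyl == 4) |> nrow()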

Long wind-up for the pitch: the good and bad news is that for any particular algorithm there is usually more than one choice, and often several. It can feel like being a foodie dying of starvation while trying to finish reading the menu.

Let's divide the menu into courses.

  1. Undergrad textbook stuff
  2. Graduate textbook stuff
  3. Publish or perish stuff
  4. Subject matter stuff
  5. User-friendly stuff
  6. Eye candy

Let's dispose of the two auxiliary cases.

For user-friendly there is pre-eminently the tidyverse. It has become what most people think of in connection with R. It does a great job of overcoming the trauma with punctuation-based symbolic manipulation that many of us carry over from school algebra. It has two drawbacks, however. First, it nudges the language in a procedural/imperative direction: do this, then do that. Second, in doing so it misdirects the user's attention from thinking about whether some plan of analysis actually carries out the purpose of the coding problem to thinking about how to get a particular sequence of steps to run without throwing an error. Functions in {base} throw simpler errors because there is less syntax to thread. Whether to encourage that depends on the user base and the willingness of the designated support person to help with problems that could be simpler.
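To make the contrast concrete, here is the same small summary written both ways; a sketch, assuming {dplyr} is installed:

# tidyverse style: a pipeline of verbs, read top to bottom
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# {base} style: one expression, read inside out
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)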

Decoration, embellishment, interactivity, dashboards, KPI animations and dot plots with smiley faces have become pervasive interests of a large part of the user base. I'll leave that with just a reference to Edward Tufte's The Cognitive Style of PowerPoint.

Next is substantive domain stuff. The Bioconductor space for the life sciences is the biggest example. The whole econometrics field labors in a garden of packages of its own; the economists, in particular, seem to feel compelled to invent their own terminology and variations on standard techniques. There are hosts of smaller examples, which you can see in the CRAN task views or with {ctv}. In searching with correlation as a keyword, you might run across {powerSurvEpi}:

Functions to calculate power and sample size for testing main effect or interaction effect in the survival analysis of epidemiological studies (non-randomized studies), taking into account the correlation between the covariate of the interest and other covariates.

If your organization doesn't do epidemiology, why even check this out?

My recommendation: in considering any tool not otherwise settled upon, check the task view characterization and the description headline. You probably don't need it.
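The task views can be browsed without leaving R; a sketch, assuming the {ctv} package is installed from CRAN:

# install.packages("ctv")        # one-time setup
library(ctv)
available.views()                # list the current CRAN task views
# install.views("Econometrics")  # install every package in a view (a large download)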

Next, let's knock off the cutting edge stuff.

Package GeeWhiz implements a novel algorithm for the detection of time-varying quandoids as described by Feather, Bed and Spread in their 2023 preprint.

Why go there?

That leaves basic and advanced textbook stuff, bread and butter work.

The basic stuff is all captured by the standard packages brought in by installation. Almost by definition, none of that stuff can be wrong. A better cor(), lm() or shapiro.test()? On a standalone basis, probably not. As part of a workflow package that brings in data and pushes it out in a particular tabular layout, other packages may have some advantage. But let those come to you; don't seek them out.
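For the correlation-and-scatterplot case that started this thread, the standard {stats} and {graphics} packages already cover it, with nothing to install:

cor(mtcars$mpg, mtcars$wt)                    # Pearson correlation
cor.test(mtcars$mpg, mtcars$wt)               # adds a confidence interval and p-value
cor.test(mtcars$mpg, mtcars$wt,
         method = "spearman")                 # rank-based alternative
plot(mpg ~ wt, data = mtcars)                 # scatterplot
abline(lm(mpg ~ wt, data = mtcars))           # with a fitted regression line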

By this time, most of the graduate level standard texts have related R packages. Want a comprehensive set of regression tools? Get Frank Harrell's Regression Modeling Strategies and the accompanying {rms} package (and {Hmisc} in any event). Just one example.
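As a taste of what {Hmisc} adds on top of the base tools (a sketch, assuming {Hmisc} is installed):

library(Hmisc)
describe(mtcars[1:4])            # richer univariate summaries than summary()
rcorr(as.matrix(mtcars[1:4]))    # correlation matrix with p-values and sample sizes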

Finally, consider how your users are going to be working. Let's take one of the standard EDA tasks and use two tools to look at the relationships among four variables in a data frame:

with pairs(mtcars[1:4])

[scatterplot matrix produced by pairs()]

and with GGally::ggpairs(mtcars[1:4])

[scatterplot matrix produced by ggpairs()]

To me, it's a matter of personal preference and the stage of the analysis I'm at. For a first peek, I'd probably use pairs(); later, a closer look with ggpairs() might be helpful.

When I run across a package that I want to take home to meet the parents, here's my checklist:

  1. How mature is it?
  2. How complicated are its dependencies?
  3. Is it a vanity project or the work of a community?
  4. What other packages suggest it?
  5. Characterized in a task view?
  6. Associated Journal of Statistical Software introductory article?
  7. Source of algorithms cited or original algorithms adequately described?
  8. How well documented?
  9. Discussions in repo or forums?

Start small and get experience with the defaults. Identify any must-have enhancements and look for a contributed package that provides them. Evaluate and run tests. Read the source if indicated.
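A few of the checklist items can be answered from the console; a sketch, using {corrr} from the OP's list purely as an example package name (the dependency calls query CRAN, so they need an internet connection):

# 2. how complicated are its dependencies?
tools::package_dependencies("corrr", recursive = TRUE)

# 4. what other packages depend on or suggest it? (reverse dependencies)
tools::package_dependencies("corrr", reverse = TRUE,
                            which = c("Depends", "Imports", "Suggests"))

# 1. and 8. maturity and documentation start with the DESCRIPTION file
utils::packageDescription("corrr")   # only if the package is already installed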


I have to say technocrat gave you an absolutely superb answer; I myself read your post and balked at the challenge of attempting to systematize a principled response...

This phrase caught my eye, though: "i work for a big traditional organization and need to be able to justify my selection of analytical tools to my superiors".

This does sound like its own issue; where big traditional orgs have to contend with the ways of the modern world, there is certainly a lot of scope for tension and unpleasant office politics. Perhaps I am overly combative, but I would not necessarily take every managerial request or bureaucratic diktat lying down; I would generally try to be assertive, challenge these, and have them themselves sufficiently justified. The choice of R packages you use would not seem to have a direct cost attached that would warrant managerial oversight on a cost-management basis, so what other concerns do management have? Risk of some type of damage? Presumably your R environment would be a sandbox, i.e. it's not a realistic prospect that you would "delete your database" by experimenting with a problematic R package. The core data at the heart of your organisation should be the concern of IT, and it should not be possible for you to do anything at all adverse to it; at worst, your sandbox would need restarting. Perhaps I suffer from a lack of imagination in this area, but does that leave management much else to concern themselves with?


wow! thank you SO MUCH for your thoughtful reply... my own standard in the past (with SPSS macros) has been "Associated Journal of Statistical Software introductory article". if an unofficial macro has that, i can sleep well using it and investing my time into mastering it. but your list of 9 consideration criteria makes a great deal of sense... i'm just now stepping into R and the amount of work required to evaluate every package added to my "stock" R seems truly overwhelming... I guess that's part of that steep learning curve for new R users i've been hearing about ))))). anyhow, thanks again for sharing your knowledge!! )

thank you for your thoughtful reply! i guess my issue is trying to figure out this whole R thing myself... after that i'll be able to articulate the justification for adding/using various R packages to myself and to my bosses - who are not dumb, just conservative in their thinking... their stakes are high - highest-level clients across various sectors - lose one major client due to poor/untested analysis, and 20% of your annual budget goes poof...

I had a similar question. I'm looking for a short answer on how to get the right download, from the list of non-64-bit options, that will work on my 2017 Mac OS laptop, and then to 'get started' by, say, getting a snippet of code to draw a plot of some probability distribution(s).

Just trying to get started. My one and only other question didn't attract any responses from the community! List of old versions of Rstudio: too many for me easily to determine the latest one for my system

Yours, Ria

@technocrat made a great historical recap of how software like R came to be developed. I enjoyed these words. Thanks!


An interesting question by the OP, as well as the responses. I'll try and add to the spirit in two directions:

  1. Aside from the packages, how do you know (base) R or any other tool is correct?

By definition, R itself is used more than any of the many packages that have been listed, and it has been worked on for decades by statistical experts. Despite this, a change was made a few years ago (around 3.6, I think) to the default method used for random sampling, when the previous method was shown to be slightly biased. The benefit of correcting this was considered greater than the loss of backwards compatibility.
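That change is still visible today: R kept a switch so that pre-3.6 results can be reproduced when necessary. A small illustration (the point is the opt-in switch, not the particular numbers):

# current default sampler (the unbiased "Rejection" method)
set.seed(123)
sample(1:10, 3)

# opt back into the old, slightly biased sampler to reproduce pre-3.6 results
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(123)
sample(1:10, 3)

RNGkind(sample.kind = "Rejection")   # restore the current default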

Other tools, such as Excel, have well-known bugs/"features" (including 1900 being incorrectly treated as a leap year, a bug inherited from Lotus 1-2-3), but I would suggest there is a greater chance of these being exposed in open-source software than in commercial products (e.g. SPSS, SAS) whose source code is not subject to outside scrutiny.

Tangentially, here's a 10 minute audio on some spreadsheet disasters:
BBC Radio 4 - More or Less: Behind the Stats, Spreadsheet disasters

  2. I think a key skill with any analytical tool or programming language is knowing what can go silently wrong, rather than just relying on the functions or packages to give you the correct result. This includes things like floating-point inaccuracies, which affect everything done on a computer; a small example follows below. This classic provides some R-specific issues:
    https://www.burns-stat.com/pages/Tutor/R_inferno.pdf
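The first circle of that Inferno is exactly this kind of silent surprise:

0.1 + 0.2 == 0.3                     # FALSE, thanks to binary floating point
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE: compare numerics with a tolerance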

Anyway, good luck in your R journey.


With more information, "if it ain't broke, don't fix it" comes to mind, since you're selling the output, not the tools. Unless the tools are spreadsheet-like, move slowly, shadowing existing projects with R implementations to build up a test bench of examples and to assess how closely the output must resemble what's currently delivered.

See the R for macOS page. Not sure why 64-bit options are off the table; my old 2015 MacBook Air seems to have no problem.

RStudio has a cloud-based SaaS tool that you can use when OS support for your version sunsets. When Apple stops supporting a great-grandparent version, so does RStudio, but the cloud version gives you independence.

To kick the tires with R, note that almost every function comes with examples in help(), often using built-in datasets such as the ever-popular mtcars. Here's a sampling.

require(stats) # for lowess, rpois, rnorm
require(graphics) # for plot methods
plot(cars)
lines(lowess(cars))


plot(sin, -pi, 2*pi) # see ?plot.function


## Discrete Distribution Plot:
plot(table(rpois(100, 5)), type = "h", col = "red", lwd = 10,
     main = "rpois(100, lambda = 5)")


## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:
plot(x <- sort(rnorm(47)), type = "s", main = "plot(x, type = \"s\")")
points(x, cex = .5, col = "dark red")

hist(mtcars$mpg)

pairs(mtcars[1:4])
# the :: is for an optional package that has
# not been loaded with library(name_of_pkg)
GGally::ggpairs(mtcars[1:4])
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2

Created on 2023-02-20 with reprex v2.0.2
