SAS vs. R Discussion Prep

There is a huge amount of interest in our R user group in switching from SAS to R, and we plan to devote a meeting to it, but no one is really confident enough to lead the discussion. It will be a high-stakes meeting because the attendees will include the professor who teaches the one (SAS-based) statistics course in our department, a professor from the Statistics department (from whom I expect model-versus-model questions), and at least two other SAS-using but R-curious professors.

  1. Would anyone like to lead/join this conversation via the internet?
  2. Any recommendations for succinct, up-to-date resources on the differences between SAS and R?

The prevailing opinion in our department is "Yes, R is good, but for many things SAS is still better." This leads to the following questions for most of us:

  1. But can I use R for all my statistical analyses anyway?
  2. When, exactly, is SAS better?
    a. How is SAS better - easier and faster, or more reliable?
    b. Why is SAS better - what are the stats reasons to believe SAS over R?

I think a lot of it depends on what you use SAS for, and why people are interested in R - what are the motivations for your user group?

SAS and R are both great for basic analyses, but people often find SAS can become very unwieldy when they try to go beyond the basics and do more advanced analyses.

I work for Mango Solutions, and we see lots of organisations moving from SAS to R. I can't speak for academia, but in industry the main motivations seem to be the cost of SAS licenses, matching the skills of new graduates (SAS is being taught less and less, while R is taught more), and availability of the latest analysis techniques - things typically take a while to make it into SAS, compared to R, where anyone can release a package on CRAN quickly. Certain R packages like shiny and ggplot2 are also motivators.

In response to question 1 in your second set of questions, I'm not aware of any statistical analyses that can be done in SAS that cannot be done in R. The opposite is more likely, if anything!

2a is a tricky one. I think SAS is easier to read than R for someone who doesn't already know the language, and is certainly easier to pick up. However, that ease trades off against R's greater flexibility - and with the tidyverse packages and associated learning resources, R is a lot easier to learn than it used to be.

2b - I can't think of any stats reasons to believe SAS over R. That said, we sometimes see hesitance over the fact that anyone can submit a package to CRAN, which can be a bit of a worry to people more used to proprietary software than open source. I think it's a bit of a mindset shift; typically the most widely used and well-established packages have had so many people testing and using them that this isn't an issue.

All of the above said, I wouldn't say necessarily "R is better than SAS" or "SAS is better than R" - it all depends on what you need it for, and why you're interested in moving.

Hope that helps!


From a programming perspective:

As someone who learned SAS after already knowing how to program, I found it something of an abomination. While R allows you to use poor programming techniques, SAS practically requires them (I'm ignoring its "C++"-like interface, which I have never used, so that may be completely different). Using the SAS Macro language means getting things done despite the language, not because of it.

In R, I used to have to dig into how to pull a p-value from a linear model using the matrix embedded in the list returned by summary.lm. Getting anything in SAS felt like that every single time.
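To make that concrete, here's a minimal sketch of the kind of digging I mean (a toy model on the built-in mtcars data, not anything from an actual analysis):

```r
# Fit a simple linear model on a built-in data set
fit <- lm(mpg ~ wt, data = mtcars)

# summary() returns a list; the coefficient matrix inside it holds the p-values
coef(summary(fit))                        # Estimate, Std. Error, t value, Pr(>|t|)
coef(summary(fit))["wt", "Pr(>|t|)"]      # just the p-value for wt
```

These days broom::tidy(fit) hands you the same numbers as a tidy data frame, which is a lot less digging.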

In the end, I think becoming very good at SAS would make you an excellent SAS programmer. Becoming very good at R should make you a better programmer, period.

From a statistics perspective:

I have admittedly narrow expertise in SAS vs. R in academia, but most of that experience was being taught how to run a statistical procedure in SAS and then turning around to figure out how to get the same numbers in R, so that I could still do the analysis once my academic SAS license expired. I had several professors who treated the SAS way of doing stats as the One True Way. I didn't always know enough to assess that judgement at the time, but the pieces I have dug into, then and since, suggest they had a pretty narrow view.

The most obvious example is Type I / II / III / IV sum of squares in ANOVA/regression. You can find volumes of discussion about this that I won't recap here. However, R can make it difficult to get exactly the numbers SAS gets without using just the right package and configuring it just so. I think that is the source of some academic distaste for R -- why do they use the "wrong" methods? But when you get down to how people actually use regression in "the wild", all the methods are wrong and you have to either look at more complex methods or just accept that being "kind of right" is often good enough. So, getting the textbook answers just isn't that important.
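For anyone who wants to see what "just the right package and configuring it just so" looks like, here is a hedged sketch using the car package; the data frame and variable names below are made up for illustration:

```r
library(car)

# SAS's Type III tests assume sum-to-zero contrasts; R's default is treatment
# contrasts, so this option has to be set before fitting the model.
options(contrasts = c("contr.sum", "contr.poly"))

# `my_trial`, `yield`, `treatment`, and `block` are hypothetical
fit <- lm(yield ~ treatment * block, data = my_trial)

anova(fit)                 # R's default: sequential (Type I) sums of squares
Anova(fit, type = "III")   # partial (Type III) tests, as SAS reports by default
```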

From a capability standpoint:

The most persuasive thing SAS has going for it is that it natively handles data sets bigger than R can hold in memory, and can analyze them as-is. In R, you have to step outside base, but not very far, from what I've seen of Microsoft's XDF tools.
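As one illustration of "not very far outside base" - this is my own sketch using the arrow package rather than the XDF tools mentioned above, and the directory of Parquet files is hypothetical:

```r
library(arrow)
library(dplyr)

# Open a larger-than-memory collection of Parquet files without reading it into RAM
trials <- open_dataset("data/trials_parquet/")

# dplyr verbs are pushed down to arrow; only the summarised result is collected
trials %>%
  filter(year >= 2015) %>%
  group_by(site) %>%
  summarise(mean_yield = mean(yield, na.rm = TRUE)) %>%
  collect()
```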

I believe SAS also addresses some issues with data control/auditing that are probably not well-handled by R directly. I would argue that there are other tools focused on making sure no one is fudging your data, so R doesn't need to do that.

In summary:

I've answered neither of your original questions. Well, I guess it's fair to assume that I've answered #1 indirectly, by making it clear that I'd be a terrible choice to lead this discussion. :slight_smile: I think the main reason to stick with SAS is that it's what you've always used, or it's what your adviser used, or your organization has a lot of investment in using SAS "right".

You can generally duplicate everything SAS does in R, but focusing on that will give you a skewed view of R, because the people who are making R better often don't care about duplicating results they may not agree with. It's like saying that I should use ABC Painting because their lead paint is top-of-the-line.


Thanks, Nic! Our group touches on a lot of R topics, but most of the users in our department (Agronomy) run some sort of Randomized Complete Block Design experiment, usually leading to ANOVA with Tukey's HSD, or mixed-effects models and contrasts. The book for our departmental stats class is Design of Experiments: A No-name Approach.
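For reference, a minimal sketch of that RCBD workflow in R (the data frame `rcbd` and its columns are hypothetical; this is just the shape of the analysis, not a recommendation):

```r
library(emmeans)

# RCBD with block as a fixed effect
fit <- aov(yield ~ treatment + block, data = rcbd)
summary(fit)

# Tukey-adjusted pairwise comparisons among treatment means
emmeans(fit, pairwise ~ treatment, adjust = "tukey")

# Mixed-model version with block as a random effect (lme4)
# library(lme4)
# fit_mm <- lmer(yield ~ treatment + (1 | block), data = rcbd)
```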

Time series analysis is also common in our discipline (or should be).

There was a student who ran an ANOVA in SAS and found a significant relationship, then ran the same analysis in R and did not find a significant relationship. The conclusion was that R had failed and was the inferior software. Pointing out that SAS and R use different default assumptions did not help the case, because then R just became the "harder to understand" software. Sometimes the R^2s we get in R are not as "good" as the ones we get in SAS. These anecdotes spread fast and turn many people away from using R for their stats.

Others want to truly understand the statistical differences and those are the people I expect to participate in the user group discussion.

Thanks, other Nick! Your "statistics perspective" is exactly where I think most of our group members are. They want to switch to R because they can't buy SAS forever, but they want their answers from R to be identical to their answers from SAS. But getting there takes so much effort that they assume they are doing it wrong or it's just not worth it.

We do have some members who are committed to making the switch from SAS to R and these are the ones that will show up with very specific Type I / II / III / IV sum of squares in ANOVA/regression questions that make me want to cry.

I haven't used it myself, but I've seen recommendations for translating SAS -> R using this book:

I'm not sure of the quality of the R code, though, and I doubt it's tidyverse-centric, so it doesn't necessarily help convince someone that R is better.

Edit: And yeah, my homework probably took 3-5 times as long as it would have if I had just been happy to do it in SAS. The fact that SAS had a bit of a stranglehold on the master's program in my department is part of where my distaste for it comes from.


My contribution might not apply much to Agronomy, but when I teach the R class at my local university, I tell my students in the first class that they also need to take the SAS class. I tell them this because there are a lot of good, high-paying jobs at businesses that have decades of SAS code built and validated for regulated trials. Those businesses aren't likely to switch anytime soon, because validating new code routines is expensive.

We had a visitor talk to our students a few weeks ago about the regulated trials consulting firm he runs. They employ 50 people and hire 2-3 each year. The technical part of their interview is almost entirely SAS based. No mention of R anywhere.

So one argument for students learning SAS is that there are jobs available for SAS programmers.

For established academics, I'm not sure the economics work quite the same way.


I haven't used SAS enough to feel confident teaching with it, but a resource is

R for SAS and SPSS Users, Robert A. Muenchen, 2nd ed.

The statistics section in this book covers the various assumptions behind why results differ between R, SPSS, and SAS - in particular, R's default being Type I (sequential) sums of squares, whereas in SAS you might be expecting Type III. To quote a few paragraphs from p. 633:

"17.13 Sums of Squares
In ANOVA, SAS and SPSS provide partial (type III) sums of squares and F-tests by default. SAS also provides sequential sums of squares and F-tests by default. SPSS will provide those if you ask for them. R provides sequential ones in its built-in functions. For one-way ANOVAs or for two-way or higher ANOVAs with equal cell sizes, there is no difference between sequential and partial tests. However, in two-way or higher models that have unequal cell counts (unbalanced models), these two sums of squares lead to different F-tests and p-values.
The R community has a very strongly held belief that tests based on partial sums of squares can be misleading. One problem with them is that they test the main effect after supposedly partialing out significant interactions. In many circumstances, that does not make much sense. See the [64] for details. If you are sure you want type III sums of squares, ..."

So a big part is understanding the functions you are working with, rather than treating them as a black box.
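A tiny made-up example of the order dependence Muenchen describes - with unbalanced cells, R's sequential (Type I) tests from anova() change when you reorder the terms, which is exactly what surprises people expecting SAS's Type III output:

```r
set.seed(1)

# Deliberately unbalanced two-way layout (cell counts 4, 2, 2, 4)
d <- data.frame(
  A = factor(rep(c("a1", "a2"), each = 6)),
  B = factor(c("b1", "b1", "b1", "b1", "b2", "b2",
               "b1", "b1", "b2", "b2", "b2", "b2")),
  y = rnorm(12)
)

anova(lm(y ~ A + B, data = d))   # A tested first, B adjusted for A
anova(lm(y ~ B + A, data = d))   # different sums of squares and F-tests
```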


To me, the "natively handles bigger data sets than R" issue is more a problem of a decade ago; the world has moved on. I tend to feel that if you need to analyse all the data at once and it is more than the memory of one machine can hold, then fundamentally it is a cluster/cloud problem these days.

To me the primary distinction is if you are wanting to do the same thing other people are doing under the same conditions, SAS's environment provides security. If you are wanting to do things people haven't done, R's pace of development provides opportunity.
