Question for office hour: R vs Python


#1

This is great. RStudio is my favorite of all time, and this group is awesome.

About me:
R user for 6+ years

My question: R vs Python
Python is replacing R. If you don’t know Python, you can’t get a job!

The main complaint is that R is SLOW. Python is fast, but has no IDE close to beating RStudio. Packages like Numpy and Scipy are spin-offs from R.

As a leader in the R community, what are your plans to improve R? How to sustain the competition without losing?

Thanks,

-M


#2

Hey @mike ,
Is there any data to back up the claims about Python and R? I.e. “you can’t get a job”?


#3

Actually, Emacs is much better than RStudio :wink:


#4

@taraas I use indeed.com with the term “data science”. https://www.indeed.com/jobs?q=data+science&l=

Nine out of 10 jobs ask for Python. They use NumPy, pandas, and SciPy for data manipulation, even though R is much better at data manipulation (indexing at 1 not 0). I was told scikit-learn in Python is faster and friendlier than its R equivalent (caret?).

If you are doing deep learning, CV, or NLP, then TensorFlow, PyTorch, and Theano in Python dominate over R. R doesn’t have many packages for this, and the ones that exist are slow.

Start-up companies are using Python for their whole pipeline, meaning web scraping, ETL, and presenting deliverables to the client. And yes, they do use R, only to make ggplot2 graphs.

In the old days, it was R vs. SAS. Nowadays, places loyally using R are academia and med schools.


#5

Are you currently employed & using R or Python?


#6

I don’t think Python is replacing R, per se – rather, I see it more as two camps from which data science is emerging:

  1. The statistics side, where R is most commonly used (alongside other languages like SAS, SPSS, Stata, and others)

  2. The computer science side, where Python is most commonly used as the interface to machine learning libraries.

I don’t think the R camp has been getting smaller, but I do think the Python camp has been getting relatively larger as you have a lot of practitioners with training in software engineering who have been drifting towards data science, and that implies using the tools available in the software ecosystem those users are most familiar with (i.e., Python). If you’re interested in a bit of analysis on this topic, you might enjoy this blog post. Although it focuses on the growth of Python, you’ll notice that R is growing quickly as well.

As for the emergence of machine learning libraries, we’re working hard at RStudio to make sure that R also has a first-class interface, with e.g. the tensorflow and keras packages, which essentially provide parity with the Python interface to these libraries. We definitely don’t want R users to feel left out of the developments made by the machine learning community here!

As an aside, I generally disagree with the assertion that R is slow; I’d argue that it’s ‘fast enough’ for most tasks, and packages like dplyr help make larger datasets more accessible within R. (Python itself is often criticized as a ‘slow’ language, but packages like numpy and scipy make it possible to efficiently manipulate data structures as well). In that realm, RStudio will continue to work hard on producing R packages that make it easy to efficiently clean, transform, manipulate and visualize data.


#7

I keep hearing this complaint that “R is slow” - even from my professors. But while that may have been true years ago, is it really true now?

It seems like there are all sorts of steroids you can inject R with to make it fast enough for most tasks.

If you have a giant dataset, you can just store it in a MySQL server or something and push queries from R.

If you have complex computations, just profile your code and use the Rcpp package to translate the bottlenecks to C++.

You can use dplyr and/or data.table for manipulating large data sets in memory.

With all these add-ons, is it really true that R can no longer be competitive? Python may have an edge in that it’s easier to integrate data analysis done in Python with a larger tech product. But this seems like a trivial issue as well.
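
As a rough illustration of the first suggestion above (keep the data in a database and pull back only the summary), here is a sketch in Python. It uses the standard-library `sqlite3` module with an in-memory database as a stand-in for a MySQL server, and the `sales` table and its columns are invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")  # hypothetical table
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.5)],
)

# The aggregation runs inside the database; only one row per region comes back.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(rows)  # [('north', 15.0), ('south', 7.5)]
```

The analysis session only ever holds the small result, not the full table, which is the point of pushing the query down.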


#8

In my experience, many professors who use R try to use it the way they would some other programming language (lots of loops, poor vectorization, multiple steps when one would do), or they are just repeating word of mouth. It also may be that they self-ported an algorithm that they are very familiar with and that has been extensively tuned in another language, and didn’t get good results on the first pass.

As you stated, there are lots of ways to speed up R if needed. But there’s no sense in doing so until/unless the speed of your code is actually slowing you down!
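
The same point carries over to Python. As a rough sketch (the function names and input size are made up for illustration, and timings will vary by machine), compare an explicit loop with the built-in `sum`, which in CPython does the same work in compiled code:

```python
import timeit

values = list(range(100_000))

def loop_total(xs):
    # Element-by-element accumulation in interpreted code.
    total = 0
    for x in xs:
        total += x
    return total

def builtin_total(xs):
    # Same result, but the iteration happens inside the built-in.
    return sum(xs)

# Both approaches must agree before comparing their speed.
assert loop_total(values) == builtin_total(values)

loop_time = timeit.timeit(lambda: loop_total(values), number=20)
builtin_time = timeit.timeit(lambda: builtin_total(values), number=20)
print(f"loop: {loop_time:.4f}s  built-in: {builtin_time:.4f}s")
```

Measuring first, as here, is also a cheap way to check whether a piece of code is worth speeding up at all.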


#9

speed is to some extent a function of the fluency of the author in the respective language; for instance, i’m more comfortable with complex problems in R than in python, so my R code is much faster than my python code. probably python would win if someone is really skilled in both, because base python is probably faster than base R, but really as noted so much depends on packages used, code approach, etc.


#10

Some of my opinions on industry and academia trends, by no means right or perfect.

On the industry side

Advertising guys doing machine learning:
Recommender systems are very popular. They focus on ad click-through rate and online learning. Their datasets are sparse and very large, so you need a language that’s very fast. On the smaller scale, you may be A/B testing between two products.

The deep learning guys:
Neural networks, computer vision, LSTMs, and more recently, at the frontier, GANs.

The NLP guys:
Topic models with very large datasets, and recently, more interest in sentiment analysis. I haven’t seen too much on sentiment analysis, even in Python.

Facebook and Google are using Python.

On the academia side

I see quite a few CS grads tenured as statistics professors, which leads me to think that computing will continue to play a large role in statistics. Here’s a list of NIPS 2017 accepted papers.

In summary
My narrow opinion tells me that deep learning is the next wave in the field.


#11

The RStudio crew is doing amazing work in improving R and they don’t seem to be slowing down. Python is gaining popularity because there are more software guys getting into applying ML / Data algorithms than there are Statisticians. In terms of language strength I don’t know if there is anything inherently better about Python vs R. (Someone who uses both should comment)

If anything, what I’ve found is that if I can’t get something to work correctly in R, it has always been me that’s at fault. If I switch to Python I don’t see that changing :slight_smile:


#12

When I say slow, I don’t mean just clock time, but also how the two languages were designed.

In R, all variables are stored in the namespace. In Python, the story is different,

# Python example
>>> a = [1, 2, 3]
>>> b = a
>>> b[0] = 10
>>> a
[10, 2, 3]

It’s passing the address of a to b, so changing b will change a. You can look up shallow copy vs. deep copy.
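
To make the distinction concrete, here is a minimal sketch using only Python's standard `copy` module (the variable names are mine, for illustration). Plain assignment binds a second name to the same list, `copy.copy` makes a new outer list whose inner objects are still shared, and `copy.deepcopy` copies everything recursively.

```python
import copy

a = [[1, 2], [3, 4]]

alias = a                   # no copy: both names refer to the same list
shallow = copy.copy(a)      # new outer list, inner lists still shared with a
deep = copy.deepcopy(a)     # fully independent copy

alias[0] = [10, 2]          # replacing a slot through the alias changes a too
shallow[1][0] = 30          # mutating a shared inner list also shows up in a
deep[1][1] = 40             # touches only the deep copy

print(a)        # [[10, 2], [30, 4]]
print(shallow)  # [[1, 2], [30, 4]]
print(deep)     # [[1, 2], [3, 40]]
```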

Another example, it’s ok to initialize vectors in R like this,

# R example
x <- y <- z <- c() 

In Python, this is passing addresses again.
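
A minimal sketch of what that means in Python (names are illustrative): chained assignment binds all three names to one list object, so an append through one name is visible through the others, while separate literals give independent lists.

```python
x = y = z = []          # one list object, three names
x.append(1)
print(y, z)             # [1] [1]
print(x is y is z)      # True

p, q, r = [], [], []    # three separate list objects
p.append(1)
print(q, r)             # [] []
```

R's ordinary vectors behave differently here because R copies on modification, which is why the R one-liner is safe to use.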

R wins in my book, it’s more intuitive. But, when your a is, say, 1GB, what will you do?


#13

@mike R uses addresses as well. For example, using the address function from data.table, I see:

> library(data.table)
> x <- y <- z <- c() 
> address(x); address(y); address(z)
[1] "0000000000180788"
[1] "0000000000180788"
[1] "0000000000180788"

(Other packages, like pryr, have functions like data.table::address as well.)

Similarly, I can do

> DTa = data.table(matrix(, 1e6, 100))
> DTb = DTa
> address(DTa); address(DTb)
[1] "00000000061BE2C8"
[1] "00000000061BE2C8"
> object.size(DTa)
400016912 bytes
> system.time(DTb[1, (1) := FALSE ])
   user  system elapsed 
      0       0       0 
> DTa[1, 1]
      V1
1: FALSE

That is, I make DTb a reference to large object DTa and see modifications of one in the other.

Generally, if you think about what you’re doing, you can make it fast enough.


#14

Hadley’s Advanced R book has a good discussion of the basics of R’s memory usage:
http://adv-r.had.co.nz/memory.html#object-size

As @Frank says, it’s more complicated than just assuming that every object has a distinct memory space.


#15

I’d phrase this even more strongly than Kevin - the R camp is still growing, just not quite as rapidly as the Python camp. For example, take the number of questions on SO tagged with R and Python:

R is definitely still growing! You see a similar story if you look at the total number of downloads from CRAN.

It’s hard to pin down exactly how many people are using R, but all of the signals I’ve looked at (SO, CRAN downloads, GitHub usage) show that R is continuing to grow.


#16

See also this figure from David Robinson’s recent stackoverflow blog post: https://stackoverflow.blog/2017/09/06/incredible-growth-python/

R is one of the fastest growing languages!


#17

All I see from this graph is that Python users are getting more confused at a higher pace! :joy::rofl:


#18

From the data, it’s hard to tell why people are downloading R.

It could be a bunch of grad students using R for a course requirement. Maybe! We don’t know!

Data science bootcamps are attracting Ph.D. grads from all fields. Here’s a look at an 8-week data science boot camp for $16,000:

Part 1) Python programming and SQL
Part 2) Machine learning
Part 3) NLP, etc. (deep learning: neural network, reinforcement learning, cv)
Part 4) a project

They are using Python for data science. Yes, R is used here and there. But, for machine learning, they use scikit-learn.

We are in the age of deep learning! R just doesn’t have enough, yet. That’s what I’m trying to get at with this post.


#19

To be honest, when you talk about any type of ‘data science’ in industry, you’re often talking about things like Microsoft’s PowerBI, SQL Server and so on. The gap between the data science I’ve studied and practiced, for example (Bayesian modelling, Stan, R, JAGS etc etc) and what is actually out there is pretty big (it’s often just descriptive stats of things of interest to business managers – like an interactive PowerPoint presentation). So both Python and R are quite exotic for most people (at least where I work, in São Paulo). I agree that Python has advanced very quickly lately in Deep Learning/ML, but it’s early days for the application of these things to everyday business activities, so who knows? And like we’ve been discussing over on this thread https://community.rstudio.com/t/r-python-in-ide/279/5, the best thing is to integrate all the best tools (Python and R are not sooo different) in, ideally, a great IDE like RStudio.


#20

One other thing about the growth of Python vs. R using SO as a proxy is that Python is truly an all purpose language that is used for web development and used for writing shell scripts. It’s difficult to parse out how much of the Python growth is due to data science itself. David Robinson said he would be writing a blog post about that soon, and I’m looking forward to seeing it.