Question for office hour: R vs Python

I had Java at uni for a year and it was pretty helpful for understanding object orientation. C seems a good place to start learning about memory management. I am not interested in learning to code whole applications in C, but rather in programming (simple) algorithms that I can call from R.

More correctly, some R packages use addresses as well. R itself is call by value.

I don't think call-by-value is a great way to describe R because it implies that when you call a function all the arguments are copied. That's not true in R because functions are effectively call-by-reference, but objects are immutable so whenever you think you're modifying an object you're actually making a copy first.

2 Likes

The essence of the matter is that R itself does not have common ways to use a replacement function as call-by-reference in the way that, e.g., data.table does. The behaviour of R is call-by-value, even though the technical details are a bit more complex once we take lazy evaluation of promises and copy-on-modify into account. So even though one can say that, technically, the expressions in the arguments are stored as promises that are only evaluated when the argument is used inside the function, and that copies are only created when an object is modified, that’s not the same as being “effectively call-by-reference”.

And we’re talking language definition here, not internal mechanisms. Per section 4.3.3 Argument Evaluation of the R language definition, R is defined as “call-by-value”.

This answer by Gavin Simpson sums it up pretty nicely: https://stackoverflow.com/a/15764949/428790

3 Likes

Even though the R language definition uses that term I still don't think it's a good characterisation, because you can pass an object with reference semantics like an environment or a data.table, and it is not copied/cloned.

Regardless, I think we can both agree that neither call-by-value nor call-by-reference is a good shorthand for R's behaviour. R differs in important ways from both. Here's a little example showing a single function behaving like call-by-value or call-by-reference depending on the input:

f <- function(x) {
  x$y <- x$y + 1
  invisible(x)
}

x1 <- list(y = 1)
x2 <- list2env(x1)

x1$y
#> [1] 1
x2$y
#> [1] 1

f(x1)
f(x2)

x1$y
#> [1] 1
x2$y
#> [1] 2
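
For readers coming from the Python side of this thread, a rough counterpart may help. This is only a minimal sketch, and the function names `f_mutate` and `f_copy` are made up for illustration. Python passes references to objects, so a dict behaves much like the environment `x2` above, and an explicit copy is needed to get the behaviour of the list `x1`.

```python
import copy

def f_mutate(x):
    # Modifies the caller's dict in place (comparable to the R environment x2).
    x["y"] = x["y"] + 1
    return x

def f_copy(x):
    # Works on a shallow copy, so the caller's dict is left untouched
    # (comparable to the R list x1).
    x = copy.copy(x)
    x["y"] = x["y"] + 1
    return x

x1 = {"y": 1}
x2 = {"y": 1}

f_copy(x1)
f_mutate(x2)

print(x1["y"])  # 1
print(x2["y"])  # 2
```

The difference from R is that Python never copies on modification by itself; the copy in `f_copy` has to be written out.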

7 Likes

One question: in the future, will something like scikit-learn exist in R?

I am not sure how much effort is going into this project or how general its scope will become, but there is a very new and interesting CRAN package called mlapi, which basically tries to achieve scikit-learn-like ML workflows in R.

Also the description looks quite confident

2 Likes

@Mos_Taf There's already caret (but that depends on what you mean by "something like scikit-learn").

There is more coming in this area too.

Can you be more specific about what you are interested in?

5 Likes

I guess that since the broader announcement of reticulate it is obvious that more Python stuff will be ported directly into R, and the h2o4gpu package seems to be a great first example of a serious effort using this new option.

3 Likes

Reading through the thread I can relate most to this answer by @kevinushey, and in my opinion there is an additional level to the discussion. In our experience, if you want to introduce a language to interface with data and analytical methods in an organization, R is a far easier and more productive road to travel than Python. Not because it is a better language, but because it is a better fit. I will try to explain why I think so.

From an innovation perspective, grass roots initiatives with data innovation (nowadays) tend to start with R, or with people transitioning from spreadsheets to R. What we have seen in many cases is that a good professional, say in risk analysis or marketing, can solve a problem or optimize a solution programmatically with R, with just the internet, grit and support by helpful strangers in the R community. These people can add tremendous unexpected value to an organization, and are often the starting point for a data science / data analytics group.

But the other way around seldom occurs, where a good developer in the IT department takes on a domain-specific problem by her/himself (with only the internet and grit for guidance), and solves it without the input of a person with domain knowledge. There is no blame/better/worse here, it is just not the way things tend to work.

From a change management perspective, even in situations and organizations with teams that learn quickly, R tends to be the easier way to introduce a data language to democratize data innovation.

Whenever we have tried to teach Python to people without a technical background (one that included at least some programming), it was usually difficult. The level of abstraction is high, and it does not click with day-to-day problems as easily as R does. It has often been much easier to make people feel empowered with their newly learned R skills, not because it is so much easier to learn R, but because it is relatively easy to point to a niche of R documentation and R users solving business problems similar to their own.

By the same token, we have tried to teach R to programmers (with Java and/or .NET in their backgrounds), and programmers tend to get frustrated quickly with what they perceive as quirks in the R language. They are less enthused with highly verbose documentation and examples, or worse, with having to deal with packages that were clearly written by non-technical people.

To my mind, language popularity indicators say very little about the value of a programming language. In a business context, where there is a need for data-driven innovation, a desire to experiment and a need to bring new methodologies to production, R can be a great fit. Especially if there are non-programmers willing to make the effort to learn and work with R.

Remember that there is hardly ever only one language in production in an organization. And one of the great strengths of R and the R ecosystem today is that it allows non-IT people to bring robust data products to production that can then be consumed by other systems, other languages, and other teams.

7 Likes

@FvD I actually agree with your comments below.

Whenever we have tried to teach Python to people without a technical background (one that included at least some programming), it was usually difficult. The level of abstraction is high, and it does not click with day-to-day problems as easily as R does.

But that's not the main point, because Python is the hype. I even asked friends.
Me: "Why are you learning Python?"
Him: "Because that's what everyone else is learning."

It's like: this year baggy jeans are in, skinny jeans are out.

If you use R or MATLAB, you will probably find Python not so great for data science. It is not a mature language. I'll list a few problems here.

Problem 1: Indexing starts at 0, and slicing past the end of a list does not raise an out-of-bounds error.

>>> a = [11, 22, 13, 54]
>>> a[1:100]  # slicing past the end does not raise an IndexError
[22, 13, 54]
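
To make the behaviour concrete, here is a minimal sketch, assuming plain Python lists, of how out-of-range slices and out-of-range single indices differ: slice bounds are silently clipped, while a single index raises `IndexError`.

```python
a = [11, 22, 13, 54]

# Slice bounds are clipped to the length of the list, so no error is raised.
clipped = a[1:100]
print(clipped)  # [22, 13, 54]

# A single out-of-range index, by contrast, raises IndexError.
try:
    a[100]
    raised = False
except IndexError:
    raised = True
print(raised)  # True
```

So the silent behaviour is specific to slices; code that needs a hard failure can check `len(a)` before slicing.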

Problem 2: In Python, you read CSV files using pandas. But pandas can't do matrix computation like R.

df = df.div(df.QT, axis='index')

You use the df.div() function. It's a mouthful and not intuitive.
For native matrix computation, you coerce the data frame to a NumPy array, but lose all attributes.

But Python IS very good at machine learning modeling. It's fast! R needs a comprehensive equivalent of scikit-learn.

One puzzling question: R is the no-brainer choice when it comes to statistics, and machine learning is mainly built on statistics, so how come Python became the more popular choice for ML?

I guess it is because there are more computer science guys than statistics guys. A language-specific reason could be that the R community did not try as hard as the Python community to make the language more performant (Numba, Cython, etc.), which is key to ML due to its computational intensity.

Also, R support for some ML libraries did not arrive as early as Python's.

Life's too short for language wars!

What do you want to do?

  • Do you want to perform iterative, exploratory data analysis and complicated data visualisations with reporting? Then R is your friend.
  • Do you want to write command-line software, e.g. huge pipelines of wrappers around other command-line software to be deployed on HPC systems? Then Python is your friend.
  • Do you want to implement complex, computationally expensive iterative algorithms? Then C/C++/Fortran is your friend.
  • Do you want to interact with the above-mentioned complex algorithms in C/C++/Fortran? Then R or Python is your friend.

In fact, I increasingly see both R and Python as APIs for interacting with much faster languages such as C/C++/Fortran. As for the whole R-is-slow thing: yes, it's slow if you are iterating millions of times, but so is Python. In either case, it's like going to the movies to watch a kids' movie and then complaining that it was immature and juvenile. Your expectations are off-target. If R/Python is too slow for your problem, then you have likely chosen the wrong tool for the task.
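
As a rough illustration of that point, the sketch below sums the same numbers twice in Python: once with an explicit interpreted loop and once with the built-in `sum`, which is implemented in C in the standard CPython interpreter. The size `n` and the helper name `loop_sum` are arbitrary choices for this example, and timings vary by machine, so none are quoted here.

```python
import time

def loop_sum(values):
    # Explicit Python-level loop: every iteration is interpreted.
    total = 0
    for v in values:
        total += v
    return total

n = 1_000_000  # arbitrary size, chosen only for illustration
values = list(range(n))

start = time.perf_counter()
slow = loop_sum(values)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = sum(values)  # built-in, implemented in C in CPython
builtin_seconds = time.perf_counter() - start

print(slow == fast)  # True
print(f"loop: {loop_seconds:.4f}s, built-in: {builtin_seconds:.4f}s")
```

Both calls give the same result; only the place where the iteration happens differs, which is the sense in which the scripting language acts as an interface to faster code.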

Choose your tool, whichever floats your boat, whichever appeals to your way of thinking and is ideal for the task you are to solve :+1:

And as always, hope it helps :slightly_smiling_face:

8 Likes