Useful R base functions

Hello everyone,

I'll teach an "advanced" R course for my doctoral school a month from now.
I want to dedicate at least one session (out of 9 or 10) to base R.
Obviously, a part will be about the different accessors.
Yet, I also want to present some useful R base functions. Note that I don't want to talk about functions like subset or transform, because we will use dplyr instead for these kinds of operations.
Still, I remember that when I first discovered the match function, I found it very useful, and I now use it very often. There are some other functions like this one that deserve to be known, for example sub, cut, split, outer.
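To illustrate why match() is so handy, here is a minimal sketch of a vectorised lookup. The vectors `codes`, `labels` and `x` are made-up illustration data, not something taken from the course material.

```r
# Hypothetical lookup table (illustrative data only).
codes  <- c("NO", "FR", "BE")
labels <- c("Norway", "France", "Belgium")

x <- c("BE", "NO", "XX", "BE")

# For each element of x, match() gives its position in codes (NA if absent).
idx <- match(x, codes)
idx            # 3 1 NA 3

# Indexing with those positions turns the table into a vectorised lookup.
looked_up <- labels[idx]
looked_up      # "Belgium" "Norway" NA "Belgium"
```

The unmatched code "XX" simply yields NA, which makes missing keys easy to spot afterwards with is.na().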
A complete list of R base functions is available there.

So, here is my question: do you know some other R base functions that are super useful and should therefore be mentioned in my upcoming course?

For a bit of inspiration you might wanna look in the vocabulary section of the old Advanced R edition http://adv-r.had.co.nz/Vocabulary.html

You can also find short descriptions next to the functions in @peterhurford and Robert's Solutions repository.

I think my favourite base R function is rle() (I typically use either the lengths or values element of its output). I don’t use it often, but when I do, I find it incredibly useful. And I sometimes use its output in cumsum(), another handy function. (All four cumulative functions (try apropos("^cum")) are very useful.)
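As a rough illustration of how rle() and cumsum() combine, here is a small sketch on a made-up 0/1 vector. The variable names are only for illustration.

```r
# Illustrative data: a sequence of 0/1 flags.
x <- c(1, 1, 1, 0, 0, 1, 1, 0)

r <- rle(x)
r$lengths   # 3 2 2 1  (length of each run)
r$values    # 1 0 1 0  (value repeated in each run)

# Length of the longest run of 1s.
longest_ones <- max(r$lengths[r$values == 1])   # 3

# cumsum() of the run lengths gives the index at which each run ends.
ends <- cumsum(r$lengths)                       # 3 5 7 8

# A run identifier for every element, handy as a grouping variable.
run_id <- rep(seq_along(r$lengths), times = r$lengths)
run_id      # 1 1 1 2 2 3 3 4

# inverse.rle() reconstructs the original vector.
identical(inverse.rle(r), x)                    # TRUE
```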

The Vectorize() function is useful, for example in combination with the outer() function that you mentioned. But note that it only does ‘surface vectorisation’; it doesn’t actually make the function fast, like ‘real’ vectorised functions tend to be.
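A minimal sketch of that combination, assuming a hypothetical scalar-only function `bigger()` invented for this example:

```r
# A scalar-only function: `if` needs a length-one condition.
bigger <- function(a, b) if (a > b) a else b

x <- 1:3
y <- c(2, 2, 2)

# outer(x, y, bigger) would fail, because outer() calls FUN once
# with whole vectors rather than once per pair of elements.

# Vectorize() wraps the function in mapply(), so it is called per pair.
vbigger <- Vectorize(bigger)
m <- outer(x, y, vbigger)
m
# rows correspond to x, columns to y:
#   2 2 2
#   2 2 2
#   3 3 3

# Same values as the genuinely vectorised pmax(), which avoids the
# element-by-element function calls.
isTRUE(all.equal(m, outer(x, y, pmax)))   # TRUE
```

When a truly vectorised equivalent such as pmax() exists, it is usually the better choice. Vectorize() is mainly a convenience for functions that cannot easily be rewritten.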

To find the position of duplicates, duplicated() (perhaps followed by which()) does the trick. And to remove duplicates, use unique() (this also works on data frames). To extract the elements in a vector A that are not part of a vector B, setdiff() is nice (the other set operators listed on ?setdiff are also useful, but I find that I use setdiff() the most).
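For example, on two small made-up vectors (a sketch only, with illustrative names):

```r
a <- c("x", "y", "x", "z", "y")
b <- c("y", "w")

dup_pos <- which(duplicated(a))   # positions of repeated values
dup_pos                           # 3 5

unique(a)                         # "x" "y" "z"

# Elements of a that are not in b. Note that the result is de-duplicated.
only_a <- setdiff(a, b)
only_a                            # "x" "z"

# To keep duplicates, filter with %in% instead.
a[!a %in% b]                      # "x" "x" "z"

# unique() also drops repeated rows of a data frame.
df <- data.frame(id = c(1, 1, 2), g = c("a", "a", "b"))
n_unique_rows <- nrow(unique(df))
n_unique_rows                     # 2
```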

When handling logical values, I frequently use the all() and any() functions.

I don’t really like cut(). One would think that if an element x in a vector gets the label (a, b] when you run cut() (with implicit breakpoints) on the vector, then x lies in the interval (a, b]. However, this is not necessarily true. See http://r.789695.n4.nabble.com/Binning-numbers-into-integer-valued-intervals-or-a-version-of-cut-or-cut2-that-makes-sense-td3692872.html for an explanation, with my alternative cut3() function, which does work the way one would expect.

I also dislike split() somewhat. It reorders the groups alphabetically instead of keeping the original order. That’s a big source of bugs if you’re not aware of this ‘feature’.
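A small sketch of the reordering, and of one possible workaround: converting the grouping vector to a factor whose levels follow the order of first appearance. The vectors `g` and `v` are made up for illustration.

```r
g <- c("b", "a", "b", "c", "a")   # grouping variable
v <- 1:5                          # values to split

# split() converts g to a factor, whose levels are sorted.
sorted_names <- names(split(v, g))
sorted_names                      # "a" "b" "c"

# Fixing the level order explicitly keeps the order of first appearance.
g_ordered <- factor(g, levels = unique(g))
kept_names <- names(split(v, g_ordered))
kept_names                        # "b" "a" "c"
```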

If you teach an advanced course on R, what do you mean? An advanced data mangling course, or an advanced programming course? Because when giving an advanced programming course, I expect my students to know the most common base functions already, so I can teach them about:

  • cut()
  • browser()
  • mapply() (assuming they know what lapply() and sapply() do)
  • do.call()
  • grep(), sub(), gsub(), grepl(), strsplit(), substring(), and other text manipulation
  • outer() (one of the most underestimated functions in R)
  • with() and within() for convenience, with a warning as they're not supposed to be used inside functions.
  • match() and %in% and how they differ
  • table(). Yes, people forget far too often about table()
  • stop(), warning(), message() and why to use them.
  • match.arg()
  • crossprod(), tcrossprod(), svd(), eigen(), qr()
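A few of the items above can be sketched in a handful of lines. The function `centre()` and all data below are hypothetical, made up purely for illustration.

```r
# match() returns positions, %in% returns TRUE/FALSE.
pos <- match(c("b", "z"), c("a", "b"))      # 2 NA
found <- c("b", "z") %in% c("a", "b")       # TRUE FALSE

# match.arg() validates a choice against the default vector and
# allows partial matching.
centre <- function(x, type = c("mean", "median")) {
  type <- match.arg(type)
  switch(type, mean = mean(x), median = median(x))
}
centre(c(1, 2, 10))          # default "mean": 4.333...
centre(c(1, 2, 10), "med")   # partial match to "median": 2

# do.call() calls a function with its arguments supplied as a list.
parts <- list(1:2, 3:4, 5:6)
stacked <- do.call(rbind, parts)            # 3 x 2 matrix

# mapply() iterates over several vectors in parallel.
reps <- mapply(function(n, ch) paste(rep(ch, n), collapse = ""),
               1:3, c("a", "b", "c"))
reps                                        # "a" "bb" "ccc"
```

Calling centre() with a value that matches neither choice, for example "mode", makes match.arg() stop with an error, which is usually what you want from argument checking.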

I focus heavily on vectorization in calculations, recycling and basic scoping in functions. I also focus heavily on object types and classes and why they matter. I dive into the details of handling different types of special values (Inf, NA, NaN, NULL). I look into preallocation using integer(n), character(n), logical(n) and so forth.
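The differences between the special values can be shown with a few one-liners. This is only a sketch of the kind of thing one might demonstrate, not a complete treatment.

```r
# NaN counts as missing for is.na(), but NA is not NaN.
is.na(NaN)                      # TRUE
is.nan(NA)                      # FALSE

# Only ordinary numbers are finite.
fin <- is.finite(c(1, Inf, NA, NaN))
fin                             # TRUE FALSE FALSE FALSE

# NULL has length 0 and vanishes when combined; NA has length 1.
length(NULL)                    # 0
length(NA)                      # 1
combined <- c(1, NULL, 2)
combined                        # 1 2

# NA propagates through comparisons, but logical operators can still
# return a definite answer when the result does not depend on the NA.
NA > 1                          # NA
NA & FALSE                      # FALSE
NA | TRUE                       # TRUE
```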

On a sidenote: the functions shown in the answer of @Tazinho I actually cover mostly in my introduction classes (apart from get, assign and the global assignment).

subset and transform are functions I advise my students NOT to use for obvious reasons.

Along with preallocation (integer(n), numeric(n), etc) it is useful to introduce seq_along and seq_len.
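A short sketch of why these two are worth teaching alongside preallocation (variable names are illustrative):

```r
# With an empty vector, 1:length(x) counts down from 1 to 0,
# so a loop over it would run twice.
x <- numeric(0)
bad_index <- 1:length(x)
bad_index                  # 1 0

# seq_along() returns an empty index, so the loop body never runs.
good_index <- seq_along(x)
length(good_index)         # 0

# Preallocate the result, then fill it by position with seq_len().
n <- 4
out <- numeric(n)
for (i in seq_len(n)) {
  out[i] <- i^2
}
out                        # 1 4 9 16
```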

@huftis Please check the arguments breaks and labels of the function cut. You need those to harness the full power of it. The arguments right and include.lowest give you further power to do what you need:

x <- rnorm(100)
str(cut(x,
        breaks = c(-Inf,0,Inf),
        labels = c("Below 0", "0 or more"),
        right = FALSE))

I see your point about the default labeling. But if you check the code, you see that dig.lab gives you the number of significant digits, not the number of digits after the decimal. And the problem with the code in that post is that the labels are wrong, not the binning :)

cut(c(20.8, 21.3, 21.7, 23, 25), 2, dig.lab=2) 
#> [1] (21,23] (21,23] (21,23] (23,25] (23,25]
#> Levels: (21,23] (23,25]
cut(c(20.8, 21.3, 21.7, 23, 25), 2, dig.lab=3) 
#> [1] (20.8,22.9] (20.8,22.9] (20.8,22.9] (22.9,25]   (22.9,25]  
#> Levels: (20.8,22.9] (22.9,25]

So the first border is actually 20.8, not 21.

@jorismeys Yes, I know about the breaks and labels arguments. And it is exactly when you don’t use these arguments, i.e. when you let the function itself determine the break points and the labels, that you get surprising (and arguably wrong) output.

If the labels don’t agree with the binning, it’s really arbitrary which of them are incorrect. But when, as in your example, the number 23 is put in the bin labelled (23, 25] instead of in the bin labelled (21,23], it’s difficult not to see this as rather strange behaviour of the function.

@huftis No, the number 23 is binned in (22.9, 25]. Just because the label says 23 when you round to 1 significant digit doesn't mean the break is "exactly 23".

That is not wrong output. That is a user who doesn't realize that 1 significant digit is not enough when you're interested in 3. I haven't heard anyone complain that rounding numbers in, e.g., tibble prints gives "an arguably wrong output" because they had 22.9 in the data and the settings they use show it as 23.

This is exactly the same. It looks confusing, I agree. But we're discussing an argument I have literally never used in my entire life. And I use cut() as a standard solution for creating categorical variables from continuous ones at least once a week.

EDIT: To make the point: this is actually using cut() without optional arguments. What is wrong with that output, other than the fact that splitting into equally sized intervals gives you break points one wouldn't use oneself?

cut(c(20.8, 21.3, 21.7, 23, 25), 2)
#> [1] (20.8,22.9] (20.8,22.9] (20.8,22.9] (22.9,25]   (22.9,25]  
#> Levels: (20.8,22.9] (22.9,25]

@jorismeys If this was only a display issue, this would be less of an issue. Rounding numbers for display purposes, e.g. when printing a tibble, is expected. (Compare this to the old summary() behaviour, which was a problem.)

But the actual bin break points are not stored anywhere in the output object. So the user has no idea that the internal bins are different from what it ‘says on the tin’. What the output of cut() tells you is basically:

– I have created two bins, (21, 23] and (23, 25].
– I have put the number 23 in the (23, 25] bin.

I’m not sure why you mentioned ‘1 significant digit’. Both the input number (23) and the numbers in the output ranges ((23, 25]) have two significant digits, and dig.lab = 2.

Code like this may easily be used for creating statistical tables to be published, e.g. when categorising age into integer-based bins. The resulting tables will be plainly wrong. Of course, typically you will manually choose the break points. But even here, cut() gets the placement for the number 23 wrong. Try setting breaks=c(20,23,25). No matter what value you set dig.lab to, 23 is put in the (23, 25] bin instead of in the (20, 23] bin in which it mathematically belongs.

Even though you are not, I guess many people might be inclined to believe that if R puts a number in a bin labelled (a, b], that number is indeed mathematically in that range. It’s of course OK to use the cut() function (and I use it myself) – as long as one is aware that this is not true, that ‘what you see might not be what you get’.

@huftis not on my computer.

cut(23, breaks = c(21,23,25))
#> [1] (21,23]
#> Levels: (21,23] (23,25]
cut(23, breaks = c(21,23,25), dig.lab = 1)
#> [1] (21,23]
#> Levels: (21,23] (23,25]

is correct. I agreed before that using dig.lab can give confusing output if you don't read the help page carefully:

It determines the number of digits used in formatting the break numbers.

The break is 22.9. Formatting it with 2 significant digits rounds it to 23. But the number 23 is still larger than 22.9, even when the break is formatted as 23.

Confusing? yes. Wrong? no. But I agree this should be explained when explaining the cut() function. I didn't realize so many people ran into trouble with it. I'll add it to my courses as well.

PS: I mentioned 1 digit as in the original post you linked to, they used dig.lab = 1.

@jorismeys Hm, now I too get the correct binning when manually choosing the bins. That’s reassuring. Not sure why I was thinking this didn’t work. Perhaps I just misread the output. Sorry for the noise.

I’ve always (i.e. after being surprised about it in my e-mail to the mailing list) understood how cut() actually works. I just didn’t like it! (And that’s why I created a cut3() function, where the labels are consistent with their contents.)

Basically, my model when I first read about and used cut() was:

cut() will use the various arguments (including dig.lab) to construct a set of non-overlapping sub-ranges (labels) which (completely) covers the input vector range. Each element in the vector gets the label corresponding to the range label in which it falls. So, e.g., 23 will get the (21,23] label in the example with integer-values labels.

Instead, the following happens.

cut() will use the various arguments (but excluding dig.lab) to construct a set of non-overlapping sub-ranges which (completely) covers the input vector range. Each element in the vector gets assigned to the sub-range in which it falls. Each sub-range is then formatted using dig.lab (which changes their meaning if one were to interpret the labels mathematically, but their actual content is not changed).
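This can be checked directly on the example vector used earlier in the thread: the bin each element lands in is the same whatever dig.lab is, and only the level labels change. The variable names `b2` and `b3` are just for illustration.

```r
x <- c(20.8, 21.3, 21.7, 23, 25)

b2 <- cut(x, 2, dig.lab = 2)
b3 <- cut(x, 2, dig.lab = 3)

# The underlying bin numbers are identical ...
as.integer(b2)                               # 1 1 1 2 2
identical(as.integer(b2), as.integer(b3))    # TRUE

# ... only the labels attached to those bins differ.
levels(b2)                                   # "(21,23]" "(23,25]"
levels(b3)                                   # "(20.8,22.9]" "(22.9,25]"
```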

So when one gets the labels (21,23], (23,25], one must be careful not to interpret these as mathematical ranges into which each element is assigned. Instead, one must think: cut() has generated two ranges – similar but not necessarily identical to (21,23], (23,25] – which are not part of the output (so I don’t really know what they are, but I can probably get a good guess by experimenting with increasing dig.lab), and has binned my input into these internal ranges. So I really shouldn’t be surprised that the number 23 is put in the (23,25] range instead of in the (21,23] range.
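One way to avoid guessing, sketched here under the assumption that you are happy to choose the break points yourself, is to keep the breaks in your own variable and pass them explicitly. The labels then agree with the binning, and the same breaks can be reused, for example with findInterval(). The break values in `brks` are an arbitrary illustrative choice.

```r
x <- c(20.8, 21.3, 21.7, 23, 25)

# Explicit, integer-valued breaks kept in a variable of our own.
brks <- c(20, 23, 26)

bins <- cut(x, breaks = brks)
levels(bins)                 # "(20,23]" "(23,26]"
as.integer(bins)             # 1 1 1 1 2  (23 falls in (20,23])

# findInterval() with left.open = TRUE uses the same (a, b] convention,
# so it reproduces the bin numbers from the stored breaks.
fi <- findInterval(x, brks, left.open = TRUE)
fi                           # 1 1 1 1 2
```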