Why is base R's cut() output formatted the way it is?

cut

#1

Hi there, does anyone know why base R's cut() function formats output bins in the (] convention, as opposed to, say, () or []? Is there a way to change how the output is formatted?

example:

library(dplyr)

set.seed(1234)
d <- data.frame(
  int_var = sample.int(50, 50)
)

d %>%
  mutate(
    group_var = cut(int_var, breaks = c(0, 10, 20, 30, 40, 50))
  )



#2

It's mathematical interval notation: ] means the endpoint is included in the interval, and ( means it is not :slightly_smiling_face:


#3

Thank you. Is there a way to reformat the output? E.g. in a different column with a mutate call?


#4

@Leon is correct! If you have a value that lands right on a cut point (e.g., precisely 10 in your example), you have to determine which bin it falls into. In your example, a value of 10 would fall into (0, 10], not (10, 20].
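A minimal sketch of that boundary behaviour, using a shortened set of breaks and a few made-up values chosen purely for illustration:

```r
# Under cut()'s default (right = TRUE), each bin is open on the left
# and closed on the right, so a value equal to a break goes into the
# bin that ends at that break.
breaks <- c(0, 10, 20)
x <- c(10, 10.5, 20)

bins <- cut(x, breaks = breaks)
as.character(bins)
# "(0,10]" "(10,20]" "(10,20]"

# A value equal to the lowest break is not in any bin by default (NA)
# unless include.lowest = TRUE is set.
lowest_default <- cut(0, breaks = breaks)
lowest_included <- cut(0, breaks = breaks, include.lowest = TRUE)
is.na(lowest_default)          # TRUE
as.character(lowest_included)  # "[0,10]"
```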

FWIW, you can use the right argument in cut() to determine whether intervals are closed (i.e., they include the endpoint) on the right (the higher value) or on the left. I don't think you can substitute the characters used in the notation without doing some further string processing, though. Converting the cut factor labels to strings and then using stringr functions inside mutate is probably the way to go.
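To make the effect of right concrete, here is a small sketch with illustrative breaks and values (not the data from the original question):

```r
breaks <- c(0, 10, 20)
x <- c(10, 20)

# Default, right = TRUE: bins are written (a,b]
right_closed <- cut(x, breaks = breaks)
as.character(right_closed)
# "(0,10]" "(10,20]"

# right = FALSE: bins are written [a,b), so 10 moves up a bin
# and 20 no longer belongs to any bin (it becomes NA).
left_closed <- cut(x, breaks = breaks, right = FALSE)
as.character(left_closed)
# "[10,20)" NA
```

Note that switching right changes which bin a boundary value lands in, not just how the label is printed.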

EDIT: I should mention that you can also pass a labels argument to cut if you'd like to skip the automatically produced labels. You do lose the advantage of having nicely formatted automatic labels, though, so that may be more or less attractive an option depending on your needs.
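As a rough illustration of the labels argument (the label names "low", "mid" and "high" are made up for this example):

```r
x <- c(5, 15, 25)
breaks <- c(0, 10, 20, 30)

# One label per bin, i.e. length(breaks) - 1 labels
named <- cut(x, breaks = breaks, labels = c("low", "mid", "high"))
as.character(named)
# "low" "mid" "high"

# labels = FALSE returns plain integer bin codes instead of a factor
codes <- cut(x, breaks = breaks, labels = FALSE)
codes
# 1 2 3
```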


#5

Here is a heavy-handed approach to reformat the output (which also loses information contained in the original notation):

library(tidyverse)
set.seed(1234)
d <- data.frame(int_var = sample.int(50, 50))

d %>%
  mutate(group_var = cut(int_var, breaks = c(0, 10, 20, 30, 40, 50)),
         # strip the brackets, leaving e.g. "0,10"
         group_var_temp = gsub(pattern = "\\(|\\[|\\)|\\]", replacement = "", group_var)) %>%
  # split on the comma into lower and upper bounds
  separate(col = group_var_temp, into = c("lwr", "upr"), sep = ",") %>%
  mutate(group_var_new = paste(lwr, upr, sep = " - "))

There is likely a more concise way to write the regex in the gsub, but as a regex novice I find that pattern intuitive (i.e., escape each bracket type separately with \\ and combine them with |).
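One possibly more concise option is a single bracket expression (character class). The sketch below runs on a few hand-written labels rather than the data frame above, and uses base R only:

```r
labels <- c("(0,10]", "[10,20)", "(20,30]")

# Inside a bracket expression, a "]" placed first is taken literally,
# so "[][()]" matches any one of ] [ ( )
stripped <- gsub("[][()]", "", labels)
stripped
# "0,10" "10,20" "20,30"

# Same result as the alternation with separately escaped brackets
same <- identical(stripped, gsub("\\(|\\[|\\)|\\]", "", labels))

# Swap the comma for " - " to get the final label in one step
dashed <- sub(",", " - ", stripped, fixed = TRUE)
dashed
# "0 - 10" "10 - 20" "20 - 30"
```

This is just one alternative; the alternation version is arguably easier to read if you are new to regex.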


#6

Possibly more efficient than changing the labels after the fact is writing a small utility function to create labels in whatever format you prefer. For example:

library(dplyr)

set.seed(1234)
d <- data.frame(
  int_var = sample.int(50, 50)
)

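label_interval <- function(breaks) {
  # pair each break with the next one: drop the last break for the lower
  # bounds and the first break for the upper bounds
  paste0("(", head(breaks, -1), "-", tail(breaks, -1), ")")
}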
  
my_breaks <- c(0, 10, 20, 30, 40, 50)

d %>%
  mutate(
    group_var = cut(int_var, breaks = my_breaks, labels = label_interval(my_breaks))
  )
#>    int_var group_var
#> 1        6    (0-10)
#> 2       31   (30-40)
#> 3       30   (20-30)
#> 4       48   (40-50)
#> 5       40   (30-40)
#> 6       29   (20-30)
#> 7        1    (0-10)
#> 8       10    (0-10)
#> 9       28   (20-30)
#> 10      22   (20-30)
#> 11      42   (40-50)
#> 12      41   (40-50)
#> 13      11   (10-20)
#> 14      35   (30-40)
#> 15      38   (30-40)
#> 16      47   (40-50)
#> 17      43   (40-50)
#> 18       9    (0-10)
#> 19      50   (40-50)
#> 20       8    (0-10)
#> 21      34   (30-40)
#> 22      33   (30-40)
#> 23       5    (0-10)
#> 24       2    (0-10)
#> 25      32   (30-40)
#> 26      21   (20-30)
#> 27      13   (10-20)
#> 28      39   (30-40)
#> 29      19   (10-20)
#> 30      44   (40-50)
#> 31      37   (30-40)
#> 32      26   (20-30)
#> 33      23   (20-30)
#> 34      45   (40-50)
#> 35       3    (0-10)
#> 36      12   (10-20)
#> 37      16   (10-20)
#> 38       4    (0-10)
#> 39      15   (10-20)
#> 40      17   (10-20)
#> 41      18   (10-20)
#> 42      20   (10-20)
#> 43      14   (10-20)
#> 44      46   (40-50)
#> 45      27   (20-30)
#> 46      49   (40-50)
#> 47       7    (0-10)
#> 48      36   (30-40)
#> 49      25   (20-30)
#> 50      24   (20-30)

Created on 2018-08-07 by the reprex package (v0.2.0).