Use mutate to add a large number of levels to a factor variable

Suppose I have a character variable that I want to convert to a factor with the mutate function.
The variable has hundreds of values, so adding levels manually isn't efficient. I want to be able to do: mutate(sleep_total_discr = factor(sleep_total_discr, levels = sleep_total_discr))

Here is a reprex that has only 4 levels, which is manageable by adding 'levels' manually.

library(dplyr)
library(ggplot2)  # provides the msleep dataset

msleep %>%
  select(name, sleep_total) %>%
  mutate(sleep_total_discr = case_when(
    sleep_total > 13 ~ "very long",
    sleep_total > 10 ~ "long",
    sleep_total > 7 ~ "limited",
    TRUE ~ "short")) %>%
  mutate(sleep_total_discr = factor(sleep_total_discr,
                                    levels = c("short", "limited", "long", "very long")))

Hello,

I might be missing something, but can't you just use as.factor:

msleep %>%
  select(name, sleep_total) %>%
  mutate(sleep_total_discr = case_when(
    sleep_total > 13 ~ "very long",
    sleep_total > 10 ~ "long",
    sleep_total > 7 ~ "limited",
    TRUE ~ "short")) %>%
  mutate(sleep_total_discr = as.factor(sleep_total_discr))

PJ

Do you want to create a factor column based on a numeric column? Are all the comparisons "greater than"? Do you have a list of the cutoffs and factor values? If so, you can probably write a small function to do the heavy lifting for you.

library(tibble)
library(dplyr)

msleep <- tribble(~name, ~sleep_total,
                  "Alice", 10,
                  "Bob"  , 20,
                  "Carol", 30)

categorize_sleep <- function(totals){
  list_of_thresholds <- c(53,51,50,25,15,12,9,7,3)
  list_of_categories <- c("exceedingly long","exceeds long",
                          "long","middling","medium-rare",
                          "jaunty","short","pretty short","quite brief")

  positions <- sapply(totals, function(x) min(which(x > list_of_thresholds)))
  factor(list_of_categories[positions], levels=list_of_categories)
}

msleep %>% mutate(categories = categorize_sleep(sleep_total))

# > msleep %>% mutate(categories = categorize_sleep(sleep_total))
# # A tibble: 3 x 3
# name  sleep_total categories 
# <chr>       <dbl> <fct>      
# 1 Alice          10 short      
# 2 Bob            20 medium-rare
# 3 Carol          30 middling 

If you're reading a table of thresholds and labels, you could pass them to your categorize_sleep function instead of hard-coding them.

I like this solution, but you will need to add droplevels() to clean up unused levels! And again, if your list_of_categories ran into the hundreds or even thousands, you would have to populate the levels vector manually, right? Or is there another way?

Thanks, PJ. Your suggestion works, but I'm not sure why we can't use "factor" with "levels" inside mutate. I even tried levels = msleep$sleep_total_discr.
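
One possible explanation, offered as a sketch rather than a diagnosis: factor() expects the levels vector to contain each value only once, and passing the whole column hands it duplicates. As far as I know, recent R versions raise an error for duplicated levels while older ones only warned. Wrapping the column in unique() avoids this. The sleep_data tibble below is hypothetical and only stands in for the real data.

```r
library(dplyr)

# Hypothetical data: a character column with repeated values.
sleep_data <- tibble(
  name = c("a", "b", "c", "d"),
  sleep_total_discr = c("long", "short", "long", "limited")
)

result <- sleep_data %>%
  mutate(sleep_total_discr = factor(
    sleep_total_discr,
    # unique() keeps one copy of each value, in order of first appearance
    levels = unique(sleep_total_discr)
  ))

levels(result$sleep_total_discr)
#> [1] "long"    "short"   "limited"
```

Note that the levels come out in order of first appearance, not in any meaningful order, so this only helps when the ordering of the levels does not matter.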

Yes. Change the categorize_sleep function to accept the lists as parameters and pass them in from wherever you're keeping them.

categorize_sleep <- function(totals, my_thresholds, my_categories){ ... }
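
A minimal sketch of what that could look like, assuming the thresholds are sorted from largest to smallest and that every total exceeds the smallest threshold. The sleep_lookup table and its values are hypothetical; in practice it might be read from a file.

```r
library(dplyr)

# Hypothetical lookup table of thresholds (descending) and labels.
sleep_lookup <- tribble(
  ~threshold, ~category,
  25, "middling",
  15, "medium-rare",
   9, "short",
   0, "quite brief"
)

categorize_sleep <- function(totals, my_thresholds, my_categories) {
  # For each total, take the position of the first threshold it exceeds.
  # Totals that exceed no threshold get NA.
  positions <- vapply(
    totals,
    function(x) which(x > my_thresholds)[1],
    integer(1)
  )
  factor(my_categories[positions], levels = my_categories)
}

sleep_data <- tibble(name = c("Alice", "Bob", "Carol"),
                     sleep_total = c(10, 20, 30))

result <- sleep_data %>%
  mutate(categories = categorize_sleep(sleep_total,
                                       sleep_lookup$threshold,
                                       sleep_lookup$category))
```

The levels are taken straight from the lookup table, so nothing has to be typed by hand however many categories there are.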

You should be able to use factor and levels with mutate. Can you please provide a reproducible example illustrating the problems you're having with your code?

But if your levels are unordered, you do not need to specify them. Using just factor or as.factor is enough. Compare z3 and z4, and check their levels against those of z2.

set.seed(seed = 47715)

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

fake_df <- tibble(x = seq.int(to = 30),
                  y = runif(n = 30,
                            min = -100,
                            max = 100))

fake_df_mod <- fake_df %>%
    mutate(z1 = case_when(y < -50 ~ "very low",
                          y < 0 ~ "low",
                          y < 50 ~ "high",
                          TRUE ~ "very high"),
           z2 = factor(x = z1,
                       levels = c("very low", "low", "high", "very high"),
                       ordered = TRUE),
           z3 = factor(x = z1),
           z4 = as.factor(x = z1))

str(object = fake_df_mod)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    30 obs. of  6 variables:
#>  $ x : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ y : num  86.806 13.028 -0.782 -17.397 99.254 ...
#>  $ z1: chr  "very high" "high" "low" "low" ...
#>  $ z2: Ord.factor w/ 4 levels "very low"<"low"<..: 4 3 2 2 4 1 1 4 1 3 ...
#>  $ z3: Factor w/ 4 levels "high","low","very high",..: 3 1 2 2 3 4 4 3 4 1 ...
#>  $ z4: Factor w/ 4 levels "high","low","very high",..: 3 1 2 2 3 4 4 3 4 1 ...

Created on 2019-12-20 by the reprex package (v0.3.0)

Another point: based on your code, cut seems a better option than case_when in this specific scenario.
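
As a rough sketch of the cut approach, using the cutoffs from the original question (7, 10 and 13) on a small hypothetical tibble instead of msleep: with the default right = TRUE each interval includes its upper bound, which matches the strict "greater than" comparisons in the case_when version.

```r
library(dplyr)

# Hypothetical stand-in for msleep with a few sleep totals.
sleep_data <- tibble(
  name = c("a", "b", "c", "d", "e"),
  sleep_total = c(5, 7, 8, 10.5, 14)
)

sleep_binned <- sleep_data %>%
  mutate(sleep_total_discr = cut(
    sleep_total,
    # Intervals are (-Inf, 7], (7, 10], (10, 13], (13, Inf]
    breaks = c(-Inf, 7, 10, 13, Inf),
    labels = c("short", "limited", "long", "very long")
  ))

levels(sleep_binned$sleep_total_discr)
#> [1] "short"     "limited"   "long"      "very long"
```

cut returns a factor whose levels follow the order of the breaks, so no separate factor() call is needed; pass ordered_result = TRUE if an ordered factor is wanted.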

library(dplyr)

mtcars %>%
  mutate(car_mode = row.names(.)) %>%
  select(car_mode, cyl) %>%
  as_tibble() %>%
  # levels goes inside factor(); the row names are unique, so they are valid levels
  mutate(car_mode = factor(car_mode, levels = car_mode))

Thank you all for responding. I learned something from all your points. It turned out to be a case of two different R versions giving two different results.
