Here are a few (olde school) ways, so pick your favorite.
set.seed(22)
L <- sample(0:5, 5)
# [1] 5 0 3 1 2
#### factor() lets you map multiple values to the same category
factor(
  L,
  levels = c(0, 1, 2, 3, 4, 5), # Use `0:5`, this is just for explanation
  labels = c(1, 1, 1, 2, 2, 2)
)
# [1] 2 1 2 1 1
# Levels: 1 2
#### ifelse() is fine for splitting values into 2 groups
ifelse(L < 3, 1, 2)
# [1] 2 1 2 1 1
#### Using the cut() function gives you a factor
#### But it's a little overkill for making just 2 groups
cut(L, breaks = c(-Inf, 2, 5), labels = 1:2)
# [1] 2 1 2 1 1
# Levels: 1 2
#### Use lapply() or your preferred dplyr function to replace multiple columns
trial <- data.frame(
  L = L,
  M = sample(0:5, 5),
  N = sample(0:5, 5),
  O = sample(0:5, 5),
  P = sample(0:5, 5)
)
trial
#   L M N O P
# 1 5 3 2 5 1
# 2 0 5 5 3 0
# 3 3 2 3 0 5
# 4 1 0 1 4 4
# 5 2 4 4 1 3
trial[] <- lapply(trial, factor, levels = 0:5, labels = c(1, 1, 1, 2, 2, 2))
trial
#   L M N O P
# 1 2 2 1 2 1
# 2 1 2 2 2 1
# 3 2 1 2 1 2
# 4 1 1 1 2 2
# 5 1 2 2 1 2
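If you'd rather go the dplyr route, something like the following should do the same job. This is a minimal sketch, assuming dplyr 1.0.0 or later is installed (for `across()`); the `demo` data frame and its values are made up for illustration so the snippet runs on its own.

```r
library(dplyr)

# Hypothetical two-column data frame, values chosen by hand for the example
demo <- data.frame(
  L = c(5, 0, 3, 1, 2),
  M = c(3, 5, 2, 0, 4)
)

# Apply the same factor() recoding to every column:
# 0, 1, 2 become "1" and 3, 4, 5 become "2"
recoded <- demo %>%
  mutate(across(
    everything(),
    ~ factor(.x, levels = 0:5, labels = c(1, 1, 1, 2, 2, 2))
  ))

recoded
```

Swap `everything()` for a selection such as `c(L, M)` if only some columns should be recoded.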
As a note, a factor is an integer vector with fancy labels and no inherent order. An ordered factor is a factor whose levels are ranked, so it is still a fancy-looking integer underneath. I like using factors for all categorical data to:
- Give them nice word labels to improve the code's readability (`status == "healthy"` is more intuitive than `status == 1`).
- Remind myself to never use them in numeric operations (R raises a warning if I try).
Choosing appropriate classes for variables, especially ones that restrict what you can do, can be a very useful thing. If you stick to factors, you're guaranteed to never silently end up with -1 for health status.
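To make the points about factors concrete, here is a small sketch. The `status` vector and its "sick"/"healthy" labels are hypothetical, made up just for this illustration.

```r
# Hypothetical health-status codes: 0 = sick, 1 = healthy
status <- factor(
  c(1, 0, 1, 1),
  levels = c(0, 1),
  labels = c("sick", "healthy")
)

typeof(status)   # the underlying storage type is "integer"
unclass(status)  # the integer codes, with the labels kept in a "levels" attribute

# Comparisons against the labels read naturally
is_healthy <- status == "healthy"

# Arithmetic is not meaningful for factors: R warns and returns NA
# instead of quietly producing numbers. Here the warning message is captured.
warn <- tryCatch(status - 1, warning = function(w) conditionMessage(w))
result <- suppressWarnings(status - 1)

# An ordered factor additionally supports comparisons such as < and >
severity <- factor(
  c("low", "high", "mid"),
  levels = c("low", "mid", "high"),
  ordered = TRUE
)
severity < "high"
```

Here `result` comes back as all `NA`, which is much harder to overlook than a plausible-looking number.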