Tidyverse Normalization Alternative To the Box-Cox Transform?


Given that the tidyverse is mainly motivated by exploratory data analysis, that the Box-Cox transform is similarly motivated, that “select” is about as indispensable a function as you can get in the tidyverse, that the MASS package is one of the most widely cited sources of the Box-Cox transform, if not the most, and that MASS has its own “select” that masks the tidyverse’s “select”…

…a tidyverse newbie such as myself is left to wonder whether he has missed a tidyverse EDA tool that performs the equivalent kind of data normalization (which is the primary purpose of the Box-Cox transform).


You don’t have to attach the MASS namespace with library() just to use one function. Just refer to it as MASS::boxcox(). The tidyverse, at least at the moment, provides little in the way of statistical algorithms, which is probably a good idea, since it keeps the tidyverse from becoming impossibly large.
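
To illustrate the point about not attaching MASS, here is a minimal sketch. The simulated data frame and the variable names (dat, bc, lambda_hat) are made up for the example, and it assumes both dplyr and MASS are installed. Because MASS is only reached through `MASS::`, `select()` still refers to the dplyr version.

```r
library(dplyr)  # only dplyr is attached; MASS is never attached

set.seed(3215)
# Simulated data (an assumption for illustration): a skewed, positive outcome
dat <- data.frame(x = runif(100))
dat$y <- exp(1 + dat$x + rnorm(100, sd = 0.3))

# Call boxcox() through its namespace; plotit = FALSE skips the plot and
# returns the lambda grid (x) and the profile log-likelihood (y)
bc <- MASS::boxcox(y ~ x, data = dat, plotit = FALSE)

# Grid value of lambda with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat

# select() is still dplyr's, since MASS was never attached
dat %>% select(x, y) %>% head()
```

Note that this estimates lambda for the outcome of a linear model, which is the original use of the transform.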


In context, the most likely way to apply a Box-Cox transformation in a tidy toolchain would be recipes::step_BoxCox. A bit of reading on how the recipes framework works will probably be necessary, though.

Prior to recipes, caret::preProcess with method = "BoxCox" filled a similar role.
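
For comparison, a rough sketch of the caret route, assuming caret is installed. The data frame is simulated for illustration, and caret's lambda estimate may not match the one from recipes exactly.

```r
set.seed(3215)
# Simulated, strictly positive data (an assumption for illustration)
dat <- data.frame(x = exp(rnorm(100)))

# Estimate a Box-Cox lambda for each numeric column
pp <- caret::preProcess(dat, method = "BoxCox")

# Apply the estimated transformation to the data
trans_dat <- predict(pp, newdata = dat)
head(trans_dat)

# Estimated lambda for column x
pp$bc$x$lambda
```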


Note that caret and recipes use the BC transform to modify the predictors, although it was originally created as a method for transforming the outcome in a linear regression.

If you’re interested in transforming the outcome, I don’t think that there is a tidy solution (yet). For the predictors (or any other variables in isolation), recipes is probably your best bet. Also, the Yeo-Johnson transformation is similar but allows for negative and zero values in the data too (it is also in recipes).
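
As a sketch of the Yeo-Johnson variant, assuming a reasonably recent version of recipes (one where bake() accepts new_data = NULL) and using simulated data that includes negative values:

```r
library(recipes)

set.seed(3215)
# Simulated, right-skewed data with some negative values (illustration only)
dat <- data.frame(x = exp(rnorm(100)) - 1)

yj_rec <- recipe(~ x, data = dat) %>%
  # Yeo-Johnson accepts zero and negative values, unlike Box-Cox
  step_YeoJohnson(x) %>%
  # estimate lambda from the training data
  prep(training = dat)

# Transformed training data
yj_dat <- bake(yj_rec, new_data = NULL)
head(yj_dat)

# lambda estimate
tidy(yj_rec, number = 1)
```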

Some example code:

> library(recipes)
> set.seed(3215)
> dat <- data.frame(x = exp(rnorm(100)))
> head(dat)
      x
1 1.156
2 0.326
3 0.584
4 1.025
5 1.736
6 0.840
> # create the recipe
> bc_rec <- recipe(~ x, data = dat) %>%
+   # add the transformation
+   step_BoxCox(x) %>%
+   # estimate lambda
+   prep(training = dat, retain = TRUE)
> # Now get the transformed value
> trans_dat <- juice(bc_rec)
> trans_dat
# A tibble: 100 x 1
         x
     <dbl>
 1  0.1451
 2 -1.0918
 3 -0.5308
 4  0.0245
 5  0.5588
 6 -0.1734
 7  1.2635
 8 -1.2698
 9  1.2440
10  0.3316
# ... with 90 more rows
> # lambda estimate:
> tidy(bc_rec, number = 1)
# A tibble: 1 x 2
  terms  value
  <chr>  <dbl>
1     x 0.0491
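
To see what the estimated lambda does, the Box-Cox formula can be applied by hand in base R. The helper box_cox() below is a hypothetical name used only for this sketch, not a function from recipes or MASS. It takes the first few printed values of x and the printed lambda from the output above, so the results should agree with the transformed values only roughly, given the rounding in the printed output.

```r
# Hypothetical helper: Box-Cox transform for strictly positive x
#   (x^lambda - 1) / lambda  when lambda != 0
#   log(x)                   when lambda == 0
box_cox <- function(x, lambda, tol = 1e-8) {
  stopifnot(all(x > 0))
  if (abs(lambda) < tol) log(x) else (x^lambda - 1) / lambda
}

# First three printed values of x and the printed lambda estimate
x <- c(1.156, 0.326, 0.584)
box_cox(x, lambda = 0.0491)
```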