Scaling only some columns of a training set and a test set

Andrea · February 8, 2018, 11:07am

Hi all,

I often have to deal with the following issue:

I have a test set and a training set
I want to scale only some columns of the training set
then, based on the sample means and sample standard deviations of the selected columns of the training set, I want to rescale the test set too

Currently, my workflow is kludgy: I use an index vector and then partial assignment to scale only some columns of the training set. I store the means and standard deviations from the scaling operation on the training set, and I use them to scale the test set. I was wondering if there is a simpler way. Here is my current workflow:

# define dummy train and test sets
train <- data.frame(letters = LETTERS[1:10], months = month.abb[1:10], numbers = 1:10,
                    x = rnorm(10, 1), y = runif(10))
test <- train
test$x <- rnorm(10, 1)
test$y <- runif(10)

# names of variables I don't want to scale
varnames <- c("letters", "months", "numbers")

# index vector of columns which must not be scaled
index <- names(train) %in% varnames

# scale only the columns not in index
temp <- scale(train[, !index])
train[, !index] <- temp

# get the means and standard deviations from temp, to scale test too
means <- attr(temp, "scaled:center")
standard_deviations <- attr(temp, "scaled:scale")

# scale test
test[, !index] <- scale(test[, !index], center = means, scale = standard_deviations)
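The same workflow can be wrapped in a small function. This is a minimal base R sketch, not something from the post: `scale_with_train` and its arguments are hypothetical names, and it assumes the columns to scale are numeric and passed by name.

```r
# Hypothetical helper: standardise `cols` in both data frames using the
# means and standard deviations estimated on `train` only.
scale_with_train <- function(train, test, cols) {
  mu    <- vapply(train[cols], mean, numeric(1))
  sigma <- vapply(train[cols], sd,   numeric(1))
  standardise <- function(v, m, s) (v - m) / s
  train[cols] <- Map(standardise, train[cols], mu, sigma)
  test[cols]  <- Map(standardise, test[cols],  mu, sigma)
  list(train = train, test = test, center = mu, scale = sigma)
}

# Usage on toy data (values are random, so no output is shown)
set.seed(1)
tr <- data.frame(id = letters[1:10], x = rnorm(10, 1), y = runif(10))
te <- data.frame(id = letters[1:10], x = rnorm(10, 1), y = runif(10))
out <- scale_with_train(tr, te, c("x", "y"))
```

Returning the means and standard deviations alongside the scaled data keeps them available for any later data set that has to go through the same transformation.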

Is there a simpler way to do this using the tidyverse?

danr · February 8, 2018, 5:05pm

Below is your code followed by a tidyverse implementation. Things like dplyr::select and dplyr::mutate make a lot of common table operations easier. scale is not part of the tidyverse, so there is some code to coerce its results into the tidyverse.

Also, it would be helpful in the future if you used a reprex for your code and showed the results that you got. Here is a link to the reprex help:

http://reprex.tidyverse.org/articles/reprex.html

suppressPackageStartupMessages(library(tidyverse))
# your code followed by tidyverse implementation
# define dummy train and test sets

# make sequences repeatable so tidyverse code output can be compared
# some things are changed in your code to guarantee that the same
# random sequences are used in the tidyverse example
set.seed(0)
n <- rnorm(10,1)
un <- runif(10)

train <- data.frame(letters = LETTERS[1:10],
                    months = month.abb[1:10], numbers = 1:10,
                    x = n, y = un)
test <- train
test$x <- rnorm(10, 1)
test$y <- runif(10)

# names of variables I don't want to scale
varnames <- c("letters", "months", "numbers")

# index vector of columns which must not be scaled
index <- names(train) %in% varnames

# scale only the columns not in index
temp <- scale(train[, !index])
train[, !index] <- temp

# get the means and standard deviations from temp, to scale test too
means <- attr(temp, "scaled:center")
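# NB: this reads "scaled:center" a second time, so test ends up divided by the
# training means; the standard deviations are in attr(temp, "scaled:scale").
# The output further down was produced with the line as written.
standard_deviations <- attr(temp, "scaled:center")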


# scale test
test[, !index] <- scale(test[, !index], center = means, scale = standard_deviations)
# tidyverse implementation
#
set.seed(0)
# make tibble not data.frame
n <- rnorm(10,1)
un <- runif(10)
train2 <- tibble(letters = LETTERS[1:10], months = month.abb[1:10], numbers = 1:10,
                 x = n, y = un)
n <- rnorm(10,1)
un <- runif(10)
# mutate train to make test2
test2 <- dplyr::mutate(train2, x = n, y = un)

# yank scale results into the tidyverse
# select used to drop columns
scaled <- scale(select(train2, -letters, -months, -numbers))

means2 <- attr(scaled, "scaled:center")
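# NB: as in the base R version above, this reads "scaled:center" again;
# attr(scaled, "scaled:scale") is what holds the standard deviations.
standard_deviations2 <- attr(scaled, "scaled:center")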

scaled <- as_tibble(scaled)

# mutate train2 with scaled results
train2 <- mutate(train2, x = scaled$x, y = scaled$y)


st <- as_tibble(scale(select(test2, -letters, -months, -numbers),
                      center = means2, scale = standard_deviations2))
# replace x and y with scaled results
test2 <- mutate(test2, x = st$x, y = st$y)

# test and test2 are not exactly the same
# structures, so identical() will not work:
# one has attributes and the other does not
#
# check that both have the same values
purrr::map2_dbl(test2$x, test$x, ~ .x - .y)
#>  [1] 0 0 0 0 0 0 0 0 0 0
purrr::map2_dbl(test2$y, test$y, ~ .x - .y)
#>  [1] 0 0 0 0 0 0 0 0 0 0


str(test)
#> 'data.frame':    10 obs. of  5 variables:
#>  $ letters: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
#>  $ months : Factor w/ 10 levels "Apr","Aug","Feb",..: 4 3 7 1 8 6 5 2 10 9
#>  $ numbers: int  1 2 3 4 5 6 7 8 9 10
#>  $ x      : num  -0.5669 -0.0785 -0.9205 0.0565 -1.1748 ...
#>  $ y      : num  0.4993 0.0337 0.864 -0.0518 -0.4702 ...
str(test2)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  5 variables:
#>  $ letters: chr  "A" "B" "C" "D" ...
#>  $ months : chr  "Jan" "Feb" "Mar" "Apr" ...
#>  $ numbers: int  1 2 3 4 5 6 7 8 9 10
#>  $ x      : atomic  -0.5669 -0.0785 -0.9205 0.0565 -1.1748 ...
#>   ..- attr(*, "scaled:center")= Named num  1.359 0.462
#>   .. ..- attr(*, "names")= chr  "x" "y"
#>   ..- attr(*, "scaled:scale")= Named num  1.359 0.462
#>   .. ..- attr(*, "names")= chr  "x" "y"
#>  $ y      : atomic  0.4993 0.0337 0.864 -0.0518 -0.4702 ...
#>   ..- attr(*, "scaled:center")= Named num  1.359 0.462
#>   .. ..- attr(*, "names")= chr  "x" "y"
#>   ..- attr(*, "scaled:scale")= Named num  1.359 0.462
#>   .. ..- attr(*, "names")= chr  "x" "y"
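
The attribute mismatch noted in the comments above can also be handled without an element-wise subtraction. A minimal sketch on toy vectors (not the thread's data), using only base R:

```r
a <- c(0.5, -1.2, 0.7)
b <- a
attr(b, "scaled:center") <- 1.359          # same values, one extra attribute

identical(a, b)                            # FALSE: the attributes differ
all.equal(a, b, check.attributes = FALSE)  # TRUE: values compared, attributes ignored
identical(a, as.numeric(b))                # TRUE: as.numeric() drops the attributes
```

`all.equal()` also tolerates tiny floating-point differences, which `identical()` does not.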

Andrea · February 9, 2018, 12:41pm

@danr thanks for the answer. My question included a complete test case, and you could easily run it in R. I don't think reprex would have made a significant difference here.

Your tidyverse solution is nice, but it's actually a bit more complicated than my base R solution, since you need to name each column in mutate. In my real use case I have a lot of columns to mutate, so it would be tedious to name each of them. However, I think that can be fixed with something like this:

train2 <- mutate_if(train2, !index, ~ as.numeric(scale(.x)))
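
To avoid naming each column while still reusing the training statistics on the test set, one option is `across()` with a vector of column names. This is a sketch under assumptions: it needs dplyr 1.0.0 or later (newer than this thread), and `skip`, `cols`, `mu`, `sigma`, and `rescale` are illustrative names.

```r
library(dplyr)

set.seed(1)
train2 <- tibble(letters = LETTERS[1:10], numbers = 1:10,
                 x = rnorm(10, 1), y = runif(10))
test2  <- mutate(train2, x = rnorm(10, 1), y = runif(10))

skip <- c("letters", "numbers")           # columns that must not be scaled
cols <- setdiff(names(train2), skip)      # everything else gets scaled

# training-set statistics, as named vectors indexed by column name
mu    <- vapply(train2[cols], mean, numeric(1))
sigma <- vapply(train2[cols], sd,   numeric(1))

# apply the training statistics to any data frame with the same columns
rescale <- function(df) {
  mutate(df, across(all_of(cols),
                    ~ (.x - mu[[cur_column()]]) / sigma[[cur_column()]]))
}

train2_scaled <- rescale(train2)
test2_scaled  <- rescale(test2)
```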

Thanks again,

Best Regards

Andrea

FlorianGD · February 11, 2018, 8:18pm

Hi,
I haven't used the recipes package but it seems like a good fit for what you're trying to do. You can specify selectors and then the pre-processing you want to apply.
You can have a look at some examples here:

https://topepo.github.io/recipes/articles/Simple_Example.html

Hope this helps

Florian

Andrea · February 11, 2018, 8:35pm

Sounds interesting! I'll have a look, thanks

Max · February 12, 2018, 12:06pm

Here is what that would look like:

> train <- data.frame(letters = LETTERS[1:10], months = month.abb[1:10], numbers = 1:10,
+                     x = rnorm(10, 1), y = runif(10))
> test <- train
> test$x <- rnorm(10, 1)
> test$y <- runif(10)
> 
> 
> library(recipes)
> 
> preproc <- recipe(~ ., data = train) %>%
+     step_center(-letters, -months, -numbers) %>%
+     step_scale(-letters, -months, -numbers)
> 
> # Estimate the values from the training set
> 
> preproc <- prep(preproc, training = train)
> 
> # Apply training set mean/sd to test set
> 
> bake(preproc, test)
# A tibble: 10 x 5
   letters months numbers      x      y
   <fctr>  <fctr>   <int>  <dbl>  <dbl>
 1 A       Jan          1 -2.71  -1.09 
 2 B       Feb          2 -3.77  -1.85 
 3 C       Mar          3 -1.31   1.10 
 4 D       Apr          4 -2.52   0.954
 5 E       May          5 -2.50  -1.59 
 6 F       Jun          6 -3.71   0.825
 7 G       Jul          7 -1.13  -0.985
 8 H       Aug          8 -1.06   0.249
 9 I       Sep          9 -0.982 -1.68 
10 J       Oct         10 -3.27  -0.841
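
A variation on the recipe above selects columns by type instead of excluding every non-numeric column by name, which may help when there are many columns. This is a sketch, not part of the original answer: it assumes the `all_numeric()` selector and the `tidy()` method for prepped recipes, both from the recipes package, and the object names are illustrative.

```r
library(recipes)

set.seed(1)
train <- data.frame(letters = LETTERS[1:10], months = month.abb[1:10],
                    numbers = 1:10, x = rnorm(10, 1), y = runif(10))
test <- transform(train, x = rnorm(10, 1), y = runif(10))

# select every numeric column except `numbers`
rec <- recipe(~ ., data = train) %>%
  step_center(all_numeric(), -numbers) %>%
  step_scale(all_numeric(), -numbers)

# estimate means and standard deviations on the training set
rec <- prep(rec, training = train)

# apply them to both sets
baked_train <- bake(rec, train)
baked_test  <- bake(rec, test)

# inspect the values estimated by the first step (the training means)
tidy(rec, number = 1)
```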

Andrea · February 12, 2018, 2:33pm

Precisely what I was looking for! Simpler & more readable than my base R solution. Did you develop recipes to generalize data processing pipelines in caret? It seems the perfect way.

Max · February 12, 2018, 3:24pm

"Did you develop recipes to generalize data processing pipelines in caret?"

It is the tidy modeling analog to caret::preProcess. It does much, much more than that function, though.

caret::train has an interface for it, so you can write a recipe and pass it to train.
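
For comparison, the caret::preProcess route mentioned above might look like the sketch below. The toy data and the `num_cols` name are assumptions made for illustration. `preProcess()` estimates the centering and scaling values from the training columns, and `predict()` applies them to new data.

```r
library(caret)

set.seed(1)
train <- data.frame(letters = LETTERS[1:10], numbers = 1:10,
                    x = rnorm(10, 1), y = runif(10))
test <- transform(train, x = rnorm(10, 1), y = runif(10))

num_cols <- c("x", "y")                    # columns to centre and scale

# estimate means and standard deviations on the training set only
pp <- preProcess(train[, num_cols], method = c("center", "scale"))

# apply the training statistics to both sets
train_scaled <- train
test_scaled  <- test
train_scaled[, num_cols] <- predict(pp, train[, num_cols])
test_scaled[, num_cols]  <- predict(pp, test[, num_cols])
```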

DavidK · October 12, 2018, 12:40pm

doesn't look right.

I think perhaps we actually want

(Pardon the trifling correction, just want to save the next reader a little time.)
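
Assuming the correction refers to the lines earlier in the thread that read the standard deviations from `attr(..., "scaled:center")`, a quick check of what `scale()` stores in each attribute (toy matrix, illustrative names):

```r
m <- scale(cbind(x = c(1, 2, 3, 4), y = c(10, 20, 30, 50)))

centers <- attr(m, "scaled:center")  # column means
sds     <- attr(m, "scaled:scale")   # column standard deviations

# the two attributes hold different quantities
all.equal(unname(centers), c(mean(c(1, 2, 3, 4)), mean(c(10, 20, 30, 50))))
all.equal(unname(sds),     c(sd(c(1, 2, 3, 4)),   sd(c(10, 20, 30, 50))))
```

Under that reading, the test set should be scaled with `scale = attr(temp, "scaled:scale")`, paired with `center = attr(temp, "scaled:center")`.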