Need R normalization for numeric and categorical variables

andresmanu · September 2, 2019, 2:32pm

I need to find some functions: { magic_function_1, magic_function_2 } or similar to achieve what is described below.

Supposing we have this helper function:

my.normalize = function(vec){
  if (is.numeric(vec)) {
    vec = (vec - min(vec)) / (max(vec) - min(vec))
  }
  return (vec)
}
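One caveat with the helper above: if a numeric column is constant, max(vec) - min(vec) is 0 and the result is NaN. A minimal sketch of a guarded variant follows; the name my.normalize.safe and the choice to map a constant column to 0 are assumptions for illustration, not part of the original post.

```r
# Hypothetical guarded variant of my.normalize (illustrative only).
my.normalize.safe = function(vec) {
  if (is.numeric(vec)) {
    lo = min(vec, na.rm = TRUE)
    hi = max(vec, na.rm = TRUE)
    if (hi == lo) {
      # Constant column: avoid 0/0 = NaN. Mapping to 0 is an arbitrary choice.
      vec = vec - lo
    } else {
      vec = (vec - lo) / (hi - lo)
    }
  }
  return(vec)
}

# Same values as the normalized score column shown in the second table.
my.normalize.safe(c(142, 89, 540, 38, 232, 142))

# A constant vector maps to zeros instead of NaN.
my.normalize.safe(c(7, 7, 7))
```

Non-numeric vectors are returned unchanged, as in the original helper.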

Initial dataset:

ds_1 = data.frame(
  score = c(142, 89, 540, 38, 232, 142),
  age = c(20, 18, 76, 54, 15, 22),
  points = c(4, 50, 100, 10, 9, 35),
  group = c("A", "B", "A", "A", "C", "B"),
  favoritedrink = c("Coke", "Water", "Water", "Wine", "Tea", "Coke"),
  type = c("1", "2", "3", "1", "2", "1")
)
ds_1

##   score age points group favoritedrink type
## 1   142  20      4     A          Coke    1
## 2    89  18     50     B         Water    2
## 3   540  76    100     A         Water    3
## 4    38  54     10     A          Wine    1
## 5   232  15      9     C           Tea    2
## 6   142  22     35     B          Coke    1

What I want to simulate:

library(dplyr) # provides mutate_if()
ds_2 = mutate_if(ds_1, is.numeric, my.normalize)
ds_3 = data.frame(model.matrix(~ score + age + points + group + favoritedrink + type, data = ds_2))[, -1]
ds_3

##       score        age     points groupB groupC favoritedrinkTea
## 1 0.2071713 0.08196721 0.00000000      0      0                0
## 2 0.1015936 0.04918033 0.47916667      1      0                0
## 3 1.0000000 1.00000000 1.00000000      0      0                0
## 4 0.0000000 0.63934426 0.06250000      0      0                0
## 5 0.3864542 0.00000000 0.05208333      0      1                1
## 6 0.2071713 0.11475410 0.32291667      1      0                0
##   favoritedrinkWater favoritedrinkWine type2 type3
## 1                  0                 0     0     0
## 2                  1                 0     1     0
## 3                  1                 0     0     1
## 4                  0                 1     0     0
## 5                  0                 0     1     0
## 6                  0                 0     0     0

For example, I'm looking for some magic function, magic_function_1, to achieve the following:

ds_3 = magic_function_1(ds_1)
# where that magic function also saves the following config:
# ds_3.config = [saved config to convert future values with the same parameters]

Here ds_3 should be the same table as shown above, and ds_3.config is the configuration that made the transformation possible. This configuration could be used later to transform new data with the same scales and parameters. For example, the config could store the min/max values of the numeric variables, the possible values of the categorical variables, and so on.

Then ...

If, in the future, I have the following input:

input = ds_1[5,]
rownames(input) = NULL # just resetting the row indexes
input

which was in the initial table, then we should get the following:

out_1 = magic_function_2(input, ds_3.config)

all(out_1 == ds_3[5,]) == TRUE # in other words: out_1 should equal ds_3[5,], the corresponding row after normalization

Also, when using any other input that was not necessarily included in ds_1, for example:

input = data.frame(
  score = 100,
  age = 16,
  points = 73,
  group = "C",
  favoritedrink = "Water",
  type = "2"
)

when we call:

out_2 = magic_function_2(input, ds_3.config)

then in out_2 the numeric values should be scaled according to ds_3.config, and the categorical values should be transformed accordingly (as in the second table above).

On the other hand, if we pass some categorical value that was not in the original dataset ds_1, for example:

input = data.frame(
  score = 100,
  age = 16,
  points = 73,
  group = "C",
  favoritedrink = "Rum",
  type = "2"
)

when we call:

out_3 = magic_function_2(input, ds_3.config)

then we should get an error, because Rum was not in the initial dataset.
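To make the requested behaviour concrete, here is a minimal base-R sketch of the two functions. The names fit_config and apply_config are hypothetical stand-ins for magic_function_1 and magic_function_2, and the config layout (a list holding per-column ranges and levels) is just one possible choice.

```r
ds_1 <- data.frame(
  score = c(142, 89, 540, 38, 232, 142),
  age = c(20, 18, 76, 54, 15, 22),
  points = c(4, 50, 100, 10, 9, 35),
  group = c("A", "B", "A", "A", "C", "B"),
  favoritedrink = c("Coke", "Water", "Water", "Wine", "Tea", "Coke"),
  type = c("1", "2", "3", "1", "2", "1")
)

# Hypothetical helper: learn min/max and observed levels from the training data.
fit_config <- function(df) {
  num <- names(df)[vapply(df, is.numeric, logical(1))]
  nom <- setdiff(names(df), num)
  list(
    ranges = lapply(df[num], range),
    levels = lapply(df[nom], function(x) sort(unique(as.character(x))))
  )
}

# Hypothetical helper: apply a stored config to new rows.
apply_config <- function(df, config) {
  out <- df
  for (v in names(config$ranges)) {
    r <- config$ranges[[v]]
    out[[v]] <- (df[[v]] - r[1]) / (r[2] - r[1])
  }
  for (v in names(config$levels)) {
    x <- as.character(df[[v]])
    unseen <- setdiff(x, config$levels[[v]])
    if (length(unseen) > 0) {
      stop("Unseen value(s) in '", v, "': ", paste(unseen, collapse = ", "))
    }
    # Fixing the levels keeps the dummy columns identical for any input.
    out[[v]] <- factor(x, levels = config$levels[[v]])
  }
  data.frame(model.matrix(~ ., data = out))[, -1, drop = FALSE]
}

config <- fit_config(ds_1)
ds_3 <- apply_config(ds_1, config)

# A row from the training data reproduces the matching row of ds_3.
out_1 <- apply_config(ds_1[5, ], config)

# A new row is scaled with the stored min/max values.
input <- data.frame(score = 100, age = 16, points = 73,
                    group = "C", favoritedrink = "Water", type = "2")
out_2 <- apply_config(input, config)

# An unseen category stops with an error.
rum <- input
rum$favoritedrink <- "Rum"
# apply_config(rum, config)
```

Because the config is an ordinary list, it could be stored with saveRDS() and reloaded later with readRDS().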

Max · September 6, 2019, 11:18pm

Easy peasy:

library(tidymodels)
#> ── Attaching packages ──────────────────────────────────────────────────────────────── tidymodels 0.0.2 ──
#> ✔ broom     0.5.2       ✔ purrr     0.3.2  
#> ✔ dials     0.0.2       ✔ recipes   0.1.6  
#> ✔ dplyr     0.8.3       ✔ rsample   0.0.5  
#> ✔ ggplot2   3.2.1       ✔ tibble    2.1.3  
#> ✔ infer     0.4.0.1     ✔ yardstick 0.0.4  
#> ✔ parsnip   0.0.3.1
#> ── Conflicts ─────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
options(width = 120)

ds_1 = data.frame(
  score = c(142, 89, 540, 38, 232, 142),
  age = c(20, 18, 76, 54, 15, 22),
  points = c(4, 50, 100, 10, 9, 35),
  group = c("A", "B", "A", "A", "C", "B"),
  favoritedrink = c("Coke", "Water", "Water", "Wine", "Tea", "Coke"),
  type = c("1", "2", "3", "1", "2", "1")
)

rec <- 
  recipe(~ score + age + points + group + favoritedrink + type, data = ds_1) %>% 
  step_range(all_numeric()) %>% 
  step_dummy(all_nominal()) %>% 
  prep()

juice(rec)
#> # A tibble: 6 x 10
#>   score    age points group_B group_C favoritedrink_Tea favoritedrink_Water favoritedrink_Wine type_X2 type_X3
#>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>             <dbl>               <dbl>              <dbl>   <dbl>   <dbl>
#> 1 0.207 0.0820 0            0       0                 0                   0                  0       0       0
#> 2 0.102 0.0492 0.479        1       0                 0                   1                  0       1       0
#> 3 1     1      1            0       0                 0                   1                  0       0       1
#> 4 0     0.639  0.0625       0       0                 0                   0                  1       0       0
#> 5 0.386 0      0.0521       0       1                 1                   0                  0       1       0
#> 6 0.207 0.115  0.323        1       0                 0                   0                  0       0       0
  
input = data.frame(
  score = 100,
  age = 16,
  points = 73,
  group = "C",
  favoritedrink = "Water",
  type = "2"
)

bake(rec, input)
#> # A tibble: 1 x 10
#>   score    age points group_B group_C favoritedrink_Tea favoritedrink_Water favoritedrink_Wine type_X2 type_X3
#>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>             <dbl>               <dbl>              <dbl>   <dbl>   <dbl>
#> 1 0.124 0.0164  0.719       0       1                 0                   1                  0       1       0
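The question also asks for an error when a value such as "Rum" was never seen in the training data. As far as I know, step_dummy() reports unseen levels with a warning and fills the dummy columns with NA rather than stopping, so treat that as an assumption to check against your recipes version. A minimal sketch of a stricter wrapper follows; bake_strict is a hypothetical helper, not part of recipes.

```r
library(recipes)

ds_1 <- data.frame(
  score = c(142, 89, 540, 38, 232, 142),
  age = c(20, 18, 76, 54, 15, 22),
  points = c(4, 50, 100, 10, 9, 35),
  group = c("A", "B", "A", "A", "C", "B"),
  favoritedrink = c("Coke", "Water", "Water", "Wine", "Tea", "Coke"),
  type = c("1", "2", "3", "1", "2", "1")
)

rec <- recipe(~ score + age + points + group + favoritedrink + type, data = ds_1)
rec <- step_range(rec, all_numeric())
rec <- step_dummy(rec, all_nominal())
rec <- prep(rec)

# Hypothetical helper: stop on categories absent from the training data,
# then delegate to bake().
bake_strict <- function(rec, new_data, training) {
  nominal <- names(training)[!vapply(training, is.numeric, logical(1))]
  for (v in nominal) {
    unseen <- setdiff(as.character(new_data[[v]]), as.character(training[[v]]))
    if (length(unseen) > 0) {
      stop("Unseen value(s) in '", v, "': ", paste(unseen, collapse = ", "))
    }
  }
  bake(rec, new_data = new_data)
}

input <- data.frame(score = 100, age = 16, points = 73,
                    group = "C", favoritedrink = "Water", type = "2")
bake_strict(rec, input, ds_1)

rum <- input
rum$favoritedrink <- "Rum"
# bake_strict(rec, rum, ds_1) stops with an error naming favoritedrink
```

The check compares against the original training data, so it does not rely on any internals of the prepped recipe.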

Created on 2019-09-06 by the reprex package (v0.2.1)
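The prepped recipe plays the role of the ds_3.config object from the question: it holds the learned ranges and factor levels. Since it is an ordinary R object, one way to reuse it in a later session is saveRDS() and readRDS(). A minimal sketch, with a temporary file path as a placeholder:

```r
library(recipes)

ds_1 <- data.frame(
  score = c(142, 89, 540, 38, 232, 142),
  age = c(20, 18, 76, 54, 15, 22),
  points = c(4, 50, 100, 10, 9, 35),
  group = c("A", "B", "A", "A", "C", "B"),
  favoritedrink = c("Coke", "Water", "Water", "Wine", "Tea", "Coke"),
  type = c("1", "2", "3", "1", "2", "1")
)

rec <- recipe(~ score + age + points + group + favoritedrink + type, data = ds_1)
rec <- step_range(rec, all_numeric())
rec <- step_dummy(rec, all_nominal())
rec <- prep(rec)

# Store the prepped recipe; the path is a placeholder for a real location.
path <- tempfile(fileext = ".rds")
saveRDS(rec, path)

# Later (possibly in another session): reload it and transform new data.
rec_restored <- readRDS(path)
input <- data.frame(score = 100, age = 16, points = 73,
                    group = "C", favoritedrink = "Water", type = "2")
bake(rec_restored, new_data = input)
```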
