Identify Label Columns?

In the software I have been using for deep machine learning models, I was able to mark certain columns as label columns. This removed them from training but still supplied them alongside the outputs. For example, a column with "name" or "id".

What is the equivalent of this concept in RStudio?

This is a bit too unspecific for us to help. Could you supply us with a code example of what you are working on?

# Load the data
library(dummies)  # provides dummy.data.frame()
setwd("~/R/Projects/MyData")
df <- read.csv("MyData.csv", header = TRUE)
N <- nrow(df)
p <- which(colnames(df) == "Prediction")
# Columns 1-9 are labels; columns 10-35 are the predictors
X <- dummy.data.frame(df[, c(10:35)])
Y <- df[, p]

data <- cbind(X, Y)

## split data, training & testing, 80:20, AND convert dataframe to a matrix
set.seed(777)
Ind <- sample(N, floor(N * 0.8), replace = FALSE)
p <- ncol(data)  # Y is now the last column of `data`
Y_train <- data.matrix(data[Ind, p])
# `data` no longer contains the 9 label columns, so only Y is dropped here
X_train <- data.matrix(data[Ind, -p])

Y_test <- data.matrix(data[-Ind, p])
X_test <- data.matrix(data[-Ind, -p])

k <- ncol(X_train)

In this case, the first 9 columns are label columns. Thus, they are not involved in the training and testing.

I would like to include them with the output though. The case above might be more complex because there are 9 label columns, but below is a simple example of the output.

For example, instead of just 25.6, 22.3, 24.1, I would want to see Car Model (label) and MPG (prediction). That way, I can write a csv file with all the labels and the predictions that the model made.

recipes can accommodate ID or other non-analysis variables in the data set:

library(tidymodels)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels 0.0.2 ──
#> ✔ broom     0.5.1     ✔ purrr     0.3.0
#> ✔ dials     0.0.2     ✔ recipes   0.1.4
#> ✔ dplyr     0.7.8     ✔ rsample   0.0.4
#> ✔ ggplot2   3.1.0     ✔ tibble    2.0.1
#> ✔ infer     0.4.0     ✔ yardstick 0.0.2
#> ✔ parsnip   0.0.1
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
library(stringr)
#> 
#> Attaching package: 'stringr'
#> The following object is masked from 'package:recipes':
#> 
#>     fixed

url <- "https://github.com/topepo/cars/raw/master/2018_12_02_city/car_data_splits.RData"
temp_save <- tempfile()
download.file(url, destfile = temp_save)
load(temp_save)

str(car_train)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    3596 obs. of  30 variables:
#>  $ cylinders                     : num  4 4 4 4 6 6 6 6 6 6 ...
#>  $ eng_displ                     : num  2 2.4 2.4 2.4 3.5 3.5 3.5 3.5 3.5 3.5 ...
#>  $ drive                         : Factor w/ 6 levels "AllWheel_Drive",..: 4 4 4 4 4 4 4 1 1 1 ...
#>  $ fuel_type                     : Factor w/ 13 levels "CNG","Diesel",..: 7 7 7 7 7 7 7 7 7 7 ...
#>  $ hatch_lug_vol                 : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ hatch_pas_vol                 : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ two_door_lug_vol              : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ four_door_lug_vol             : int  12 12 12 12 0 0 0 0 0 0 ...
#>  $ make                          : Factor w/ 51 levels "Acura","Alfa_Romeo",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ model                         : Factor w/ 1454 levels "124 Spider","1500 2WD",..: 741 741 741 741 864 864 864 865 865 865 ...
#>  $ two_door_pass_vol             : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ four_door_pass_vol            : int  89 89 89 89 0 0 0 0 0 0 ...
#>  $ transmission                  : Factor w/ 31 levels "Automatic_10spd",..: 23 12 15 30 24 27 27 24 27 27 ...
#>  $ car_class                     : Factor w/ 22 levels "Compact_Cars",..: 1 1 1 1 10 10 10 11 11 11 ...
#>  $ year                          : int  2015 2016 2017 2015 2015 2016 2016 2015 2016 2016 ...
#>  $ start_stop                    : num  0 0 0 0 0 0 1 0 0 1 ...
#>  $ spark_ignited_direct_injection: num  0 1 1 0 1 1 1 1 1 1 ...
#>  $ flexible_fuel                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ plug_in_hybrid                : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ hybrid                        : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ port_fuel_injection           : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ hellcat                       : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ ieloop                        : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ partial_zero_emissions        : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ eco                           : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ z06                           : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ zr1                           : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ turbo_charged                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ super_charged                 : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ mpg                           : num  24.2 25.1 25.1 21.8 19.8 ...

# `model` is an ID variable

basic_rec <- 
  recipe(mpg ~ ., data = car_train) %>% 
  # mpg is the outcome and all others are predictors
  # now redefine `model` as the ID variable
  update_role(model, new_role = "ID") %>%
  # Make model a combo of model and year
  step_mutate(model = str_c(year, model, sep = " ")) %>% 
  # Do other stuff like preprocessing.
  # Create dummy variables but not for `model` column
  step_dummy(all_predictors(), -all_numeric()) %>% 
  prep(car_train)

# `model` not a predictor
juice(basic_rec, all_predictors()) %>% 
  names() %>% 
  str_detect("model") %>% 
  any()
#> [1] FALSE

juice(basic_rec, has_role("ID"))
#> # A tibble: 3,596 x 1
#>    model       
#>    <fct>       
#>  1 2015 ILX    
#>  2 2016 ILX    
#>  3 2017 ILX    
#>  4 2015 ILX    
#>  5 2015 MDX 2WD
#>  6 2016 MDX 2WD
#>  7 2016 MDX 2WD
#>  8 2015 MDX 4WD
#>  9 2016 MDX 4WD
#> 10 2016 MDX 4WD
#> # … with 3,586 more rows

Created on 2019-02-24 by the reprex package (v0.2.1)
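
For the simpler Car Model and MPG case described in the question, the same idea can be sketched in base R without recipes: leave the label column out of the model formula, then bind it back onto the predictions afterwards. This is only an illustration on the built-in `mtcars` data. The column names `car_model` and `mpg_pred`, the choice of `lm()` as the model, and the temporary output file are assumptions, not part of the original post.

```r
# Built-in data; the row names hold the car model, which acts as the label
cars <- data.frame(car_model = rownames(mtcars), mtcars, row.names = NULL)

# Fit on predictors only; `car_model` is never passed to the model
fit <- lm(mpg ~ wt + hp + cyl, data = cars)

# Bind the label back onto the predictions, row for row
results <- data.frame(
  car_model = cars$car_model,
  mpg_pred  = unname(predict(fit, newdata = cars))
)

head(results, 3)

# Write labels and predictions together (a temporary file is used here)
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)
```

Because `predict()` returns one value per row of `newdata` in the same order, the label column lines up with the predictions without any join.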

This is extremely confusing.

All I want is for the columns not used in training (because they have no predictive value) to show up with my prediction results.

So if I train with 26 variables and have 9 columns not being used, I would like those 9 columns, which are unique identifiers of the data, to be tied to the prediction value.
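
One way to get this with a matrix workflow like the one in the question is to reuse the same row index (`Ind`) on the original data frame: the label columns are taken from `df[-Ind, ]` and bound to the test predictions. A minimal sketch on made-up data follows. The three label columns stand in for the nine, and the data, the column names, and the use of `lm.fit()` as the model are all assumptions for illustration.

```r
set.seed(777)
N <- 50

# Toy data: label columns first, then predictors, then the outcome
df <- data.frame(
  id     = seq_len(N),
  name   = paste0("item_", seq_len(N)),
  region = sample(c("north", "south"), N, replace = TRUE),
  x1     = rnorm(N),
  x2     = rnorm(N)
)
df$Prediction <- 3 + 2 * df$x1 - df$x2 + rnorm(N, sd = 0.1)

label_cols <- c("id", "name", "region")  # kept out of training
pred_cols  <- c("x1", "x2")              # used for training

# 80:20 split by row index
Ind <- sample(N, floor(N * 0.8), replace = FALSE)
X_train <- data.matrix(df[Ind, pred_cols])
Y_train <- df[Ind, "Prediction"]
X_test  <- data.matrix(df[-Ind, pred_cols])

# Stand-in model: least squares on the predictor matrix plus an intercept
fit  <- lm.fit(cbind(1, X_train), Y_train)
pred <- as.numeric(cbind(1, X_test) %*% fit$coefficients)

# Same row index on both sides, so labels and predictions line up
results <- cbind(df[-Ind, label_cols], Predicted = pred)
head(results, 3)

write.csv(results, tempfile(fileext = ".csv"), row.names = FALSE)
```

The key point is that the labels never enter `X_train` or `X_test`; they are only looked up again by row index once the predictions exist.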
