Does a back-transformation function exist for data passed through the preProcess function from caret?
No. It does not. I don't think that will be developed either.
Dr. Kuhn -- thanks for the expeditious response. Obviously I can write a function to backtransform centering, scaling, Box-Cox, etc. My question is: does the preProcess function tell us the order in which the transformations were applied? That seems to me to be key in attempting to backtransform.
It is a fixed order (unlike recipes). The docs have:
The operations are applied in this order: zero-variance filter, near-zero variance filter, correlation filter, Box-Cox/Yeo-Johnson/exponential transformation, centering, scaling, range, imputation, PCA, ICA then spatial sign. This is a departure from versions of caret prior to version 4.76 (where imputation was done first) and is not backwards compatible if bagging was used for imputation.
(I couldn't remember the order)
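Given the fixed order quoted above, undoing Box-Cox, centering, and scaling by hand means reversing those three steps. Below is a minimal base-R sketch of that idea, not caret code. The helper names `box_cox` and `inv_box_cox`, the price values, and the lambda are all made up for illustration. It assumes the centering and scaling statistics are computed on the Box-Cox-transformed values.

```r
# Box-Cox and its inverse for strictly positive data (hypothetical helpers).
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

price  <- c(130880, 250000, 410000, 975000, 3500000)  # made-up values
lambda <- 0.25                                         # hypothetical lambda

# Forward: Box-Cox first, then center, then scale.
z  <- box_cox(price, lambda)
mu <- mean(z)   # statistics taken on the Box-Cox scale
s  <- sd(z)
transformed <- (z - mu) / s

# Reverse: undo scaling, undo centering, then invert Box-Cox.
recovered <- inv_box_cox(transformed * s + mu, lambda)

all.equal(recovered, price)
```

The reversal only works if lambda, the mean, and the standard deviation from the training data are kept; they cannot be re-estimated from new data.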
Thanks -- final question: do you have a recommendation on how we map predicted values in the transformed space back to the initial scale space? In other words, how do you handle this issue in your work?
I keep them transformed until I report the predicted values. I think that all of the metrics (and maybe plots) should be on the transformed scale.
I know I said 'final question', but I didn't quite get the answer I was seeking. Your response to this question can save me hours of frustration. You said, "I keep them transformed until I report the predicted values." I'm at the stage of reporting predicted values and they're on the transformed scale -- how do I efficiently backtransform my predicted values to the original scale?
Some code would be helpful to avoid confusion. Are you using preProcess() directly (and giving it the outcome column) or are you using it via the option to train()? I assume the former since the latter only affects the predictors.
Either way, we don't have code to reverse transform the data. The approach that you would take depends on what pre-processing was done.
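To see what pre-processing was actually estimated, one option is to inspect the fitted object. The sketch below assumes the object returned by preProcess() keeps its estimates in components named `method`, `bc`, `mean`, and `std`; that is my understanding of current caret versions, so check it with `str()` on your own object. The toy data frame and its column names are made up.

```r
# Runs only if caret is installed; the data are simulated for illustration.
if (requireNamespace("caret", quietly = TRUE)) {
  set.seed(1)
  toy <- data.frame(
    price       = exp(rnorm(50, mean = 12,  sd = 0.5)),
    square_feet = exp(rnorm(50, mean = 7.5, sd = 0.3))
  )
  pp <- caret::preProcess(toy, method = c("BoxCox", "center", "scale"))

  print(pp$method)            # columns each operation applies to
  print(pp$bc$price$lambda)   # Box-Cox lambda estimated for `price`
  print(pp$mean["price"])     # centering value for `price`
  print(pp$std["price"])      # scaling value for `price`
}
```

A lambda can come back as NA when caret decides not to transform a column, in which case there is no Box-Cox step to invert for that column.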
Yes, the former. I will give the primary code below. I am building a linear model predicting home prices from Redfin.com data (Y = price), and I pass my entire data frame (named 'bham') through to preProcess:
range(na.omit(bham$price))
## [130880, 3500000]  range of original outcome variable prior to preProcess

bham <- read.csv('redfin_2021-05-03-13-50-46.csv', header = TRUE)
trans <- preProcess(bham, method = c("center", "scale", "BoxCox", "spatialSign"))
transformed <- predict(trans, bham)
m4 <- lm((price) ~ poly(square.feet, 3) + poly(year.built, 2) + latitude + poly(longitude, 3), data = transformed)
P <- predict(m4, newdata = newdata, type = 'response')  ## predict outcome variable
range(P)
## [-912424226099, -487374100]  range of transformed outcome variable
So, this is where the trouble comes. I need to transform my predicted response back to original scale and I'm unsure of how to reverse the ordering from the order I passed to the method argument from preProcess above. preProcess doesn't seem to tell me which lambda was used for which variable. I also don't think a spatial-sign transformation needs a backtransform, correct?
I don't think that your usage is appropriate. For example, since you give it the entire data set, the spatial sign computation will have your predictor data being dependent on the outcome data (and vice-versa).
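The dependence described above can be shown with a small base-R sketch. The spatial sign divides each row by its Euclidean norm; `spatial_sign` here is a hypothetical hand-written helper, not caret's function, and the numbers are made up. It also shows why the spatial sign has no back-transformation: the row norm is discarded.

```r
# Spatial sign: project each row onto the unit sphere.
spatial_sign <- function(m) m / sqrt(rowSums(m^2))

# Columns: outcome (y) and one predictor (x).
a <- cbind(y = c(1, 2),  x = c(3, 4))
b <- cbind(y = c(1, 20), x = c(3, 4))  # only the outcome in row 2 differs

sa <- spatial_sign(a)
sb <- spatial_sign(b)

# The transformed predictor in row 2 changes although x itself did not.
sa[2, "x"]
sb[2, "x"]

# Rows that differ only by a positive scale factor map to the same point,
# so the original magnitudes cannot be recovered afterwards.
collapsed <- spatial_sign(rbind(c(1, 3), c(10, 30)))
collapsed
```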
Can you describe what you are trying to do? If you are just interested in the Box-Cox transform of the outcome, I suggest using the car package. An example is below, but take a look at this book for a longer discussion.
library(car)
#> Loading required package: carData
data(ames, package = "modeldata")
ames <- ames[, c("Sale_Price", "Year_Built", "Latitude", "Longitude")]
ames <- as.data.frame(ames)
with(
  ames,
  boxCox(Sale_Price ~ Year_Built + Latitude + Longitude, data = ames)
)
Created on 2021-05-07 by the reprex package
Yes, I believe I now understand my error. The preProcess() function should really only be used for feature (independent) variables in a predictive modeling project. Why? Because in the interpretability-predictability tradeoff we're ultimately less interested in the interpretability of feature variables, so there's no real reason to backtransform feature variables passed through preProcess(). I'm gathering from your response that if my response variable needs a transformation, then that transformation (Box-Cox, for instance) should be done independently of the feature variables. If the response is transformed with a simple Box-Cox function, there are functions available to backtransform the predicted values once they are obtained. There's no need to worry about backtransforming feature variables in a predictive analytics project, since our highest concern is predictive accuracy.

I recently purchased "Applied Predictive Modeling" and I'm doing a self-study of the book. I think my understanding is now correct.
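The workflow described in this last post (transform only the outcome, model it, then map predictions back) might look like the sketch below. It is one possible approach, not the method used in the thread: it swaps in MASS::boxcox() to pick lambda rather than the car function shown earlier, uses the built-in `trees` data as a stand-in for the housing data, and defines hypothetical helpers `box_cox` and `inv_box_cox`.

```r
library(MASS)  # provides boxcox()

# Linear model on the raw outcome, used only to profile lambda.
fit_raw <- lm(Volume ~ Girth + Height, data = trees)

# Profile log-likelihood over a grid of lambdas; take the maximiser.
bc     <- boxcox(fit_raw, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Hypothetical helpers for the transform and its inverse.
box_cox     <- function(y, l) if (abs(l) < 1e-8) log(y) else (y^l - 1) / l
inv_box_cox <- function(z, l) if (abs(l) < 1e-8) exp(z) else (l * z + 1)^(1 / l)

# Transform the outcome only; the predictors are left untouched.
trees_bc <- trees
trees_bc$Volume_bc <- box_cox(trees$Volume, lambda)
fit_bc <- lm(Volume_bc ~ Girth + Height, data = trees_bc)

# Predict on the transformed scale, then map back to original units.
pred_bc   <- predict(fit_bc, newdata = trees)
pred_orig <- inv_box_cox(pred_bc, lambda)

range(trees$Volume)
range(pred_orig)
```

Because the inverse is nonlinear, a back-transformed prediction behaves more like a median than a mean on the original scale. The inverse also returns NaN if `l * z + 1` is negative, which can happen for predictions far outside the training range.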