Balance of predictor variables in partial least squares regression



I am using partial least squares regression (PLS) to model the relative effects of soil and weather variables on the magnitude of an annual phenomenon, nitrous oxide emissions. I am doing this on an annual basis across many sites.

Soils at each site are different, but soils within each site stay the same every year.

Weather is different at every site and different from year to year.

So, I have 56 responses (nitrous oxide), 56 corresponding weather predictors (one for each year), but only 10 soil predictors (one for each soil type).

I am using the pls package in R. I think I did an okay job pre-processing my data (BoxCox, center, scale). I end up with 17 columns of soil properties (the properties repeated for each year, not unique rows), 5 columns of weather variables (each row unique) and my dependent variable nitrous oxide (each row unique).

I just need one line of code:

plsFit<-plsr(nitrous_oxide ~ ., validation = "LOO", data=all_data)

Everything runs great, but I can't help but feel the lack of balance.

Should I really be treating soil and weather variables in the same way?

Applied Predictive Modeling has been extremely helpful in trying to do what I want to do, but boy do I have a lot of questions specific to my dataset.



I put this question on Cross Validated 2 weeks ago and only got 14 views and no answers.

My in-person stats sources are not familiar with PLS (but if I would just switch to PCA, they've got a bunch of advice).

Let me know if I can make this question clearer. I've put the code+data in a repo here.


I'm finally getting a chance to look at this.

I don't see much wrong with giving the sets of predictors equal (initial) weight in the regression. The fitting process will determine how much each should be involved. There are also sparse PLS (think lasso with PLS) that might be worth looking into. These would remove any predictors that were not influential in the model (as opposed to giving them relatively small coefficients).

I will also try to give you a slightly different workflow that would re-estimate the transformations within resample (just to see if that matters).

EDIT - removed the dumb question


Here's a tidy workflow that I used to see if estimating the transformations inside of resampling made a big difference (it didn't):


# create the recipe without the formula to more easily 
pls_rec <- recipe(total_N2O ~ ., data = annual_flux) %>%
  add_role(series, town, avg_N2O, log_avg_N2O,year, log_total_N2O,
           new_role = "other data") %>%
  step_YeoJohnson(all_predictors()) %>%
  step_center(all_predictors()) %>%

# functions to fit the model and get predictions + RMSE
pls_fit <- function(recipe, ...) {
  plsr(total_N2O ~ ., data = juice(recipe, all_predictors(), all_outcomes()))
pls_pred <- function(split, recipe, model, ...) {
  # get the holdout data via `assessment` then preprocess
  pred_data <- bake(recipe, newdata = assessment(split), all_predictors(), all_outcomes())

  # One outcome so drop the dimension to a 2d array
  pred_vals <- predict(model, newdata = pred_data)[,1,] %>% %>%
    bind_cols(assessment(split) %>% select(total_N2O)) %>%
    # drop the predictions per component into a long df
    gather(label, prediction, -total_N2O) %>%
    # convert "X comp" to a numeric X
      ncomp = gsub(" comps", "", label),
      ncomp = as.numeric(ncomp)
      ) %>%
    # add resampling labels

# setup a resampling scheme, estimate the recipes, fit the models and 
# make predictions, 
folds <- vfold_cv(annual_flux, repeats = 5) %>%
    recipes = map(splits, prepper, pls_rec, retain = TRUE),
    models = map(recipes, pls_fit, validation = "none"),
    pred = pmap(
      list(split = splits, recipe = recipes, model = models),

# get the RMSE values, summarize by #comps and plot

pred_vals <- folds %>% 
  pull(pred) %>% 

performance <- pred_vals %>%
  group_by(ncomp, id, id2) %>%
  do(metrics(., truth = total_N2O, estimate = prediction))

av_performance <- performance %>% 
  group_by(ncomp) %>%
  summarize(rmse = mean(rmse), rsq = mean(rsq))

ggplot(av_performance, aes(x = ncomp, y = rmse)) + 
  geom_point() + 
  geom_line() + 

We can compare this to what comes out of plsr (very similar pattern) via

# compare to orignal objects
original_pls_model <- 
    ncomp = 0:26,
    rmse = unname(RMSEP(plsFit)$val[1,1,]),
    rsq = unname(R2(plsFit)$val[1,1,])
  ) %>%
  filter(ncomp > 0)

# pretty consistent with the original model; values are different due 
# to different resamples (and more too)
ggplot(av_performance, aes(x = ncomp, y = rmse)) + 
  geom_point() + 
  geom_line() + 
  geom_point(data = original_pls_model, col = "red") + 
  geom_line(data = original_pls_model, col = "red") + 


Thanks! This was totally worth the wait.

There is a similar model out there making N2O predictions with only weather variables. I wanted to strengthen the model by adding soil variables. But when I rank the predictors using varImp, the weather variables fall to last with no importance. It's not good news for the other person's model, so I want to be extra careful!

Another set of eyes and the extra code is really appreciated!