What packages can I use for dimensionality reduction using PLS when there are

non-linear relationships?

I found this paper (https://pdfs.semanticscholar.org/aa15/11d22324ccb8f117dbc1c5383fa95bc1a51c.pdf)

but no mention of R packages.

Thank you.


Check out the `pls` package.

See the manual: https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf. It builds models on latent variables (it implements both PLS regression and principal component regression).

Sorry, but the `pls` package doesn't have what I am looking for. It seems to be for linearly related variables, not for non-linear ones.

@jlcomega, I believe you are mistaken. The entire point of teasing out latent variables is to deal with high-dimensional data that does not yield to ordinary linear analysis, and that is what `pls` does with PCA and its other techniques.

I haven't seen any non-linear partial least squares packages in R. The paper mentions that no comprehensive software was available at the time of writing:

> At the time of writing, I was not aware of a comprehensive software covering all of the described nonlinear PLS methods. A set of Matlab routines for kernel PLS is available upon request.

If you really need non-linear PLS, you may be stuck implementing it yourself, perhaps using the MATLAB code as a reference.
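To give a sense of what such an implementation involves, here is a minimal base-R sketch of kernel PLS for a univariate response, loosely following the NIPALS-style deflation described in the kernel PLS literature. The function name and defaults are my own inventions, not from any package, and this is a toy sketch rather than a vetted implementation:

```r
# A sketch of kernel PLS (KPLS) for a univariate response, base R only.
# RBF kernel; `ncomp` latent score vectors are extracted by NIPALS-style
# deflation of the Gram matrix. All names here are hypothetical.
kpls_scores <- function(x, y, ncomp = 2, sigma = 1) {
  n <- length(y)
  K <- exp(-as.matrix(dist(x))^2 / (2 * sigma^2))  # RBF Gram matrix
  J <- diag(n) - matrix(1 / n, n, n)               # centring projector
  K <- J %*% K %*% J                               # centre in feature space
  y <- matrix(y - mean(y))
  scores <- matrix(0, n, ncomp)
  for (a in seq_len(ncomp)) {
    t <- K %*% y
    t <- t / sqrt(sum(t^2))            # normalised latent score
    scores[, a] <- t
    P <- diag(n) - tcrossprod(t)       # deflate K and y
    K <- P %*% K %*% P
    y <- y - t %*% crossprod(t, y)
  }
  scores
}

# Toy example: a clearly non-linear relationship
set.seed(1)
x <- seq(-3, 3, length.out = 60)
y <- sin(x) + rnorm(60, sd = 0.1)
scores <- kpls_scores(x, y, ncomp = 2)
summary(lm(y ~ scores))$r.squared  # the latent components track sin(x) well
```

The extracted scores are orthonormal by construction, so a plain `lm()` on them gives a quick read on how much of the response the non-linear components capture.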

Can you describe your use case? My guess is that the least painful option is shoehorning the problem into some kind of GAM framework that `mgcv` can fit for you.


My background is not in math or statistics, but even after an extensive Google search I have the suspicion that non-linear PLS is not really discussed or explored by the statistics community.

**My use case:**

I have about 30-40 variables that potentially have an effect on a process outcome (response variable). I also know that time, and a few other variables correlated with time, affect my response. Some of these relationships are not linear (and they should never be linear, for example cell growth); instead they can be described by logistic, sigmoidal or negative-exponential functions of time. I could manually transform each of these, but (1) this can be time-consuming, (2) it is prone to error and personal bias, and (3) I only know the exact relationship in some cases, not all.
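As an illustrative aside: for curves like these, base R's `nls()` with self-starting models (`SSlogis` for logistic, `SSasymp` for negative-exponential saturation) can estimate the relationship directly, avoiding the manual transformation step. A sketch on simulated growth data (the numbers below are made up for the example):

```r
# Sketch: let nls() estimate a logistic time course directly, instead of
# hand-transforming the variable. SSlogis supplies its own starting values.
set.seed(42)
time <- seq(0, 48, by = 2)
# simulated cell growth: logistic curve (asymptote 100, midpoint 24h) + noise
growth <- 100 / (1 + exp((24 - time) / 5)) + rnorm(length(time), sd = 3)
fit <- nls(growth ~ SSlogis(time, Asym, xmid, scal))
coef(fit)  # estimates of asymptote, midpoint and scale
```

Because the self-starting model supplies initial parameter values, this tends to converge without hand-tuning, which addresses points (1) and (2) above.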

My understanding of PLS is that if you have a set of **linearly-correlated variables**, you can simplify the model to a handful of variables. I don't think my response really depends on 30 variables, but I am having a huge problem talking to my statistician colleagues, who don't seem to grasp the problem, while my biologist colleagues think of PLS as a magic tool for figuring out which variables are important for the response ('throw everything into PLS and you'll get the answer').

When I look at a scatter-plot matrix of the variables and calculate the linear correlations, the R² values are quite low, but some of the plots are obviously non-linear. Hence my questioning of my colleagues' approach. In fact, most of the multivariate analyses that we do either fail to identify the predictor variables or explain only about 40% of the response variance.
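One quick, base-R way to screen for this "low linear R² but visibly non-linear" pattern is to compare Pearson and Spearman correlations: a monotone but curved relationship keeps Spearman near 1 while Pearson drops. A sketch on synthetic data:

```r
# A monotone but strongly curved relationship: Pearson understates the
# association, while Spearman (rank-based) is unaffected by the curvature.
x <- seq(0, 5, length.out = 50)
y <- exp(x)                                 # curved, yet strictly monotone
pearson  <- cor(x, y)
spearman <- cor(x, y, method = "spearman")  # exactly 1 for monotone data
c(pearson = pearson, spearman = spearman)
```

A large gap between the two coefficients across your scatter-plot matrix flags exactly the pairs worth inspecting by eye.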

In this particular case, I was able to manually figure out the real contributors to my response and make a linear model containing just a few variables (not a great model, though). The dataset was also small, with about 30 rows/data points, so I am usually left with either splitting the dataset 70% training / 30% test or doing a k-fold validation. On one hand, you could say I was cherry-picking my variables. On the other hand, how should I do this in a more 'statistical' manner?
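For what it's worth, with ~30 rows a single 70/30 split wastes a lot of data; k-fold cross-validation keeps every row in play for both training and testing. A minimal base-R sketch, using `mtcars` and a plain `lm()` as stand-ins for your data and model:

```r
# Base-R k-fold cross-validation sketch: every row is used for both
# training and testing, which suits small data sets better than a
# single 70/30 split.
set.seed(1)
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))
rmse <- numeric(k)
for (i in seq_len(k)) {
  test  <- mtcars[folds == i, ]
  train <- mtcars[folds != i, ]
  fit   <- lm(mpg ~ wt + disp, data = train)
  rmse[i] <- sqrt(mean((test$mpg - predict(fit, test))^2))
}
mean(rmse)  # cross-validated RMSE
```

Swapping in whatever model you end up with (GAM, penalized regression, etc.) gives a fairer comparison between candidate variable sets than eyeballing a single fit.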

Sorry for the long reply; it's more of a rant.


With high-dimensional data, you're right, you need a way to find your way through n-space to **planes**, not lines. There's a dimension-reduction technique called principal component analysis that helps in doing this. See § 70.4 of the Harvard biostatistics department methods instruction course. The run-up is more theoretical, but the illustration of PCA will give you an idea of its power. A good text is *Exploratory Multivariate Analysis Using R, 2nd ed.* by Husson, Lê and Pagès: http://factominer.free.fr/bookV2/
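For a quick hands-on feel, PCA is available in base R via `prcomp()`; a sketch on `mtcars` showing how much variance the leading components carry:

```r
# Base-R PCA sketch: prcomp() rotates the data onto orthogonal directions
# ordered by variance explained; scaling first puts the variables on a
# comparable footing.
pca <- prcomp(mtcars, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 2)  # a handful of components carry most variance
```

Note, though, that plain PCA only captures linear structure, which is the crux of the original question.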

Okay. I would use a generalized additive model (GAM) to model the non-linear relationships between the features and the response (I'd use the `mgcv` package). As a starting point, I'd throw everything into the model and then plot the individual smooths from the GAM. I believe GAMs can penalize individual terms out of the model, though I don't know quite how this works. In any case, Simon Wood's book is the canonical reference.

Some starter R code:

```r
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-23. For overview type 'help("mgcv-package")'.
fit <- gam(
  mpg ~ s(wt) + s(disp),
  data = mtcars,
  select = TRUE  # allow penalizing a feature out of the model
)
plot(fit)
```

*Created on 2018-12-28 by the reprex package (v0.2.1)*

Another option would be to combine a non-linear modeling technique with some sort of feature selection. My first thought would be to use the group LASSO with natural splines. Some starter R code:

```r
library(grpreg)
#> Warning: package 'grpreg' was built under R version 3.5.1
library(splines)
formula <- mpg ~ ns(wt, 4) + ns(disp, 4)
x <- model.matrix(formula, mtcars)
y <- mtcars$mpg
groups <- attr(x, "assign")
fit <- cv.grpreg(x, y, groups)
summary(fit)
#> grLasso-penalized linear regression with n=32, p=9
#> At minimum cross-validation error (lambda=0.2968):
#> -------------------------------------------------
#>   Nonzero coefficients: 8
#>   Nonzero groups: 2
#>   Cross-validation error of 7.17
#>   Maximum R-squared: 0.80
#>   Maximum signal-to-noise ratio: 3.91
#>   Scale estimate (sigma) at lambda.min: 2.677
```

*Created on 2018-12-28 by the reprex package (v0.2.1)*

