Questions on PCR with regard to Output

loiacj · January 11, 2020, 12:21am

I'm working on an analytics project that involves using PCR. I'd like to know which aspects to pull out from the model (into data frames) in order to apply it to outside data in other programs (excel). I'm not sure how to take the current output and how to apply it. Will post my code below using the iris dataset as an example.

#installing package
install.packages("pls")
library(pls)

#importing iris dataset, dropping categorical variable
#for example, using Sepal.Length
test <- iris[,c(1:4)]
model <- pcr(Sepal.Length ~ ., data = iris, scale = TRUE, validation = "CV")

#code for highlighting one possible optimal number of components
numbercomponents <- selectNcomp(model, method = c("randomization", "onesigma"),
                                nperm = 999, alpha = 0.01, ncomp = model$ncomp,
                                plot = FALSE)

technocrat · January 11, 2020, 2:02am

The model object has quite a bit under the hood:

library(pls)
#> 
#> Attaching package: 'pls'
#> The following object is masked from 'package:stats':
#> 
#>     loadings

test <- iris[,c(1:4)]
model <- pcr(Sepal.Length ~ ., data = iris, scale = TRUE, validation = "CV")

summary(model)
#> Data:    X dimension: 150 5 
#>  Y dimension: 150 1
#> Fit method: svdpc
#> Number of components considered: 5
#> 
#> VALIDATION: RMSEP
#> Cross-validated using 10 random segments.
#>        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
#> CV          0.8308   0.5114   0.5067   0.3935   0.3288   0.3137
#> adjCV       0.8308   0.5110   0.5062   0.3930   0.3284   0.3130
#> 
#> TRAINING: % variance explained
#>               1 comps  2 comps  3 comps  4 comps  5 comps
#> X               56.20    88.62    99.07    99.73   100.00
#> Sepal.Length    62.71    63.58    78.44    84.95    86.73
str(model)
#> List of 19
#>  $ coefficients : num [1:5, 1, 1:5] -0.118 0.226 0.226 0.018 0.19 ...
#>   ..- attr(*, "dimnames")=List of 3
#>   .. ..$ : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Speciesversicolor" ...
#>   .. ..$ : chr "Sepal.Length"
#>   .. ..$ : chr [1:5] "1 comps" "2 comps" "3 comps" "4 comps" ...
#>  $ scores       : 'scores' num [1:150, 1:5] -2.21 -1.87 -2.04 -1.9 -2.28 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:150] "1" "2" "3" "4" ...
#>   .. ..$ : chr [1:5] "Comp 1" "Comp 2" "Comp 3" "Comp 4" ...
#>  $ loadings     : 'loadings' num [1:5, 1:5] -0.3024 0.5785 0.5788 0.0459 0.4866 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Speciesversicolor" ...
#>   .. ..$ : chr [1:5] "Comp 1" "Comp 2" "Comp 3" "Comp 4" ...
#>  $ Yloadings    : 'loadings' num [1, 1:5] 0.3912 -0.0608 -0.4415 1.1631 0.9457
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr "Sepal.Length"
#>   .. ..$ : chr [1:5] "Comp 1" "Comp 2" "Comp 3" "Comp 4" ...
#>  $ projection   : num [1:5, 1:5] -0.3024 0.5785 0.5788 0.0459 0.4866 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Speciesversicolor" ...
#>   .. ..$ : chr [1:5] "Comp 1" "Comp 2" "Comp 3" "Comp 4" ...
#>  $ Xmeans       : Named num [1:5] 7.014 2.129 1.573 0.705 0.705
#>   ..- attr(*, "names")= chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Speciesversicolor" ...
#>  $ Ymeans       : Named num 5.84
#>   ..- attr(*, "names")= chr "Sepal.Length"
#>  $ fitted.values: num [1:150, 1, 1:5] 4.98 5.11 5.05 5.1 4.95 ...
#>   ..- attr(*, "dimnames")=List of 3
#>   .. ..$ : chr [1:150] "1" "2" "3" "4" ...
#>   .. ..$ : chr "Sepal.Length"
#>   .. ..$ : chr [1:5] "1 comps" "2 comps" "3 comps" "4 comps" ...
#>  $ residuals    : num [1:150, 1, 1:5] 0.1228 -0.2129 -0.3458 -0.4986 0.0499 ...
#>   ..- attr(*, "dimnames")=List of 3
#>   .. ..$ : chr [1:150] "1" "2" "3" "4" ...
#>   .. ..$ : chr "Sepal.Length"
#>   .. ..$ : chr [1:5] "1 comps" "2 comps" "3 comps" "4 comps" ...
#>  $ Xvar         : Named num [1:5] 418.65 241.53 77.86 4.91 2.04
#>   ..- attr(*, "names")= chr [1:5] "Comp 1" "Comp 2" "Comp 3" "Comp 4" ...
#>  $ Xtotvar      : num 745
#>  $ fit.time     : Named num 0
#>   ..- attr(*, "names")= chr "elapsed"
#>  $ ncomp        : num 5
#>  $ method       : chr "svdpc"
#>  $ scale        : Named num [1:5] 0.436 1.765 0.762 0.473 0.473
#>   ..- attr(*, "names")= chr [1:5] "Sepal.Width" "Petal.Length" "Petal.Width" "Speciesversicolor" ...
#>  $ validation   :List of 9
#>   ..$ method      : chr "CV"
#>   ..$ pred        : num [1:150, 1, 1:5] 4.99 5.11 5.02 5.12 4.96 ...
#>   .. ..- attr(*, "dimnames")=List of 3
#>   .. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
#>   .. .. ..$ : chr "Sepal.Length"
#>   .. .. ..$ : chr [1:5] "1 comps" "2 comps" "3 comps" "4 comps" ...
#>   ..$ coefficients: NULL
#>   ..$ gammas      : NULL
#>   ..$ PRESS0      : Named num 104
#>   .. ..- attr(*, "names")= chr "Sepal.Length"
#>   ..$ PRESS       : num [1, 1:5] 39.2 38.5 23.2 16.2 14.8
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr "Sepal.Length"
#>   .. .. ..$ : chr [1:5] "1 comps" "2 comps" "3 comps" "4 comps" ...
#>   ..$ adj         : num [1, 1:5] 0.2544 0.2485 0.1473 0.1028 0.0908
#>   .. ..- attr(*, "dimnames")=List of 2
#>   .. .. ..$ : chr "Sepal.Length"
#>   .. .. ..$ : chr [1:5] "1 comps" "2 comps" "3 comps" "4 comps" ...
#>   ..$ segments    :List of 10
#>   .. ..$ V1 : int [1:15] 18 128 8 93 76 85 113 129 112 60 ...
#>   .. ..$ V2 : int [1:15] 59 26 149 16 41 33 75 38 19 114 ...
#>   .. ..$ V3 : int [1:15] 90 101 15 73 143 147 25 138 70 111 ...
#>   .. ..$ V4 : int [1:15] 14 40 146 11 69 42 115 148 51 30 ...
#>   .. ..$ V5 : int [1:15] 122 65 53 28 102 3 57 17 86 131 ...
#>   .. ..$ V6 : int [1:15] 106 56 105 79 120 133 9 71 144 123 ...
#>   .. ..$ V7 : int [1:15] 35 119 107 10 45 44 127 134 100 37 ...
#>   .. ..$ V8 : int [1:15] 92 23 132 2 34 109 150 104 139 62 ...
#>   .. ..$ V9 : int [1:15] 22 137 27 124 7 84 87 74 96 140 ...
#>   .. ..$ V10: int [1:15] 6 136 31 89 121 46 13 64 78 116 ...
#>   .. ..- attr(*, "incomplete")= num 0
#>   .. ..- attr(*, "type")= chr "random"
#>   ..$ ncomp       : num 5
#>  $ call         : language pcr(formula = Sepal.Length ~ ., data = iris, scale = TRUE, validation = "CV")
#>  $ terms        :Classes 'terms', 'formula'  language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
#>   .. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
#>   .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. ..$ : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
#>   .. .. .. ..$ : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#>   .. ..- attr(*, "term.labels")= chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#>   .. ..- attr(*, "order")= int [1:4] 1 1 1 1
#>   .. ..- attr(*, "intercept")= int 1
#>   .. ..- attr(*, "response")= int 1
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
#>   .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
#>   .. .. ..- attr(*, "names")= chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
#>  $ model        :'data.frame':   150 obs. of  5 variables:
#>   ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>   ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>   ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>   ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#>   ..- attr(*, "terms")=Classes 'terms', 'formula'  language Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species
#>   .. .. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
#>   .. .. ..- attr(*, "factors")= int [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
#>   .. .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. .. ..$ : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
#>   .. .. .. .. ..$ : chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#>   .. .. ..- attr(*, "term.labels")= chr [1:4] "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#>   .. .. ..- attr(*, "order")= int [1:4] 1 1 1 1
#>   .. .. ..- attr(*, "intercept")= int 1
#>   .. .. ..- attr(*, "response")= int 1
#>   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. .. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
#>   .. .. ..- attr(*, "dataClasses")= Named chr [1:5] "numeric" "numeric" "numeric" "numeric" ...
#>   .. .. .. ..- attr(*, "names")= chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
#>  - attr(*, "class")= chr "mvr"

^{Created on 2020-01-10 by the reprex package (v0.3.0)}

loiacj · January 11, 2020, 2:03am

Hi technocrat,

I know the model object has a bit under the hood, I'm just confused as to which parts to pull and how to apply it to the general data outside of the pcr function

technocrat · January 11, 2020, 2:12am

What to do with it in Excel is something I can't really help you with without knowing a lot more about your objectives.

As far as extracting info, though,

model$scores
model$coefficients

and so forth will extract the first level elements. If you need to extract deeper pieces, as from $model$validation,

library(pls)
#> 
#> Attaching package: 'pls'
#> The following object is masked from 'package:stats':
#> 
#>     loadings

test <- iris[,c(1:4)]
model <- pcr(Sepal.Length ~ ., data = iris, scale = TRUE, validation = "CV")

model$validation$segments
#> $V1
#>  [1]  49  96  12   4 102 135 142  43  63 136  76   2  26  47  99
#> 
#> $V2
#>  [1]  17  19 113 108 129  42 132  60  44 137  98  11  46  52 110
#> 
#> $V3
#>  [1] 143 144 133   5 146  65 130  22 149  40   6  54  61  85  21
#> 
#> $V4
#>  [1]  93  74 126  10 114  86 150  70  29  80 122  13  27  15  67
#> 
#> $V5
#>  [1]  95  18 141  81  33 125  35  88  62 106  30 148 117  32  14
#> 
#> $V6
#>  [1]   9  16  84  57  31   7  72  36  53 120  82  97 107 138  75
#> 
#> $V7
#>  [1]  58  68 111  41  83  90 127 131  51 123  89  79   3  59   1
#> 
#> $V8
#>  [1]  24  45  20  48  39  91  94 121   8 124  77 112  55 116  64
#> 
#> $V9
#>  [1]  34 100 134  56 118 109  25  50  92 103  78 147 115 140  37
#> 
#> $V10
#>  [1] 119 105  28 145  38 104  87  23 101  73 139  66 128  69  71
#> 
#> attr(,"incomplete")
#> [1] 0
#> attr(,"type")
#> [1] "random"

^{Created on 2020-01-10 by the reprex package (v0.3.0)}

loiacj · January 11, 2020, 2:24am

Hello,

Thank you for the insights. It was to my understanding I'd be able to run PCR to generate components which would be new factors for the regression instead of the original independent variables. I was just confused on which output from the model were the ones to be using.

Is it scores or coefficients or something else I'm supposed to be multiplying against the current independent variables (ex. Sepal.Width, Petal.Width, Petal.Length) to calculate the components for the regression?

Every guide I've looked up uses the predict function with the model developed in R, but I want to be able to take that formula and put it elsewhere, which I haven't been able to find an example of.

Basically I want to know how to reproduce the prediction results from the output of the model, so I can replicate it in excel in the tool I'm making

technocrat · January 11, 2020, 2:58am

The pls package has a prediction function that you will need to understand, and then take its output for export to other platforms

library(pls)
#> 
#> Attaching package: 'pls'
#> The following object is masked from 'package:stats':
#> 
#>     loadings
data(yarn)
nir.mvr <- mvr(density ~ NIR, ncomp = 5, data = yarn[yarn$train,])

## Predicted responses for models with 1, 2, 3 and 4 components
pred.resp <- predict(nir.mvr, ncomp = 1:4, newdata = yarn[!yarn$train,])

## Predicted responses for a single model with components 1, 2, 3, 4
predict(nir.mvr, comps = 1:4, newdata = yarn[!yarn$train,])
#>      density
#> 110 51.04992
#> 22  50.72019
#> 31  32.01454
#> 41  34.29076
#> 51  30.35994
#> 61  20.57832
#> 71  19.07786

## Predicted scores
predict(nir.mvr, comps = 1:3, type = "scores", newdata = yarn[!yarn$train,])
#>          Comp 1     Comp 2     Comp 3
#> 110  1.54411104  0.6112014 -0.2783797
#> 22   1.54887982 -0.3204970 -0.1990664
#> 31   0.03391126  1.4202763 -0.3941334
#> 41   0.30033040  0.2952664 -0.3835675
#> 51  -0.06780681 -1.1572169 -0.1909823
#> 61  -1.04444815  1.3565774 -0.1926097
#> 71  -0.90185306 -0.4988215 -0.3715940

^{Created on 2020-01-10 by the reprex package (v0.3.0)}

loiacj · January 11, 2020, 3:13am

I'm trying to avoid using the predict function.

I want to take the output from the PCR and calculate the components on my own with all their aspects so I can apply it to data on a different platform such as excel.

The closest thing I can relate it to is running a regression, looking at the coefficients, and then copying and pasting them elsewhere and using them with new data outside of R itself. I just need to know how to apply the PCR output in that similar format. I apologize for any confusion, this has been stumping me all day and I'm not too sure of most of it myself except for what the end result should be

technocrat · January 11, 2020, 4:31am

@loiacj, questions are absolutely the hardest part of data analysis.

There's a more direct way to get principal components than from the output of a pcr model,

PCs <- prcomp(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris, scale = TRUE)
PCs
#> Standard deviations (1, .., p=4):
#> [1] 1.7083611 0.9560494 0.3830886 0.1439265
#> 
#> Rotation (n x k) = (4 x 4):
#>                     PC1         PC2        PC3        PC4
#> Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
#> Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
#> Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
#> Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
summary(PCs)
#> Importance of components:
#>                           PC1    PC2     PC3     PC4
#> Standard deviation     1.7084 0.9560 0.38309 0.14393
#> Proportion of Variance 0.7296 0.2285 0.03669 0.00518
#> Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
plot(PCs)

biplot(PCs)

^{Created on 2020-01-10 by the reprex package (v0.3.0)}

The first plot demonstrates how principal components differ in their variance. The second is worth pausing over. It's a plot of all the iris data of the first (PC1) against the second (PC2) component showing vectors of the of the variables projected. This is key to understanding principal components--they are orthogonal. So, now we've gone from 2-space to n-space.

So, how to use principal components in a linear regression model? The only left-over variable, species, is categorical, and that would push us over to glm methods. So let's rerun and reserve Sepal.Length as the dependent variable.

PCs <- prcomp(~ Sepal.Width + Petal.Length + Petal.Width, data = iris, scale = TRUE)
y <- iris$Sepal.Width
fit <- lm(y~PCs$x[,1]+PCs$x[,2])
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ PCs$x[, 1] + PCs$x[, 2])
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -0.0139056 -0.0026317  0.0000775  0.0025267  0.0121509 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  3.0573333  0.0003664  8345.3   <2e-16 ***
#> PCs$x[, 1]  -0.1822434  0.0002466  -739.0   <2e-16 ***
#> PCs$x[, 2]  -0.3952146  0.0004262  -927.3   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.004487 on 147 degrees of freedom
#> Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
#> F-statistic: 7.029e+05 on 2 and 147 DF,  p-value: < 2.2e-16

^{Created on 2020-01-10 by the reprex package (v0.3.0)}

Does this point you in the right direction?

loiacj · January 11, 2020, 4:57am

Yes! I will review this more tomorrow (it's midnight where I am) and reply with any additional questions that I may have. If there are none, I will mark as resolved. Thank you!

system · January 18, 2020, 4:57am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.