Impute values to a dataframe before PCA - missMDA

Delphine · January 31, 2023, 9:27am

Hi!

I am trying to run a PCA on a dataset with genotypic and environmental information. I have about 30 phenotypic descriptors, and all of them have hundreds of NA values.
I tried to run the PCA with prcomp() and factoextra, but it does not accept missing values.

prcomp(colQ, scale = T)
Error in svd(x, nu = 0, nv = k) : infinite or missing values in 'x'

I tried to impute them with the missMDA package:

nb <- estim_ncpPCA(data, ncp.max=5)
comp <- imputePCA(data,
                  ncp=nb$ncp,
                  scale=TRUE)

Problem: when I try to use the imputed values, I get an error message:

prcomp(comp, scale = T)
Error in prcomp(as.numeric(comp), scale = T) : 
  'list' object cannot be coerced to type 'double'

This is because comp is, for some reason, a list of two numeric matrices of 1025 x 9 values
(my data is 1025 x 9).
One element of comp is “completeObs”, the other is “fittedX”.

I tried passing only one of the two elements to prcomp() at a time, followed by the usual PCA steps:

pca_fitted <- prcomp(comp$fittedX, scale = T)
pca_comp <- prcomp(comp$completeObs, scale = T)

summary(pca_fitted)
summary(pca_comp)

fviz_eig(pca_fitted)
fviz_eig(pca_comp)

But the results are completely different!

> summary(pca_fitted)
Importance of components:
                          PC1    PC2       PC3       PC4      PC5       PC6       PC7
Standard deviation     2.5236 1.6221 1.197e-14 1.789e-15 1.39e-15 6.524e-16 4.264e-16
Proportion of Variance 0.7076 0.2924 0.000e+00 0.000e+00 0.00e+00 0.000e+00 0.000e+00
Cumulative Proportion  0.7076 1.0000 1.000e+00 1.000e+00 1.00e+00 1.000e+00 1.000e+00
                             PC8       PC9
Standard deviation     3.133e-16 1.585e-16
Proportion of Variance 0.000e+00 0.000e+00
Cumulative Proportion  1.000e+00 1.000e+00

> summary(pca_comp)
Importance of components:
                         PC1    PC2    PC3     PC4     PC5     PC6    PC7     PC8     PC9
Standard deviation     1.518 1.2748 1.0158 0.94495 0.93199 0.86790 0.7672 0.69414 0.67380
Proportion of Variance 0.256 0.1806 0.1147 0.09921 0.09651 0.08369 0.0654 0.05354 0.05045
Cumulative Proportion  0.256 0.4365 0.5512 0.65041 0.74692 0.83062 0.8960 0.94955 1.00000

The resulting plots and analyses are also very different from each other.

Do you know whether I should use fittedX or completeObs for this analysis?
And do you know the difference between them?
Thank you very much for your help!

Del

nirgrahamuk · January 31, 2023, 10:55am

It's expected that you would use $completeObs; the example in the documentation does this.
Because it combines your non-missing values with the imputed values, it's the better choice.
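
For illustration, here is a minimal sketch of that workflow on simulated data. The data, the variable names (X_full, dat, is_missing) and the fixed ncp = 2 are assumptions made for this example; in the thread above, ncp comes from estim_ncpPCA().

```r
library(missMDA)

set.seed(42)
n <- 100

# Simulated data: 5 variables driven by 2 latent dimensions plus noise
# (illustrative only, not the data from this thread)
latent <- matrix(rnorm(n * 2), n, 2)
X_full <- latent %*% matrix(rnorm(2 * 5), 2, 5) +
  matrix(rnorm(n * 5, sd = 0.3), n, 5)
colnames(X_full) <- paste0("trait", 1:5)

# Set roughly 15% of the cells to NA
is_missing <- matrix(runif(n * 5) < 0.15, n, 5)
dat <- X_full
dat[is_missing] <- NA
dat <- as.data.frame(dat)

# imputePCA() returns a list; completeObs holds the observed values
# with only the NA cells filled in
comp <- imputePCA(dat, ncp = 2, scale = TRUE)

# Run the PCA on the completed matrix, not on the list itself
pca_comp <- prcomp(comp$completeObs, scale. = TRUE)
summary(pca_comp)
```

Passing comp itself to prcomp() fails because it is a list; the matrix to analyse is comp$completeObs.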

fittedX is what you would get if every value were replaced, observed ones included: it holds what imputePCA would have guessed for each cell.
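
To see why the PCA on fittedX above shows only two components with non-zero variance, the following base-R sketch mimics the idea with a plain iterative rank-k PCA imputation. This is not missMDA's actual algorithm (imputePCA uses a regularised variant by default), and the simulated data and all variable names are made up for illustration.

```r
set.seed(1)
n <- 200; p <- 6; k <- 2

# Simulated data: k latent dimensions plus noise (illustrative only)
scores   <- matrix(rnorm(n * k), n, k)
loadings <- matrix(rnorm(k * p), k, p)
X <- scores %*% loadings + matrix(rnorm(n * p, sd = 0.5), n, p)

# Knock out about 20% of the cells
miss <- matrix(runif(n * p) < 0.2, n, p)
X_na <- X
X_na[miss] <- NA

# Start from column-mean imputation
completed <- X_na
col_means <- colMeans(X_na, na.rm = TRUE)
completed[miss] <- col_means[col(X_na)[miss]]

# Alternate: fit a rank-k PCA, then refill only the missing cells
for (i in 1:50) {
  mu <- colMeans(completed)
  s  <- svd(sweep(completed, 2, mu))
  low_rank <- s$u[, 1:k] %*% diag(s$d[1:k], k) %*% t(s$v[, 1:k])
  fitted_x <- sweep(low_rank, 2, mu, "+")   # model's guess for every cell
  completed[miss] <- fitted_x[miss]         # observed cells stay untouched
}

# Count components with non-negligible standard deviation
n_nonzero_fitted    <- sum(prcomp(fitted_x,  scale. = TRUE)$sdev > 1e-8)
n_nonzero_completed <- sum(prcomp(completed, scale. = TRUE)$sdev > 1e-8)
```

The fitted matrix is a rank-k reconstruction, so a PCA on it can have at most k components with non-zero variance, whereas the completed matrix keeps the noise present in the observed cells and spreads variance over all components.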

Delphine · January 31, 2023, 11:44am

All right, thanks a lot.

Do you think I can trust it for other analyses as well (like a simple correlation analysis)?

nirgrahamuk · January 31, 2023, 11:46am

I don't think fittedX will be useful to you in any way, but I don't know.
