How to identify the features that contribute the most variance between classes?

I have a dataset with 10 columns, which are my features, and 1732 rows, which are my records. These records are divided into 15 classes, so I have several records for each class in my dataset. My goal is to identify the most important feature, the one that contributes the most variance between classes.

I'm trying to use PCA, but because there are several records per class, it's difficult to find the right way to use this method.

Is there another method I can use?

I would standardize the columns, train a glmnet model with family = "multinomial", and present the coefficients with the largest magnitude.

PCA is an unsupervised method, so I don't think it can accomplish your objective. In other words, you have a "y variable", but PCA "only works on the x's."

I have never heard of the glmnet model... can you give me an example with an R dataset like iris or mtcars? (a piece of code, a YouTube link, or whatever)

library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.1-3
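# standardize the four numeric predictors and fit a cross-validated multinomial lasso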
model <- cv.glmnet(scale(iris[, -5]), iris$Species, family = "multinomial")
print(coef(model))
#> $setosa
#> 5 x 1 sparse Matrix of class "dgCMatrix"
#>                        1
#> (Intercept)   0.19669592
#> Sepal.Length  .         
#> Sepal.Width   1.00800522
#> Petal.Length -4.85261722
#> Petal.Width  -0.02287104
#> 
#> $versicolor
#> 5 x 1 sparse Matrix of class "dgCMatrix"
#>                      1
#> (Intercept)  3.1178113
#> Sepal.Length 0.3429274
#> Sepal.Width  .        
#> Petal.Length .        
#> Petal.Width  .        
#> 
#> $virginica
#> 5 x 1 sparse Matrix of class "dgCMatrix"
#>                       1
#> (Intercept)  -3.3145072
#> Sepal.Length  .        
#> Sepal.Width  -0.7926093
#> Petal.Length  4.7671632
#> Petal.Width   5.3142399
plot(model)

Created on 2022-02-24 by the reprex package (v2.0.1)


Can you describe the meaning of this graph?

Yes. glmnet has two hyperparameters that control variable shrinkage: alpha and lambda. Of the two, lambda is the more important. A plot of cross-validated error vs. lambda shows the result of the tuning, and ideally has a "U" shape, indicating successful identification of a lambda value that minimizes cross-validated error by balancing under-fitting and over-fitting.
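
As a concrete illustration, here is a minimal sketch (assuming the model object from the cv.glmnet() fit above) of how to read off the tuned penalty:

# cross-validated deviance vs. log(lambda); the dotted vertical lines mark the two candidates
plot(model)
# lambda that minimizes cross-validated error
model$lambda.min
# largest lambda within one standard error of that minimum (glmnet's default for coef/predict)
model$lambda.1se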

OK, thank you! But my goal is to identify the features that bring the most variance to the dataset... for example, in the iris dataset, which among Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width creates the most difference between the classes setosa, versicolor, and virginica?

Could you just calculate the variance (or SD) and take the largest?

Yes, that is essentially what fitting the model and observing the standardized coefficients does. For example, for classifying Species setosa, Petal.Length has the largest-magnitude coefficient (-4.85), Sepal.Width is weaker (1.01), Petal.Width is close to zero, and Sepal.Length was dropped entirely.

Interpretation is a little tricky when there is strong multicollinearity.
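
If it helps, here is a minimal sketch (assuming the model object fitted above) that collects the standardized coefficients into one table and ranks the features by their largest absolute coefficient across the three classes:

# coef() on a multinomial fit returns one sparse column matrix per class
cf <- coef(model)
# drop the intercepts and bind the classes into a features x classes matrix
mat <- sapply(cf, function(m) as.vector(m)[-1])
rownames(mat) <- rownames(cf[[1]])[-1]
# rank features by their largest absolute standardized coefficient in any class
sort(apply(abs(mat), 1, max), decreasing = TRUE)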

In setosa, Petal.Length is -4.85 and Sepal.Width is 1.01... so do I have to consider the absolute value? What is the difference between a negative and a positive value? Is it like a correlation?

They're coefficients in logistic regressions. Increasing Petal.Length drastically decreases the probability of species setosa. Sorry that I can't answer all these questions in a few sentences; regression, coefficients, logistic regression, etc. are fundamental topics in statistics, and all together they would be taught over an entire semester.
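
To make the sign concrete, here is a minimal sketch (again assuming the model object from above): it takes one standardized observation, lengthens its petal by one standard deviation, and compares the predicted class probabilities.

x <- scale(iris[, -5])
newx <- x[c(1, 1), ]                                     # two copies of the first observation
newx[2, "Petal.Length"] <- newx[2, "Petal.Length"] + 1   # one SD longer petal
# the setosa probability drops for the longer-petaled copy (negative coefficient)
predict(model, newx = newx, type = "response")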

Thanks, now I understand!

Sorry, one last question... can I use this method even if I have some categorical variables?

Yes. Logistic regression is the statistical approach to predicting TRUE/FALSE values, predicting a category (PASS/FAIL, APPROVED/REJECTED, ...), or predicting multiple categories, either ordered (ordinal) or unordered (nominal).

Regularized regression (glmnet, lasso, ridge regression) is a more contemporary way to build linear models when there are many predictor variables with possible multicollinearity, something that traditionally was challenging and required very deliberate decisions.
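
For what it's worth, here is a minimal sketch (my own example, not part of the original answer) of handling a categorical predictor: glmnet needs a numeric matrix, so factors are dummy-encoded first, for example with model.matrix(). Note that glmnet standardizes the columns internally by default (standardize = TRUE).

library(glmnet)
df <- transform(mtcars, cyl = factor(cyl))   # treat cyl as a categorical predictor
x <- model.matrix(am ~ . - 1, data = df)     # dummy-encode the factor columns
y <- factor(df$am)                           # transmission type as the class label
fit <- cv.glmnet(x, y, family = "binomial")
coef(fit)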


Actually, there are many models you can use for your classification task. GLM models are one possibility, and you can use tree-based models as well. After that, take a look at the vip package, which gives you methods to discover the most important variables/features in your model. Read the documentation to get a feeling for how it works.
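
For example, here is a minimal sketch (my addition, using randomForest as one possible tree-based model) of pairing such a model with vip:

library(randomForest)
library(vip)
# a random forest classifier on iris; importance = TRUE stores importance scores
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
# model-specific variable importance plot extracted from the fitted forest
vip(rf)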

Thanks! This is very interesting. I found this code:

library(vip)
# Load the sample data
data(mtcars)

# Fit a projection pursuit regression model
model <- ppr(mpg ~ ., data = mtcars, nterms = 1)

# Construct variable importance plot
vip(model, method = "firm")

but I don't understand what "importance" in the graph means... how is it calculated?

There are no simple answers... Please see the package's vignette; you'll find all you need there. Be aware of the two main approaches: model-specific and model-agnostic ones. You've reached a point where there are no shortcuts, and you have to get familiar with all the nuances of those methods. Otherwise you won't be able to explain your model, understand its predictive capabilities, or describe the actual relationships among variables.
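
To illustrate the distinction, here is a minimal sketch (my own example; the metric and prediction wrapper are assumptions) of a model-agnostic permutation importance for the same ppr model, as a contrast to the model-specific importance above:

library(vip)
model <- ppr(mpg ~ ., data = mtcars, nterms = 1)
# model-agnostic: shuffle each feature and measure the increase in RMSE
vip(model, method = "permute", train = mtcars, target = "mpg",
    metric = "rmse",
    pred_wrapper = function(object, newdata) predict(object, newdata))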

