Warning glm.fit in pls logistic regression

Hey guys,
I am currently working on a project on pruning. I have three tables with up to 6,000 variables (it's text mining), and most variables only appear in a few observations. I am using a logistic PLS regression with the package plsRglm (I think this package requires glmnet to work, so it might be linked). But even after removing up to 90% of the least useful variables I get 50 of the following warnings: "glm.fit: fitted probabilities numerically 0 or 1 occurred". I tried to look it up, but the solutions I found on the internet were not useful in my case.

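Here is my code:

# Fit a PLS logistic regression with 6 components (nt = 6)
Reg4 <- plsRglm(Agregate.progression ~ ., Base_traitement4,
                modele = "pls-glm-logistic",
                control = glm.control(maxit = 100),
                nt = 6, pvals.expli = TRUE)
Reg4
# Final GLM fitted on the PLS components
finalmod4 <- Reg4$FinalModel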

Base_traitement3 has 77 variables for 43 observations.

I am a student in statistics and I still have a lot to learn. I'll be grateful if any of you can help me :slight_smile:
PS: I am not a native English speaker, so sorry if my English is far from perfect, and sometimes I don't get it when people use abbreviations of statistical terms.

My guess is that you have a data leak where at least one variable perfectly correlates with your target variable.
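
For illustration, here is a minimal sketch of how that warning arises. The vectors `x` and `y` are made-up toy data, not taken from the original tables. When one predictor separates the two classes perfectly, the maximum-likelihood coefficient is unbounded, and `glm` reports fitted probabilities of 0 or 1.

```r
# Toy data (hypothetical): x < 0 always gives y = 0, x > 0 always gives y = 1
x <- c(-5, -4, -3, -2, -1, 1, 2, 3, 4, 5)
y <- as.integer(x > 0)

# Collect the warnings raised while fitting instead of printing them
fit_warnings <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    fit_warnings <<- c(fit_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(fit_warnings)

# The slope estimate is very large because the likelihood keeps improving
# as the coefficient grows
print(coef(fit))
```

The same mechanism can apply to a linear combination of several predictors, so it may be worth checking more than pairwise correlations with the response.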

Lot to learn about statistics! Personally in that realm I'm an ant in the Amazon forest and after 12 years of chewing haven't progressed much past my first fallen tree.

English is a world language. Even those of us who spoke it first and first learned to read and write in it use it wildly differently. The only standard for English is whether two people can make the needed adjustment on both ends to make it work for the communication needed. And by that standard our versions of English (or, at least, yours) are superb. The only language rules that count in this community are the rules of R syntax.

I agree with @nirgrahamuk's assessment that many of the variables may show perfect collinearity, especially when they are perfectly correlated or composed of NAs.

But there are two other problems.

The first is that the number of covariates is far greater than the number of observations of the response, Agregate.progression.

Applied Logistic Regression, 3rd Edition, by David W. Hosmer Jr., Stanley Lemeshow and Rodney X. Sturdivant (2013) has an excellent treatment of covariate selection for this case in Chapter 4: the strategies of purposeful selection, stepwise selection (forward or backward) and best subsets. In all approaches, however, they caution against overfitting, which leads to numerically unstable results.

It seems in your case that 6,000 covariates may reflect a generous vocabulary. Given the goals of the analysis, are all of them required? To take a trivial example, the common stopwords in a text corpus are routinely discarded in natural language processing because their frequency overweights their scant semantic load. As the text authors note, the inclusion standard for any covariate is whether including it provides more information (as measured by a metric such as goodness of fit).

Even in the reduced case of a 43 x 77 set, the noise produced is deafening in its crying out for feature reduction. Unless there is an a priori domain principle for selecting candidate variables, the tradeoffs are too complex.

What seems promising to me, in principle (says the man who is not embarking upon it himself), is subset selection by bootstrap sampling of the candidate covariates a handful at a time, applying a goodness of fit test such as the Hosmer-Lemeshow test found in ResourceSelection::hoslem.test after first filtering by log likelihood. I've done subset testing on 10,000x20 data on an underpowered machine using that method, so subsets of 20 covariates are feasible. The number of combinations of 77 covariates taken 15 at a time, say, choose(77, 15) \approx 3.527931e+15, is too many to exhaust without an array of very large instance cloud resources operating in parallel (my guess). However, putting the survivors into a single- or double-elimination process may yield a useful selection.
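
A rough sketch of that idea, under stated assumptions: the data are simulated stand-ins (`X`, `y`, `dat` are hypothetical), subsets are drawn at random rather than exhaustively, and candidates are ranked by log likelihood and AIC using base R only. A goodness of fit test such as ResourceSelection::hoslem.test could then be applied to the best-ranked subsets; that step is left out here.

```r
set.seed(42)

# Simulated stand-in data (hypothetical): 200 observations, 30 covariates,
# only V1 and V2 actually drive the binary response
n <- 200
p <- 30
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
y <- rbinom(n, 1, plogis(1.0 * X$V1 - 1.0 * X$V2))
dat <- cbind(y = y, X)

# Size of the exhaustive search mentioned above: 77 covariates, 15 at a time
n_exhaustive <- choose(77, 15)

# Draw random subsets of k covariates, fit a logistic model on each,
# and record the log likelihood and AIC
k <- 3
n_draws <- 50
draws <- lapply(seq_len(n_draws), function(i) {
  vars <- sort(sample(names(X), k))
  fit <- glm(reformulate(vars, response = "y"), data = dat, family = binomial)
  data.frame(vars = paste(vars, collapse = "+"),
             logLik = as.numeric(logLik(fit)),
             AIC = AIC(fit))
})

# Rank the candidate subsets, best (lowest AIC) first
results <- do.call(rbind, draws)
results <- results[order(results$AIC), ]
head(results, 5)
```

The values of `k` and `n_draws` are arbitrary choices for the sketch; on a 43 x 77 table the subsets would need to stay small to avoid the same separation problem.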

Quick caveat: it's conceivable, but I don't know, that something akin to Bayesian autocorrelation may rear its head due to co-occurrences of n-grams in the text, words that are routinely followed or preceded by specific other words.

The second problem that I have in mind is data structure. It might be better to have the lexical tokens as rows and the number of occurrences as the sole variable, for the purpose of creating Bayesian priors and attacking from that flank. Rank speculation.

Finally, a question. I'm certain that some NLP packages address this class of problem. Are you open to searching among them for that functionality with me?

Hi,
First, thank you very much for your response.

About the data I used, the words come from the abstracts of medical papers, so the vocabulary is very rich and specific. The stop words were already deleted, but our variables are not only single words; sometimes a variable is a group of words, since the same word can have different meanings depending on the context.
My project is about pruning, so I run the regression on different samples of the data after deleting some of the variables. I deleted up to 85% of the variables.
So many of them appear in one observation only that I thought they would certainly lead to overfitting, so I deleted them. Then I tried deleting by quantile of frequency. Since medical vocabulary is so rich, most of the words only come back a few times in the data. For example, in a table of 2052 variables, half of them appear only once, and 80% of the variables appear 5 times or less in the data table.
I tried doing the regression on a very small sample of 20 variables but still had the warning; after a look at the table used, no variable was collinear with my response variable.
Is it really such a problem if there are more variables than observations? Isn't the PLS part of the regression supposed to deal with that?
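
A minimal sketch of the pruning step described above, followed by a check that goes beyond collinearity. The document-term matrix `dtm`, the response `y` and the threshold `min_docs` are all hypothetical toy values. A term does not have to be collinear with the response to trigger the warning: it is enough that every paper containing the term has the same response value, which is always the case for a term that appears in a single paper.

```r
# Toy document-term matrix (hypothetical): 6 papers (rows) x 4 terms (columns)
dtm <- matrix(
  c(1, 0, 2, 0,
    0, 0, 1, 1,
    3, 0, 0, 0,
    0, 1, 1, 0,
    1, 0, 0, 1,
    0, 0, 2, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, c("virus", "rare_term", "cell", "tumor"))
)
y <- c(1, 0, 1, 1, 1, 0)  # hypothetical response: 1 = paper reports a link

# Document frequency: number of papers in which each term occurs
doc_freq <- colSums(dtm > 0)

# Prune terms that occur in fewer than min_docs papers
min_docs <- 2
dtm_pruned <- dtm[, doc_freq >= min_docs, drop = FALSE]

# Flag terms whose presence always goes with the same response value;
# such terms push fitted probabilities towards 0 or 1
separates <- apply(dtm > 0, 2, function(present) {
  length(unique(y[present])) == 1
})

print(doc_freq)
print(colnames(dtm_pruned))
print(separates)
```

In this toy table `virus` survives the frequency pruning but is still flagged, because all three papers that contain it have the same response.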

For the second problem, my project is about predicting whether a research paper shows a link between lung cancer and viral infection. Each line is a paper, so I am not sure that changing the structure is really possible.

I am doing some research on subset selection and will try to apply it, to see how it turns out.

Once again, thank you for your response.

OK, gotcha. Let's see if I can summarize it abstractly.

  1. The goal is to develop a classification model that identifies scientific papers reporting a link between viral infection, as the treatment variable, and lung cancer, as the response variable.
  2. A criterion will be needed to identify those papers in the literature that deal with the subject matter at all, and an algorithm is needed to apply it. Possible strategies can rely on some combination of keyword, n-gram or other NLP content of the papers, the classification of the publication in which they appear, their subsequent citation history and machine learning approaches tested against gold standard human classification.
  3. A representative limited dataset of papers known to address the relationship of interest and papers known not to address it can provide a test set to tune the classification algorithm.
  4. A design decision must be made as to the desired precision and recall.
  5. In combination, the rich set of potential features in the source material and its meta-information (such as type of publication) may call for reduction of the features to be considered directly. For example, latent semantic analysis may provide a more tractable feature than individual lexical tokens.
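
Points 4 and 5 can be sketched in base R. This is only an illustration under assumptions: `dtm`, `actual` and `predicted` are hypothetical toy values, raw counts are used without any weighting, and latent semantic analysis is approximated by a truncated singular value decomposition with an arbitrary `k`.

```r
# Toy document-term matrix (hypothetical): 6 papers x 5 terms
dtm <- matrix(
  c(2, 1, 0, 0, 1,
    1, 2, 0, 0, 0,
    3, 1, 1, 0, 0,
    0, 0, 2, 1, 0,
    0, 0, 1, 2, 1,
    0, 1, 3, 1, 0),
  nrow = 6, byrow = TRUE
)

# Latent semantic analysis via truncated SVD: keep the first k dimensions
k <- 2
sv <- svd(dtm)
lsa_scores <- sv$u[, 1:k] %*% diag(sv$d[1:k])  # papers in the latent space

# Precision and recall for a binary classifier
precision_recall <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)  # true positives
  fp <- sum(predicted == 1 & actual == 0)  # false positives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

# Hypothetical gold standard labels and classifier output
actual    <- c(1, 1, 1, 0, 0, 0)
predicted <- c(1, 1, 0, 1, 0, 0)
pr <- precision_recall(actual, predicted)
print(pr)
```

The columns of `lsa_scores` could replace the thousands of term columns as covariates, which reduces the number of features well below the number of papers.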

It sounds like a fascinating and challenging project!
