Tidymodels - random forest

France · May 1, 2023, 10:33pm

Hello!

I am a researcher working on cancer data and I have a classification problem. One of the problems we have is distinguishing tumor cells from normal cells. For a subset of my data I know exactly which cells are tumor and which are normal, so the idea is to use these to train and test a classifier and then classify new data.

The data are in cells x genes format: each row is one cell and each column is a gene (predictor). Each cell is classified as tumor or normal according to the expression pattern of all the genes (columns) in that cell. Training and testing work well: the accuracy is 0.95 and all the metrics look good.

The only problem is that when I try to predict on the new data I need to classify, I get an error because not all the predictors (genes) I used to train and test the classifier are present in the new data:

Error in `validate_column_names()`:
! The following required columns are missing

Do all the predictors used to train and test the classifier always have to be present in the new data we want to classify?

Thanks
Francesco

hannah · May 2, 2023, 8:43am

Generally: yes. You train your model on the relationship between a set of predictors and the outcome. If you change the set of predictors, e.g. by taking several of them out of the set, you have a (potentially) different relationship, one which your model does not capture.

Why do you have fewer predictors for the new data? Generally, all the data should come from the same source. For example, if you train a model on one type of cell and then use it to predict cells of a different type, it may not work well because the underlying relationship is different.
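
To see exactly which genes are missing before calling `predict()`, a quick check along these lines may help. This is only a minimal base R sketch: the data frames `train_data` and `new_data`, the outcome column `label`, and the gene names are all made up for illustration, so substitute your own objects.

```r
# Hypothetical toy data: rows are cells, columns are genes (names are made up).
train_data <- data.frame(
  label  = factor(c("tumor", "normal", "tumor")),
  GENE_A = c(1.2, 0.1, 2.3),
  GENE_B = c(0.0, 3.1, 0.4),
  GENE_C = c(5.0, 0.2, 4.1)
)
new_data <- data.frame(
  GENE_A = c(0.9, 1.8),
  GENE_C = c(4.4, 0.3)
)

# Predictors are every training column except the outcome.
predictors <- setdiff(names(train_data), "label")

# Genes the model was trained on that are absent from the new data.
missing_genes <- setdiff(predictors, names(new_data))
print(missing_genes)
```

If `missing_genes` is not empty, the model cannot be applied as it is. One option is to retrain on only the genes that both data sets share; whether that is appropriate depends on why the genes are missing in the first place.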
