I am a researcher working with cancer data, and I have a classification problem: distinguishing tumor cells from normal cells. For a subset of my data I know exactly which cells are tumor and which are normal, so the idea is to use this subset to train and test a classifier and then use it to classify new data.
The data are in cells x genes format: each row is one cell and each column is a gene (predictor). Each cell is classified as tumor or normal according to the expression pattern of all the genes (columns) in that cell. Training and testing work well: the accuracy is 0.95 and all the metrics look good.
The only problem is that when I try to predict on the new data I need to classify, I get an error, because not all the predictors (genes) I used to train and test the classifier are present in the new data:
Error in `validate_column_names()`: ! The following required columns are missing
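To make the mismatch concrete, here is a minimal sketch of the kind of check I assume is failing. It is written in plain Python for illustration only and is not my actual pipeline: the helper `missing_predictors` and the gene names are hypothetical.

```python
def missing_predictors(training_columns, new_columns):
    """Return the training predictors that are absent from the new data,
    in the order they appeared during training."""
    present = set(new_columns)
    return [col for col in training_columns if col not in present]


# Hypothetical gene names, for illustration only.
training_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
new_data_genes = ["GENE_A", "GENE_C", "GENE_E"]

missing = missing_predictors(training_genes, new_data_genes)
print("Missing predictors:", missing)
```

In this toy case two of the four training genes are absent from the new data, and the new data also contains a gene the classifier never saw. Only the absent training genes appear in the list, which matches the situation I am describing.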
Do all the predictors used to train and test the classifier always have to be present in the new data we want to classify?