I trained a random forest (RF) model and a KNN model; the former gave an r-squared of 0.90 and the latter an r-squared of 0.86. The RF model could predict 95% of the testing data, and the KNN model could predict 92%.
There are inconsistencies in the variable importance: the most important variable in the RF model is ranked 4th or 5th in the KNN model's VIP() calculations, and vice versa. What do you think about this inconsistency? Is this situation generally common?
You should expect some inconsistencies between KNN and random forest in both variance explained and variable importance. Depending on how complex your data problem and features are, the RF model might have gauged the importance better. It is hard to say which one captured the importance best, given that we know nothing about the size of your data or how it was collected, and there is no good, completely independent test set to compare against.
Overall, random forest tends to give more consistent variable importance than KNN, but this comes with a big disclaimer: it depends on a lot of other factors. The more important thing to determine is whether your original importance estimates change drastically across cross-validation folds. If they do, you should evaluate the ranges of those estimates and how much they overlap, and you will likely have to conclude that you cannot consistently determine the most important variables.
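To make the "ranges and overlap" check concrete, here is a minimal sketch in plain Python (not the caret workflow itself). It assumes you have already extracted one importance score per variable per fold; the `folds` numbers and the helper names `importance_ranges`, `ranges_overlap` and `separable_ranking` are hypothetical and only there for illustration. Using the min-to-max interval across folds is one simple choice among several; a bootstrap interval would be another.

```python
from statistics import mean


def importance_ranges(fold_importances):
    """Return {variable: (min, mean, max)} of importance across folds."""
    variables = fold_importances[0].keys()
    return {
        v: (
            min(f[v] for f in fold_importances),
            mean(f[v] for f in fold_importances),
            max(f[v] for f in fold_importances),
        )
        for v in variables
    }


def ranges_overlap(a, b):
    """True if the [min, max] intervals of two variables intersect."""
    return a[0] <= b[2] and b[0] <= a[2]


def separable_ranking(ranges):
    """Rank by mean importance and list adjacent pairs whose ranges overlap."""
    ordered = sorted(ranges, key=lambda v: ranges[v][1], reverse=True)
    ties = [
        (x, y)
        for x, y in zip(ordered, ordered[1:])
        if ranges_overlap(ranges[x], ranges[y])
    ]
    return ordered, ties


# Hypothetical per-fold importance scores (illustrative numbers only).
folds = [
    {"x1": 100.0, "x2": 80.0, "x3": 30.0},
    {"x1": 90.0, "x2": 95.0, "x3": 25.0},
    {"x1": 105.0, "x2": 85.0, "x3": 35.0},
]

ranges = importance_ranges(folds)
ordered, ties = separable_ranking(ranges)
print("ranking by mean importance:", ordered)
print("adjacent pairs that cannot be separated:", ties)
```

Any pair that shows up in `ties` is a pair whose order you probably should not read much into, even if the mean ranking looks clear.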
Thank you for the reply. I cross-validated the RF model by using the CV argument in the caret::train function, with the number of folds (K) between 2 and 10. The variables are ranked in the same order regardless of K. However, I noticed something odd when I refined the model by keeping the five most important variables out of the 10 in total. In the variable importance results of the refined model (with 5 explanatory variables), the second-best variable from the 10-variable model outranked the most important variable.
That can definitely happen. In the 10-variable model, the second-best variable likely shared "information" with other variables in the set, but in the reduced set it contributed more than the other 4 that were included. Therefore, you can end up in a situation where the most important variable differs depending on what else is included. Once again, it depends on how big that difference was to begin with, because your exact number is technically an estimate and there will be a range around it. This is why order effects are so dangerous in ordinary backward/forward stepwise regressions. This is where I typically prefer to do best subset regressions and see which set of n independent variables (IVs) best explains the dependent variable (DV).
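As a rough illustration of the best subset idea, here is a small self-contained sketch in plain Python. It uses ordinary least squares rather than random forest, and the data are simulated under assumptions of my own choosing: `x2` is nearly a copy of `x1`, and `y` depends only on `x1` and `x3`. The helpers `solve`, `r_squared` and `best_subsets` are hypothetical names, not functions from any package. The point is only to show how a variable that shares information with another one adds little once its twin is already in the model.

```python
import itertools
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of a least squares fit of y on the given columns plus an intercept."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def best_subsets(data, y):
    """For each subset size, return (R^2, names) of the best-fitting subset."""
    names = sorted(data)
    best = {}
    for size in range(1, len(names) + 1):
        scored = [
            (r_squared([data[n] for n in combo], y), combo)
            for combo in itertools.combinations(names, size)
        ]
        best[size] = max(scored)
    return best


# Simulated data (an assumption for illustration): x2 is almost a copy of x1.
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [v + 0.1 * rng.gauss(0, 1) for v in x1]
x3 = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * a + c + 0.5 * rng.gauss(0, 1) for a, c in zip(x1, x3)]

data = {"x1": x1, "x2": x2, "x3": x3}
best = best_subsets(data, y)
for size, (r2, combo) in best.items():
    print(size, combo, round(r2, 3))
```

Because every subset of a given size is scored, the result does not depend on the order in which variables were added or dropped, which is the weakness of stepwise procedures. The cost is that the number of subsets grows quickly, so exhaustive search like this is only practical for a modest number of predictors.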
Thank you so much for your quick replies, it means a lot to me.
Also, I was wondering if independent variables are highly correlated, would that weaken the output of the random forest model? I mean, can I still trust the r-squared and other metrics of an RF model, despite the presence of multicollinearity?