I have tried my best to understand the caret package documentation, but I still have the following problems:
On page 12, the document gives an example to explain variable importance. It states that the agreement is 126/146 = 0.863 and the adjusted agreement is (126-85)/(146-85).
It also says: "An overall measure of variable importance is the sum of the goodness of split measures for each split for which it was the primary variable, plus goodness * (adjusted agreement) for all splits in which it was a surrogate."
Question1: Where do the "126" and "85" come from?
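For reference, this is the part of the arithmetic I can reproduce, as a minimal Python sketch. The variable names are my own placeholders, not terms from the document; in particular, I do not know what the 85 counts, which is exactly what Question1 asks.

```python
# Numbers quoted from the page 12 example.
agree = 126     # numerator of the agreement
total = 146     # denominator of the agreement
baseline = 85   # quantity subtracted in the adjusted agreement (placeholder name)

agreement = agree / total
adjusted_agreement = (agree - baseline) / (total - baseline)

print(round(agreement, 3))           # 0.863
print(round(adjusted_agreement, 3))  # 0.672
```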
Question2: What is the goodness of split measure? How can I calculate it?
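To make my reading of the quoted sentence explicit, here it is in symbols. The notation is my own, not the document's, and I am assuming that "goodness" in the surrogate term means the goodness of the primary split at that node; please correct me if that is wrong.

```latex
\mathrm{VI}(x) \;=\; \sum_{s \in P(x)} G_s \;+\; \sum_{s \in S(x)} G_s \, a_s(x)
```

Here P(x) is the set of splits where x is the primary variable, S(x) is the set of splits where x is a surrogate, G_s is the goodness of split measure at split s, and a_s(x) is the adjusted agreement of x at split s.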
On page 24, it gives the formula for calculating cp, where R is the risk. It says that, in a regression tree, cp can be seen as a difference in R-squared.
Question3: What is the risk?
Question4: If my tree is a classification tree, how should I interpret cp? (As I recall, logistic regression does not have an R-squared.)
Question5: In example1 below, I found that the cp of node 1 is (1-0.6851852)/(3-0)=0.10493827 and that of node 2 is (0.6851852-0.6296296)/(4-3)=0.05555556.
But why does example2 below show the cp of node 1 as 0.009070295 instead of (1-0.7823129)/(20-0)=0.01088436?
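The calculation I am doing by hand can be sketched in Python as follows. The function name and argument names are my own labels for the numbers above, and the formula is only my reading of the document, not necessarily how rpart computes cp internally.

```python
def cp_by_hand(err_before, err_after, size_before, size_after):
    """Drop in error divided by the increase in tree size (my reading of the formula)."""
    return (err_before - err_after) / (size_after - size_before)

# example1: these match the cp values that R prints.
ex1_node1 = cp_by_hand(1.0, 0.6851852, 0, 3)
ex1_node2 = cp_by_hand(0.6851852, 0.6296296, 3, 4)

# example2: this does NOT match the printed value.
ex2_node1 = cp_by_hand(1.0, 0.7823129, 0, 20)
printed_ex2_node1 = 0.009070295

print(ex1_node1, ex1_node2)
print(ex2_node1, printed_ex2_node1, ex2_node1 - printed_ex2_node1)
```

The last line prints my hand-computed value next to the value R reports, together with the gap between them that I cannot explain.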
Question6: If my data has no missing values, do I still need the surrogate variables?
If not, why does R still print the surrogate variables and the related information?
Actually, my goal is to calculate the cp, improve, (adjusted) agreement, and variable importance by hand rather than relying on the software.
Sorry for asking so many questions. I really want to understand the rpart package better.
I hope you can help me. Thank you very much!