Decision tree rpart() summary: variable importance, improve, agree, adj, and AUC

starstarstar1039 · February 18, 2019, 6:47am

I have some questions about the rpart() summary.
This picture is part of my rpart() summary.

Question 1:
How are variable importance and improve calculated, and how should I interpret them in the summary of rpart()?

Question 2:
What are agree and adj in the summary of rpart()?

Question 3:
Can I get the AUC of a tree from rpart()? If so, how?

mara · February 18, 2019, 11:24am

I believe the answers to your questions are in the introductory vignette to rpart (e.g. Section 3.4 Variable Importance). It's not very long, so definitely worth taking a look:

Another tutorial you might find helpful:
https://freakonometrics.hypotheses.org/tag/rpart

Max · February 20, 2019, 4:13am

Do you mean the area under the ROC curve? If so, there's no good way to get that directly from this model object (without doing some resampling).

starstarstar1039 · February 20, 2019, 5:54am

I see. It mentions that variable importance is calculated from improve, but how is improve calculated?

starstarstar1039 · February 20, 2019, 5:56am

Yes, I mean the area under the ROC curve.
In general, I think a tree is a kind of classification method, so it should have an ROC curve.

Is my thinking unreasonable?

mara · February 20, 2019, 12:29pm

Improve is part of the model in the case example they're using. Here's another example tutorial with rpart, it might help you to read two different cases to distinguish between what aspects are about the example itself, versus inherent to the functionality of rpart:

Max · February 20, 2019, 5:16pm

Section 3.4 of the document that was linked by @mara. Take a few minutes and read that.

Yes and no. You have a model object that has no connection to the area under the ROC curve. If you want to get that, you need to make predictions and then calculate that.

The problem is that, if you just re-predict the training set, the AUC will be optimistically biased, because the tree has already seen those data points. Resampling is your best approach to estimating it correctly. I suggest taking a look at the caret package to do that (but please read the docs before asking questions about that).
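
As a rough illustration of "make predictions and then calculate", here is a minimal base-R sketch that does not use caret. It assumes the kyphosis data set that ships with rpart, a 5-fold split, and a hand-written helper named auc_rank (a hypothetical name, not part of rpart). Each observation is scored by a tree that did not see it during fitting, and the AUC is then computed from the ranks of those scores.

```r
library(rpart)

# Rank-based (Mann-Whitney) estimate of the area under the ROC curve.
# auc_rank is a helper written for this sketch, not an rpart function.
auc_rank <- function(score, is_pos) {
  r <- rank(score)                 # ties receive average ranks
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
k <- 5                             # number of folds (an arbitrary choice)
fold <- sample(rep(seq_len(k), length.out = nrow(kyphosis)))
oof <- numeric(nrow(kyphosis))     # out-of-fold predicted probabilities

for (i in seq_len(k)) {
  # Fit on every fold except fold i
  fit <- rpart(Kyphosis ~ Age + Number + Start,
               data = kyphosis[fold != i, ], method = "class")
  # Predict the held-out fold; keep the probability of class "present"
  oof[fold == i] <- predict(fit, newdata = kyphosis[fold == i, ],
                            type = "prob")[, "present"]
}

auc_cv <- auc_rank(oof, kyphosis$Kyphosis == "present")
auc_cv
```

A single small tree produces only a few distinct predicted probabilities, so the ROC curve has few steps and many tied scores. The rank formula handles ties by averaging.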

starstarstar1039 · March 3, 2019, 11:23am

I am sorry, but I have already tried my best to understand the caret package documentation.

I have the following questions about the document:

First,
On page 12, it gives an example to explain variable importance. It says that the agreement is 126/146 = 0.863 and the adjusted agreement is (126-85)/(146-85).
In addition, it says that "An overall measure of variable importance is the sum of the goodness of split measures for each split for which it was the primary variable, plus goodness * (adjusted agreement) for all splits in which it was a surrogate."
Question 1: Where do the "126" and "85" come from?
Question 2: What is the goodness of split measure? How can I calculate it?

Second,
On page 24, it gives the formula to calculate cp. It says that R is the risk, and that in a regression tree we can see cp as the difference in R-squared.
Question 3: What is the risk?
Question 4: If my tree is a classification tree, how can I explain the cp? (I remembered that logistic regression does not have R-squared)
Question 5: Please look at example 1 below. I found that the cp of node 1 is (1-0.6851852)/(3-0)=0.10493827 and the cp of node 2 is (0.6851852-0.6296296)/(4-3)=0.05555556.
But why does example 2 below show the cp of node 1 as 0.009070295 instead of (1-0.7823129)/(20-0)=0.01088436?

Question 6:
If my data does not have any missing values, do I still need the surrogate variables?
If I don't, why does R still print the surrogate variables and other information about them?

Actually, my goal is to calculate the cp, improve, (adjusted) agreement and variable importance by hand rather than by computer.

Sorry, I have so many questions. I really want to understand the rpart package better.
I hope you all can help me. Thank you very much!

Max · March 5, 2019, 12:02am

The text says

The numbers come from the table: 126 = 42 + 84 is the number of data points that agree between the original and surrogate splits. 85 = 84 + 1 is the number of data points on the right-hand side of the grade split.
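
The arithmetic can be checked directly. This small sketch uses only the counts quoted in this thread. The variable names are made up for illustration, and the reading of 85 as the majority side of the primary split is an assumption based on the "go with the majority" wording in the vignette.

```r
# Counts taken from the example discussed in this thread
n_total    <- 146      # observations used at the split
n_agree    <- 42 + 84  # primary and surrogate splits send these the same way
n_majority <- 84 + 1   # observations on the larger (right-hand) side of the primary split

# Agreement: share of observations the surrogate sends the same way
agreement <- n_agree / n_total

# Adjusted agreement: improvement over always sending points to the majority side
adj_agreement <- (n_agree - n_majority) / (n_total - n_majority)

round(agreement, 3)      # 0.863
round(adj_agreement, 3)  # 0.672
```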

For classification, the goodness of split measure is based on the Gini index (section 3.1).

The risk equation is defined in section 2 on page 5. It is a function of a probability, p(i|A), times a loss function, L(i, \tau(A)). The probability weights the loss function by its probability of occurrence for each class. The loss function depends on the type of tree. For classification, it is typically the Gini statistic.
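
To make the Gini-based measure concrete, here is a minimal sketch with made-up class counts (they do not come from the vignette or from the poster's data). It assumes, following my reading of the vignette, that the improve value printed for a classification split is n times the decrease in impurity. The helper name gini is hypothetical.

```r
# Gini impurity of a node from its class counts
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Hypothetical class counts (class A, class B) before and after a split
parent <- c(40, 60)
left   <- c(30, 10)
right  <- c(10, 50)

n   <- sum(parent)
p_l <- sum(left) / n    # share of observations sent left
p_r <- sum(right) / n   # share of observations sent right

# Decrease in impurity produced by the split
delta <- gini(parent) - p_l * gini(left) - p_r * gini(right)

# Assumed scaling: improve = n * decrease in impurity
improve <- n * delta
improve
```

Comparing a value computed this way with the improve column in summary() for the first split of your own tree is a quick check of whether this scaling matches your fit. Priors and losses change the impurity calculation, so the match is only expected with default settings.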

Generally, you can't. It isn't an interpretable number and its units are not very relatable. Basically, cp is a measure of how deep the tree is. Values around zero mean that the tree is as deep as possible and values around 0.1 mean that there was probably a single split or no split at all (depending on the data set).

(I remembered that logistic regression does not have R-squared)

Actually, there are R^2 measures for logistic regression, but that's beside the point.

I have no idea. I don't know where most of those numbers have come from since we don't have a reprex.

It computes the surrogates in case the data that you predict with has missing values. It prints that detail out because that's what the maintainer thought people might want to see.

You can stop them from being computed using rpart.control.
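
A minimal sketch of that option, assuming the kyphosis data set that ships with rpart: setting maxsurrogate = 0 in rpart.control() tells rpart not to search for surrogate splits.

```r
library(rpart)

# Fit a classification tree without searching for surrogate splits
fit_nosurr <- rpart(Kyphosis ~ Age + Number + Start,
                    data = kyphosis, method = "class",
                    control = rpart.control(maxsurrogate = 0))

# The summary no longer lists "Surrogate splits" for each node
summary(fit_nosurr)
```

Because the variable importance described in section 3.4 includes a contribution from surrogate splits, the importance values from a fit like this can differ from those of a default fit.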

It's not a problem. I suggest that you read the original book on CART. That has all of the details about the algorithm.

starstarstar1039 · March 14, 2019, 8:35am

Excuse me, I have another question.
When the rpart package builds a tree, it uses 10-fold cross-validation by default.
My question is whether this 10-fold CV is for the whole tree or for every split.
If it is for the tree model, there should be ten tree models, so how is the final model decided?
If it is for every split, how should I interpret it?

mara · March 14, 2019, 12:10pm

The implementation of cross-validation in rpart is discussed in 4.2 Cross-validation in the documentation that I linked to in my first reply (same below).

The CART book that Max linked to is highly recommended for understanding and interpreting this, but there are other resources (many freely available online) at the bottom of the tutorials below, for example (which are helpful unto themselves):
https://uc-r.github.io/regression_trees
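
For orientation, here is a minimal sketch of how the cross-validation results are usually used, assuming the kyphosis data set that ships with rpart. The tree returned by rpart() is fitted to all of the data. The cross-validation only fills in the xerror and xstd columns of the cp table, one row per candidate tree size. Choosing the row with the smallest xerror, as done here, is one common convention rather than the only rule.

```r
library(rpart)

set.seed(1)  # the cross-validation folds are random
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class")

# One row per candidate subtree: CP, nsplit, rel error, xerror, xstd
cp_table <- fit$cptable
print(cp_table)

# Pick the cp value with the lowest cross-validated error
best <- which.min(cp_table[, "xerror"])

# Prune the full-data tree back to that complexity
pruned <- prune(fit, cp = cp_table[best, "CP"])
pruned
```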

starstarstar1039 · March 18, 2019, 7:31am

Dear all,
To be honest, I'm not a native English speaker, so I sometimes cannot understand what the documentation is telling me. But I do my best to figure it out. Therefore, I hope you all can explain things to me in different ways (simpler sentences or vocabulary and so on). Thanks!
Below are some previous questions that I still do not understand, along with some new questions.

In the document,
Question 1:
In 3.2 Incorporating losses, I cannot fully understand 3.2.2 Altered priors, or why it is better than 3.2.1 Generalized Gini index. Maybe I will see the reason once I understand the altered priors.

Question 2:
In 3.4 Variable importance, the text says "only those whose utility is greater than the baseline "go with the majority" surrogate". What does that mean?
Additionally, when it calculates the adjusted agreement, why does it subtract 85 (84+1), which is the number of data points on the right-hand side of the grade split? Why not the left-hand side (42+16+3)?

Question 3:
In step 1 of 4.2 Cross-validation, how are I1 to Im computed, and what do all the betas mean?
In step 3 of 4.2 Cross-validation, I cannot understand any of it. How is the final tree decided?

THANK YOU ALL!
