Decision Tree Rpart() Summary Interpretation

blackish952 · June 20, 2018, 12:53pm

Here is what the data frame looks like after I trimmed away the details that would not make sense in my model:

     str(df)
    'data.frame':	991205 obs. of  6 variables:
     $ cust_prog_level   : Factor w/ 14 levels "B","C","D","E",..: 9 7 5 9 10 5 5 12 9 10 ...
     $ CUST_REGION_DESCR : Factor w/ 8 levels "CORPORATE REGION",..: 3 3 3 6 3 3 3 3 3 3 ...
     $ ACCTG_MONTH_KEY   : int  201801 201709 201804 201803 201801 201705 201712 201801 201803 201705 ...
     $ Sales             : num  150.2 75.1 76.2 135 150.2 ...
     $ New_Product_Type  : Factor w/ 2 levels "Not_PL","PL": 1 1 1 1 1 1 1 1 1 1 ...
     $ MAJOR_CATEGORY_KEY: Factor w/ 26 levels "AIR","AML","ANS",..: 23 23 23 23 23 23 23 23 23 23 ...

My model is as follows:

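 library(rpart)   # rpart()
 library(rattle)  # fancyRpartPlot()
 
 set.seed(500)
 nobs <- nrow(df)
 # 70% of the rows for training, the rest held out for testing.
 train <- sample(nobs, floor(0.7 * nobs))
 test <- setdiff(seq_len(nobs), train)
 
 # Build the Decision Tree model.
 fit <- rpart(New_Product_Type ~ .,
              data = df[train, ],
              method = "class")
 
 fancyRpartPlot(fit, main = "Decision Tree Graph")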

My goal is to learn from the analysis which avenue to invest in so that more people choose "PL" instead of "Not_PL".

Here is the summary:

Call:
rpart(formula = New_Product_Type ~ ., data = df[train, ], method = "class")
  n= 693843 

         CP nsplit rel error    xerror        xstd
1 0.2127376      0 1.0000000 1.0000000 0.001963200
2 0.1113809      1 0.7872624 0.7872624 0.001809865
3 0.0100000      3 0.5645007 0.5645377 0.001590637

Variable importance
MAJOR_CATEGORY_KEY              Sales    cust_prog_level 
                74                 25                  1 

Node number 1: 693843 observations,    complexity param=0.2127376
  predicted class=Not_PL  expected loss=0.2721696  P(node) =1
    class counts: 505000 188843
   probabilities: 0.728 0.272 
  left son=2 (423101 obs) right son=3 (270742 obs)
  Primary splits:
      MAJOR_CATEGORY_KEY splits as  LRRRLRRRLLLLLLLLLRLLLLLL-R, improve=80999.46000, (0 missing)
      Sales              < 28.655   to the right, improve=58633.41000, (0 missing)
      cust_prog_level    splits as  LLLLLLLLRLLLRR, improve= 2998.43000, (0 missing)
      CUST_REGION_DESCR  splits as  LLRRRRRL, improve=  725.28710, (0 missing)
      ACCTG_MONTH_KEY    < 201706.5 to the left,  improve=   80.10278, (0 missing)
  Surrogate splits:
      Sales           < 28.645   to the right, agree=0.706, adj=0.246, (0 split)
      cust_prog_level splits as  LLLLLLLLLLLLLR, agree=0.613, adj=0.009, (0 split)

Node number 2: 423101 observations
  predicted class=Not_PL  expected loss=0.07890551  P(node) =0.6097936
    class counts: 389716 33385
   probabilities: 0.921 0.079 

Node number 3: 270742 observations,    complexity param=0.1113809
  predicted class=PL      expected loss=0.4258076  P(node) =0.3902064
    class counts: 115284 155458
   probabilities: 0.426 0.574 
  left son=6 (203647 obs) right son=7 (67095 obs)
  Primary splits:
      MAJOR_CATEGORY_KEY splits as  -RRL-LLR---------L-------L, improve=25580.510000, (0 missing)
      Sales              < 30.31    to the right, improve=23028.650000, (0 missing)
      cust_prog_level    splits as  LLLLLLLLRLLLRR, improve= 1592.459000, (0 missing)
      CUST_REGION_DESCR  splits as  LLRRRRRL, improve=  338.500700, (0 missing)
      ACCTG_MONTH_KEY    < 201801.5 to the right, improve=    3.721797, (0 missing)
  Surrogate splits:
      Sales           < 7.095    to the right, agree=0.812, adj=0.241, (0 split)
      cust_prog_level splits as  LLLLLLLLLLLLRL, agree=0.752, adj=0.000, (0 split)

Node number 6: 203647 observations,    complexity param=0.1113809
  predicted class=Not_PL  expected loss=0.4494346  P(node) =0.2935059
    class counts: 112121 91526
   probabilities: 0.551 0.449 
  left son=12 (95969 obs) right son=13 (107678 obs)
  Primary splits:
      Sales              < 57.265   to the right, improve=10319.280000, (0 missing)
      cust_prog_level    splits as  LLLLLLLLRLRLLR, improve= 1514.172000, (0 missing)
      CUST_REGION_DESCR  splits as  LLRRLRRL, improve=  204.100600, (0 missing)
      MAJOR_CATEGORY_KEY splits as  ---R-LL----------R-------R, improve=   86.075000, (0 missing)
      ACCTG_MONTH_KEY    < 201706.5 to the left,  improve=    5.466954, (0 missing)
  Surrogate splits:
      cust_prog_level    splits as  RRLRRLLRRLRRLR, agree=0.561, adj=0.068, (0 split)
      MAJOR_CATEGORY_KEY splits as  ---L-LL----------R-------R, agree=0.539, adj=0.022, (0 split)

Node number 7: 67095 observations
  predicted class=PL      expected loss=0.04714211  P(node) =0.09670055
    class counts:  3163 63932
   probabilities: 0.047 0.953 

Node number 12: 95969 observations
  predicted class=Not_PL  expected loss=0.2808303  P(node) =0.1383152
    class counts: 69018 26951
   probabilities: 0.719 0.281 

Node number 13: 107678 observations
  predicted class=PL      expected loss=0.4002953  P(node) =0.1551907
    class counts: 43103 64575
   probabilities: 0.400 0.600 

n= 693843 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 693843 188843 Not_PL (0.72783036 0.27216964)  
   2) MAJOR_CATEGORY_KEY=AIR,ASP,CBL,CEM,CMP,CRN,END,FNP,GYP,HND,IMP,OTH,P&P,PRE,RTC,SME,UCL 423101  33385 Not_PL (0.92109449 0.07890551) *
   3) MAJOR_CATEGORY_KEY=AML,ANS,ASE,B&D,BLE,C&P,INS,XRY 270742 115284 PL (0.42580760 0.57419240)  
     6) MAJOR_CATEGORY_KEY=ASE,B&D,BLE,INS,XRY 203647  91526 Not_PL (0.55056544 0.44943456)  
      12) Sales>=57.265 95969  26951 Not_PL (0.71916973 0.28083027) *
      13) Sales< 57.265 107678  43103 PL (0.40029532 0.59970468) *
     7) MAJOR_CATEGORY_KEY=AML,ANS,C&P 67095   3163 PL (0.04714211 0.95285789) *

Here is the graph:

Here are my questions:

Question 1: From Node number 1, what does this mean?

left son=2 (423101 obs) right son=3 (270742 obs)

How to interpret the Primary Splits?

Primary splits:
      MAJOR_CATEGORY_KEY splits as  LRRRLRRRLLLLLLLLLRLLLLLL-R, improve=80999.46000, (0 missing)
      Sales              < 28.655   to the right, improve=58633.41000, (0 missing)
      cust_prog_level    splits as  LLLLLLLLRLLLRR, improve= 2998.43000, (0 missing)
      CUST_REGION_DESCR  splits as  LLRRRRRL, improve=  725.28710, (0 missing)
      ACCTG_MONTH_KEY    < 201706.5 to the left,  improve=   80.10278, (0 missing)
  Surrogate splits:
      Sales           < 28.645   to the right, agree=0.706, adj=0.246, (0 split)
      cust_prog_level splits as  LLLLLLLLLLLLLR, agree=0.613, adj=0.009, (0 split)

For MAJOR_CATEGORY_KEY, there are 26 levels:

> levels(MAJOR_CATEGORY_KEY)
 [1] "AIR " "AML " "ANS " "ASE " "ASP " "B&D " "BLE " "C&P " "CBL " "CEM " "CMP " "CRN " "END "
[14] "FNP " "GYP " "HND " "IMP " "INS " "OTH " "P&P " "PRE " "RTC " "SME " "UCL " "UNK " "XRY "

So can I assume that in the split criterion containing this string:

 MAJOR_CATEGORY_KEY splits as  LRRRLRRRLLLLLLLLLRLLLLLL-R, improve=80999.46000, (0 missing)

the letters "LRRR..." show where each level of MAJOR_CATEGORY_KEY goes, i.e., the first level goes left, the second level goes right, and so on?

Question 2: For Node Number 2

Node number 1: 693843 observations,    complexity param=0.2127376
  predicted class=Not_PL  expected loss=0.2721696  P(node) =1
    class counts: 505000 188843
   probabilities: 0.728 0.272 
  left son=2 (423101 obs) right son=3 (270742 obs)
  Primary splits:
      MAJOR_CATEGORY_KEY splits as  LRRRLRRRLLLLLLLLLRLLLLLL-R, improve=80999.46000, (0 missing)
      Sales              < 28.655   to the right, improve=58633.41000, (0 missing)
      cust_prog_level    splits as  LLLLLLLLRLLLRR, improve= 2998.43000, (0 missing)
      CUST_REGION_DESCR  splits as  LLRRRRRL, improve=  725.28710, (0 missing)
      ACCTG_MONTH_KEY    < 201706.5 to the left,  improve=   80.10278, (0 missing)
  Surrogate splits:
      Sales           < 28.645   to the right, agree=0.706, adj=0.246, (0 split)
      cust_prog_level splits as  LLLLLLLLLLLLLR, agree=0.613, adj=0.009, (0 split)
Node number 2: 423101 observations
  predicted class=Not_PL  expected loss=0.07890551  P(node) =0.6097936
    class counts: 389716 33385
   probabilities: 0.921 0.079

P(node) = 0.6097936 means the chance of getting from node 1 to node 2 is about 61%. From the Node number 1 splitting criteria, if Sales >= 28.645, then I go to the left. Combining these two pieces of information, I would say that if Sales < 28.645, there is about a 39% chance I will go to the right branch?

Does it sound right?

Question 3: What does a surrogate split do?

Max · June 20, 2018, 7:57pm

That's a lot of detail. You would be better served by looking at the simpler output from print.rpart.

Q1: "son" tells you the number of the child node below that split. The "obs" numbers are how many of the training observations fall on each side.

The primary splits are the best candidate split for each variable, ranked by improvement. The first one listed is the split that was actually used.

Yes, each letter corresponds to one factor level, in order.
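To make that mapping concrete, here is a minimal sketch that pairs the node 1 split string with the factor levels posted above (trailing spaces trimmed). As far as I can tell from how rpart prints categorical splits, "-" marks a level that has no observations in that node. The variable names are just for illustration.

```r
# Factor levels in the order reported by levels(), trailing spaces trimmed.
lv <- c("AIR", "AML", "ANS", "ASE", "ASP", "B&D", "BLE", "C&P", "CBL",
        "CEM", "CMP", "CRN", "END", "FNP", "GYP", "HND", "IMP", "INS",
        "OTH", "P&P", "PRE", "RTC", "SME", "UCL", "UNK", "XRY")

# Split string copied from node 1 of summary(fit).
split_string <- "LRRRLRRRLLLLLLLLLRLLLLLL-R"

# One character per level: L = left child, R = right child, - = not in node.
direction <- strsplit(split_string, "")[[1]]
by_dir <- split(lv, direction)

by_dir[["L"]]  # levels sent to node 2
by_dir[["R"]]  # levels sent to node 3
by_dir[["-"]]  # levels without observations in this node
```

The "L" and "R" groups should match the level lists printed for nodes 2 and 3 in the compact tree output.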

Q2: Not quite. The actual first split is on MAJOR_CATEGORY_KEY. Looking at the first node's output:

left son=2 (423101 obs) right son=3 (270742 obs)

There is a 423101/(423101+270742) = 61% chance that a random data point would go down the path to node #2.
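The same arithmetic in R, using only counts copied from the summary above. It also reproduces the "expected loss" reported for node 2, which is simply the share of the minority class in that node.

```r
# Observation counts copied from "Node number 1" in summary(fit).
n_root  <- 693843
n_left  <- 423101  # node 2
n_right <- 270742  # node 3

# P(node) is the share of training rows that reach each child.
p_left  <- n_left / n_root
p_right <- n_right / n_root
round(c(node2 = p_left, node3 = p_right), 7)

# Expected loss of node 2: misclassified rows (PL) over rows in the node.
loss_node2 <- 33385 / n_left
round(loss_node2, 8)
```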

It is confusing because it is showing you the actual split and what the runners-up were.

Q3: Surrogate splits are used when there are missing predictor values. Another split that approximates the original split's results can be used if its values are not missing.

This page has some good information on that.
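For the "agree" and "adj" columns, here is a rough check, assuming rpart's usual definition of adjusted agreement: how much better the surrogate mimics the primary split than the naive rule "always send the row with the majority". The inputs are copied from node 1; "agree" is only printed to three decimals, so the result matches the reported 0.246 approximately.

```r
# Counts from node 1: rows sent left and right by the primary split.
n_left  <- 423101
n_right <- 270742

# Baseline: always follow the majority direction.
majority <- max(n_left, n_right) / (n_left + n_right)

# Agreement of the Sales surrogate with the primary split, as printed.
agree <- 0.706

# Adjusted agreement: improvement over the majority rule, rescaled to [0, 1].
adj <- (agree - majority) / (1 - majority)
round(adj, 3)
```

Note that every split in the summary says "(0 missing)", so no training row actually needed a surrogate here.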

blackish952 · June 20, 2018, 9:12pm

@Max
Max:
Thank you very much for spending your time with my post.
I have more questions that need your help to clarify.

Question 4: Is there a reason why the node numbers do not go in order? They are 1, 2, 3, 6, 7, 12, 13.

Question 5

Not quite. The actual first split is on MAJOR_CATEGORY_KEY. Looking at the first node's output:

left son=2 (423101 obs) right son=3 (270742 obs)
There is a 423101/(423101+270742) = 61% chance that a random data point would go down the path to node #2.

It is confusing because it is showing you the actual split and what the runners-up were

Can you please explain to me what the last statement means? What were the "runners-up"?
Why is it confusing when the plot shows me the actual split?

Question 6: I noticed that in my plot, the levels of MAJOR_CATEGORY_KEY listed below the first node do not include all the levels. I counted 17 levels below Node 1 (I forgot to mention that this plot did not include 4 levels) and 5 levels below Node 3, whereas I know there are 26 levels in total.

Why does this happen?

Question 7: To sum up, I should read the tree based on the Primary Splits, correct?

If so, please spend some time reading my interpretation, because I need to present this in a meeting and I do not want to look like a fool:
From Node 1

MAJOR_CATEGORY_KEY splits as  LRRRLRRRLLLLLLLLLRLLLLLL-R 
#there are 25 levels here if I did not miscount. Again, I have 26 levels in total.
> levels(MAJOR_CATEGORY_KEY)
 [1] "AIR " "AML " "ANS " "ASE " "ASP " "B&D " "BLE " "C&P " "CBL " "CEM " "CMP " "CRN " "END "
[14] "FNP " "GYP " "HND " "IMP " "INS " "OTH " "P&P " "PRE " "RTC " "SME " "UCL " "UNK " "XRY "

So I do not know how to split this when I am missing one level.

CUST_REGION_DESCR  splits as  LLRRRRRL, improve=  725.28710, (0 missing)

This seems right because I have 8 levels for this factor so I will just split the 8 levels in order as the string "LLRRRRRL" indicates.

My concern is: do I split my tree based on the Sales?

 Sales              < 28.655   to the right, improve=58633.41000, (0 missing)

If so, it does not make sense if you read Node 3

Primary splits:
      MAJOR_CATEGORY_KEY splits as  -RRL-LLR---------L-------L, improve=25580.510000, (0 missing)
      Sales              < 30.31    to the right, improve=23028.650000, (0 missing)
# Well, from Node 1, if Sales < 29$, we go to Node 3. 
# So how come the splitting criteria here is that 
#   if Sales again are less than 30.31$, we go right.
# This does not make sense because from Node 1,
#   once you go right, all the elements in Node 3 have prices less than 29$

Am I understanding this correctly?

Question 8: For the terminal nodes, for example Node 7:

    PL
0.05 0.95
  10%   ----> Again, I know you explained to me what this percentage means, but I still don't understand how 
             this node helps me decide whether a statement like "if an item falls from Node 1 through Node 3 
             down to Node 7, then there is a good chance the customer will buy House Brand if we do X Y Z" is true.

Max · June 20, 2018, 9:44pm

Q4: Yes. rpart numbers nodes by their position in the tree: the children of node n are 2n and 2n+1. Node 2 is terminal, so there are no nodes 4 and 5, and node 6 splits into nodes 12 and 13. A separate oddity is in your cptable results, where there is no tree with 2 splits:

          CP nsplit rel error    xerror        xstd
 1 0.2127376      0 1.0000000 1.0000000 0.001963200 
 2 0.1113809      1 0.7872624 0.7872624 0.001809865 
 3 0.0100000      3 0.5645007 0.5645377 0.001590637

That's because the 2-split tree is never the best tree for any value of the complexity parameter, so it is not reported. Splitting node 3 only pays off together with the split of node 6, which is why both nodes show the same complexity param (0.1113809).
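Both points can be checked by hand. The sketch below assumes rpart's heap-style numbering (the parent of node k is k %/% 2) and rebuilds the "rel error" column from the misclassified counts in the printed tree; the variable names are mine. It also shows why the 2-split tree drops out: under cost-complexity pruning, splitting node 3 alone gains less per split than splitting nodes 3 and 6 together.

```r
# Node numbers as printed for this tree.
nodes <- c(1, 2, 3, 6, 7, 12, 13)
data.frame(node = nodes, parent = nodes %/% 2, depth = floor(log2(nodes)))

# Misclassified counts ("loss") copied from the printed tree.
loss <- c("1" = 188843, "2" = 33385, "3" = 115284, "6" = 91526,
          "7" = 3163, "12" = 26951, "13" = 43103)
root <- loss[["1"]]

# rel error = total loss over the terminal nodes, relative to the root.
rel_0 <- loss[["1"]] / root
rel_1 <- (loss[["2"]] + loss[["3"]]) / root
rel_2 <- (loss[["2"]] + loss[["6"]] + loss[["7"]]) / root  # not in cptable
rel_3 <- (loss[["2"]] + loss[["12"]] + loss[["13"]] + loss[["7"]]) / root
round(c(rel_0, rel_1, rel_2, rel_3), 7)

# Gain in rel error per added split, starting from the 1-split tree.
gain_split_3       <- rel_1 - rel_2        # split node 3 only
gain_split_3_and_6 <- (rel_1 - rel_3) / 2  # split nodes 3 and 6 together
round(c(gain_split_3, gain_split_3_and_6), 7)
```

The second gain should reproduce the CP value in row 2 of the cptable, and rel_0 - rel_1 the CP value in row 1.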

Q5: You thought that Sales was part of the split but it wasn't. The output that you were shown has a lot of information about what the split could have been. I find that confusing.

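Q6: The split string has one character per factor level, in the order given by levels(): "L" sends the level left, "R" sends it right, and "-" marks a level that has no observations in that node. The string ends in "LLLL-R", so the 25th level (UNK) does not occur in the training data. The plot most likely lists only the levels that take the left ("yes") branch at each split, which would explain the 17 levels under node 1 and the 5 under node 3.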

Q7: Yes.

This is the output that you should focus on:

1) root 693843 188843 Not_PL (0.72783036 0.27216964)  
   2) MAJOR_CATEGORY_KEY=AIR,ASP,CBL,CEM,CMP,CRN,END,FNP,GYP,HND,IMP,OTH,P&P,PRE,RTC,SME,UCL 423101  33385 Not_PL (0.92109449 0.07890551) *
   3) MAJOR_CATEGORY_KEY=AML,ANS,ASE,B&D,BLE,C&P,INS,XRY 270742 115284 PL (0.42580760 0.57419240)  
     6) MAJOR_CATEGORY_KEY=ASE,B&D,BLE,INS,XRY 203647  91526 Not_PL (0.55056544 0.44943456)  
      12) Sales>=57.265 95969  26951 Not_PL (0.71916973 0.28083027) *
      13) Sales< 57.265 107678  43103 PL (0.40029532 0.59970468) *
     7) MAJOR_CATEGORY_KEY=AML,ANS,C&P 67095   3163 PL (0.04714211 0.95285789) *

I count 25 categories in the lines for nodes 2 and 3 (17 and 8); the 26th level, UNK, has no observations in the training data. Sales was used in a split (see nodes 12 and 13) and the major category was split on twice (which is perfectly valid).

Q8: Instead of looking at that output, look at the tree in my answer to Q7. The numbers in parentheses there tell you the class probabilities for each of the terminal nodes. For example, if a new sample had category "AIR", it would fall into node #2:

   2) MAJOR_CATEGORY_KEY=AIR,ASP,CBL,CEM,CMP,CRN,END,FNP,GYP,HND,IMP,OTH,P&P,PRE,RTC,SME,UCL 423101  33385 Not_PL (0.92109449 0.07890551) *

"(0.92109449 0.07890551) " means that there is a 92.1% probability that the new sample has the class "Not_PL".

The value before these proportions is the predicted class for that terminal node ("Not_PL" in this case).
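If you would rather have R report these numbers, predict() returns them directly. The sketch below uses the kyphosis data set that ships with rpart as a stand-in, since I don't have your data; fit_demo and new_rows are placeholder names.

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the poster's data here.
fit_demo <- rpart(Kyphosis ~ Age + Number + Start,
                  data = kyphosis, method = "class")

print(fit_demo)  # compact tree; * marks terminal nodes

new_rows <- kyphosis[1:3, ]

# One column per class: the proportions shown in parentheses for the leaf.
prob <- predict(fit_demo, new_rows, type = "prob")
# The predicted class printed just before those proportions.
pred_class <- predict(fit_demo, new_rows, type = "class")

prob
pred_class

# Node number of the leaf each training row ended up in.
leaf <- rownames(fit_demo$frame)[fit_demo$where[1:3]]
leaf
```

With your model the equivalent calls would be predict(fit, df[test, ], type = "prob") and type = "class".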

Is that what you were asking?

blackish952 · June 20, 2018, 10:10pm

@Max
Max: Thank you so much for the info.

One thing that I would say about my tree is that I was hoping it would include the Customer Rewards Levels and the Customer Region because in the back of my head, I always take the prime example of the tennis game which is a very well-known example for Decision Tree Analysis. I believe it has some attributes such as wind, temperature, humidity etc.

I am not sure if I am confident enough to present this model.
I will read through the final output you mentioned above and start thinking more about this.
Thanks a lot, once again!

Max · June 20, 2018, 10:12pm

No problem. Good luck!

blackish952 · June 21, 2018, 1:27pm

(Note: I took off the @user. I did not mean to do this so I can bump my post up!)
Max:
I have further questions.
Question 9: You suggested I should read the final output of summary(fit). Here it is and it does not make sense to me.

 1) root 693843 188843 Not_PL (0.72783036 0.27216964)  
   2) MAJOR_CATEGORY_KEY=AIR,ASP,CBL,CEM,CMP,CRN,END,FNP,GYP,HND,IMP,OTH,P&P,PRE,RTC,SME,UCL 423101  33385 Not_PL (0.92109449 0.07890551) *
   3) MAJOR_CATEGORY_KEY=AML,ANS,ASE,B&D,BLE,C&P,INS,XRY 270742 115284 PL (0.42580760 0.57419240)  
     6) MAJOR_CATEGORY_KEY=ASE,B&D,BLE,INS,XRY 203647  91526 Not_PL (0.55056544 0.44943456)  
      12) Sales>=57.265 95969  26951 Not_PL (0.71916973 0.28083027) *
      13) Sales< 57.265 107678  43103 PL (0.40029532 0.59970468) *
     7) MAJOR_CATEGORY_KEY=AML,ANS,C&P 67095   3163 PL (0.04714211 0.95285789) *

The reason why it does not make sense to me is that:

From Node 3 to Node 6, for instance, the decision tree says if a customer buys ASE, they will tend to buy a PL product. Correct?
Now in Node 6, the tree contradicts itself and says No, No! if a customer buys ASE, they will tend to buy a non-PL product.
Am I missing something here?
The detailed decision tree does not make sense.
Node 1 says:

Sales              < 28.655   to the right, improve=58633.41000, (0 missing)

If so, it does not make sense if you read Node 3

Primary splits:
      MAJOR_CATEGORY_KEY splits as  -RRL-LLR---------L-------L, improve=25580.510000, (0 missing)
      Sales              < 30.31    to the right, improve=23028.650000, (0 missing)

Well, from Node 1, if Sales < 29$, we go to Node 3.
So how come the splitting criteria here at Node 3 is that if Sales again are less than 30.31$, we go right.
This does not make sense because from Node 1, once you go right, all the elements in Node 3 have prices less than 29$

Max · June 22, 2018, 2:46pm

Yes, but not really. Look at the probabilities for node 6: 55% vs 45%. That's a weak split and it was most likely kept because node 12 adds some predictive performance. It's basically an interaction between the category key and the sales variables.

Really, don't look at these. The other primary splits listed are what would have happened if there were not a better split. Also, node 3 and node 1 do not use the same data, so you can't make apples-to-apples comparisons. Node 3 is a smaller subset that may be enriched with data from one class.

Sales was not used in node 1 or 3.

blackish952 · June 22, 2018, 2:57pm

Max:
So, just look at the final output?

1) root 693843 188843 Not_PL (0.72783036 0.27216964)  
   2) MAJOR_CATEGORY_KEY=AIR,ASP,CBL,CEM,CMP,CRN,END,FNP,GYP,HND,IMP,OTH,P&P,PRE,RTC,SME,UCL 423101  33385 Not_PL (0.92109449 0.07890551) *
   3) MAJOR_CATEGORY_KEY=AML,ANS,ASE,B&D,BLE,C&P,INS,XRY 270742 115284 PL (0.42580760 0.57419240)  
     6) MAJOR_CATEGORY_KEY=ASE,B&D,BLE,INS,XRY 203647  91526 Not_PL (0.55056544 0.44943456)  
      12) Sales>=57.265 95969  26951 Not_PL (0.71916973 0.28083027) *
      13) Sales< 57.265 107678  43103 PL (0.40029532 0.59970468) *
     7) MAJOR_CATEGORY_KEY=AML,ANS,C&P 67095   3163 PL (0.04714211 0.95285789) *

From Node 1 to Node 2 and Node 3, there are no separate splitting criteria. I just follow what the Node 2 and Node 3 lines say.

Yes, but not really. Look at the probabilities for node 6: 55% vs 45%. That's a weak split and it was most likely kept because node 12 adds some predictive performance. It's basically an interaction between the category key and the sales variables.

So after all, how do I traverse the decision tree? For instance, what is the criterion for going from Node 3 to Node 6?
I cannot find a good interpretation of why there is a path from Node 3 to Node 6.

For Node 7, can I conclude that if your product keys are AML, ANS, C&P, people will tend to buy PL for these keys?

Thanks, Max! I appreciate it.

Max · June 22, 2018, 3:21pm

Node 6 is just a more refined version of node 3. Terminal node 12 is:

if
   MAJOR_CATEGORY_KEY=ASE,B&D,BLE,INS,XRY & 
   Sales >= 57.265
then 
   Not_PL with probability 71.9%

Terminal node 13 is:

if
   MAJOR_CATEGORY_KEY=ASE,B&D,BLE,INS,XRY & 
   Sales < 57.265
then 
   PL with probability 60.0%

Node 7 is just

if
   MAJOR_CATEGORY_KEY=AML,ANS,C&P
then 
   PL with probability 95.3%

Node 2 is pretty simple and definitively Not_PL with 92.1% probability.
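Put together, the whole tree is a short chain of if/else rules. The function below is a hand-written sketch of the printed tree, not something rpart produces: the name classify_by_hand is made up, and the thresholds and probabilities are copied (rounded) from the print output. Stopping on an unknown category is my own simplification, since UNK never appears in the training data.

```r
# Hand-written version of the printed tree (illustrative only).
classify_by_hand <- function(category, sales) {
  node2_levels <- c("AIR", "ASP", "CBL", "CEM", "CMP", "CRN", "END", "FNP",
                    "GYP", "HND", "IMP", "OTH", "P&P", "PRE", "RTC", "SME",
                    "UCL")
  if (category %in% node2_levels) {
    list(node = 2, class = "Not_PL", prob = 0.921)
  } else if (category %in% c("AML", "ANS", "C&P")) {
    list(node = 7, class = "PL", prob = 0.953)
  } else if (category %in% c("ASE", "B&D", "BLE", "INS", "XRY")) {
    # Only these five categories are split further, on Sales.
    if (sales >= 57.265) {
      list(node = 12, class = "Not_PL", prob = 0.719)
    } else {
      list(node = 13, class = "PL", prob = 0.600)
    }
  } else {
    stop("category not seen in the training data: ", category)
  }
}

classify_by_hand("ASE", sales = 100)  # lands in node 12
classify_by_hand("ASE", sales = 20)   # lands in node 13
classify_by_hand("AML", sales = 20)   # lands in node 7
```

On the fitted model, path.rpart(fit, nodes = c(12, 13, 7)) should print the same rule paths without any hand copying.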

blackish952 · June 22, 2018, 3:22pm

Max:
I know I am a pain in the butt.
But thank you.
I have learned a lot the past few days.
I appreciate the input!