rpart result is too small to see

syl9005 · April 10, 2020, 7:13am

I was following a tutorial that shows how a decision tree works:

path <- 'https://raw.githubusercontent.com/guru99-edu/R-Programming/master/titanic_data.csv'
titanic <- read.csv(path)
head(titanic)
shuffle_index <- sample(1:nrow(titanic))
head(shuffle_index)
titanic <- titanic[shuffle_index, ]
head(titanic)
library(dplyr)
clean_titanic <- titanic
clean_titanic <- select(clean_titanic, -home.dest, -cabin, -name, -x, -ticket)
clean_titanic <- mutate(clean_titanic,
                        pclass = factor(pclass, levels = c(1, 2, 3), labels = c('Upper', 'Middle', 'Lower')),
                        survived = factor(survived, levels = c(0, 1), labels = c('No', 'Yes')))
clean_titanic <- replace(clean_titanic, clean_titanic == "?", NA)
clean_titanic <- na.omit(clean_titanic)
glimpse(clean_titanic)

create_train_test <- function(clean_titanic, size = 0.8, train = TRUE) {
  # The first `size` share of the (already shuffled) rows forms the training set
  n_row <- nrow(clean_titanic)
  total_row <- floor(size * n_row)
  train_sample <- seq_len(total_row)
  if (train) {
    return(clean_titanic[train_sample, ])
  } else {
    return(clean_titanic[-train_sample, ])
  }
}
data_train <- create_train_test(clean_titanic, 0.8, train = TRUE)
dim(data_train)
data_test <- create_train_test(clean_titanic, 0.8, train = FALSE)
dim(data_test)

prop.table(table(data_train$survived))

library(rpart)
library(rpart.plot)
fit <- rpart(survived ~ ., data = data_train, method = 'class')

So far I have managed to create the decision tree, but I couldn't see the result. I mean I literally CAN'T SEE IT.

How can I shorten the names? (I'm not sure what those long labels are.) Or is there a problem with my code?

siddharthprabhu · April 10, 2020, 7:52am

Hi @syl9005, welcome to RStudio Community.

I think you're having trouble because the age and fare variables are of type factor whereas they should be numeric. Change your mutate() call as shown below and see if you get a better result.

mutate(clean_titanic,
       pclass = factor(pclass, levels = c(1, 2, 3), labels = c('Upper', 'Middle', 'Lower')),
       survived = factor(survived, levels = c(0, 1), labels = c('No', 'Yes')),
       # Convert via as.character(): as.numeric() on a factor returns its level codes
       age = as.numeric(as.character(age)),
       fare = as.numeric(as.character(fare)))
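
A caveat worth checking, sketched under the assumption that the columns were read in as factors (the read.csv() default before R 4.0): as.numeric() applied directly to a factor returns its internal level codes, while going through as.character() recovers the printed values. The vectors below (`age_chr`, `age_fct`) are made-up illustrations, not taken from the data set.

```r
# Toy values standing in for the age column, including the "?" placeholder
age_chr <- c("29", "0.9167", "?", "30")
age_fct <- factor(age_chr)

# Direct conversion returns the level codes (positions among the sorted levels)
codes <- as.numeric(age_fct)

# Going through character recovers the values; "?" becomes NA
values <- suppressWarnings(as.numeric(as.character(age_fct)))

print(codes)
print(values)
```

If `codes` and `values` differ like this on your own columns, the direct conversion is silently replacing ages and fares with level indices.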

syl9005 · April 10, 2020, 8:53am

Thanks for the help, but your solution gives an error.
I also checked with class(), and the columns were already numeric.

siddharthprabhu · April 10, 2020, 9:38am

I'm fairly certain that your age and fare variables are not numeric before mutate(). I'm able to generate the plot just fine after casting them to the proper data type.

Here is the whole reprex.

library(dplyr, warn.conflicts = FALSE)
library(rpart)
library(rpart.plot)

path <- 'https://raw.githubusercontent.com/guru99-edu/R-Programming/master/titanic_data.csv'
titanic <- read.csv(path, stringsAsFactors = FALSE)

titanic %>% 
  select(age, fare) %>% 
  glimpse()
#> Rows: 1,309
#> Columns: 2
#> $ age  <chr> "29", "0.9167", "2", "30", "25", "48", "63", "39", "53", "71",...
#> $ fare <chr> "211.3375", "151.55", "151.55", "151.55", "151.55", "26.55", "...

clean_titanic <- titanic %>% 
  select(-home.dest, -cabin, -name, -x, -ticket) %>% 
  mutate(pclass = factor(pclass, levels = c(1, 2, 3), labels = c('Upper', 'Middle', 'Lower')),
         survived = factor(survived, levels = c(0, 1), labels = c('No', 'Yes')),
         age = as.numeric(age),
         fare = as.numeric(fare))
#> Warning: NAs introduced by coercion

#> Warning: NAs introduced by coercion

clean_titanic %>% 
  select(age, fare) %>% 
  glimpse()
#> Rows: 1,309
#> Columns: 2
#> $ age  <dbl> 29.0000, 0.9167, 2.0000, 30.0000, 25.0000, 48.0000, 63.0000, 3...
#> $ fare <dbl> 211.3375, 151.5500, 151.5500, 151.5500, 151.5500, 26.5500, 77....

fit <- rpart(survived ~ ., data = clean_titanic, method = 'class')

rpart.plot(fit)

Created on 2020-04-10 by the reprex package (v0.3.0)

Note: I forgot to mention that I also added stringsAsFactors = FALSE to your read.csv() call. Even without this parameter, your age and fare variables would be read in as factors, not as numeric values, because some rows contain "?".
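
As an alternative sketch, not what the reprex above does: read.csv() accepts an na.strings argument, so the "?" placeholders can be turned into NA at import time and the columns should then be parsed as numeric from the start. The inline CSV (`csv_text`) is a made-up stand-in for the Titanic file so the example runs offline.

```r
# Made-up three-row sample mimicking the "?" placeholders in the real file
csv_text <- "pclass,survived,age,fare
1,1,29,211.3375
1,0,?,151.55
3,0,25,?"

# na.strings = "?" tells read.csv() to treat "?" as a missing value
sample_df <- read.csv(text = csv_text, na.strings = "?")

# Both columns are now numeric, with NA where the "?" used to be
str(sample_df)
```

With the real file, the same argument would presumably be added to the read.csv(path, ...) call, after which the as.numeric() conversions may no longer be needed.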

syl9005 · April 13, 2020, 1:11am

I really appreciate your help.
I removed everything and retried as you told me, and it worked. Thanks a lot!

P.S. What is %>% for?
I've seen it a lot but couldn't figure out what it does.

siddharthprabhu · April 13, 2020, 4:31am

%>% is known as the pipe operator. It passes the expression on its left-hand side as the first argument to the expression on its right-hand side. For example, these two pieces of code are equivalent.

library(dplyr, warn.conflicts = FALSE)

filter(mtcars, mpg > 30)
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
#> 2 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
#> 3 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
#> 4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

mtcars %>% filter(mpg > 30)
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
#> 2 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
#> 3 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
#> 4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

Created on 2020-04-13 by the reprex package (v0.3.0)

Now that may not seem like a big deal until you start doing multiple transformations.

library(dplyr, warn.conflicts = FALSE)

summarize(group_by(filter(mtcars, mpg > 30), gear), avg_disp = mean(disp))
#> # A tibble: 2 x 2
#>    gear avg_disp
#>   <dbl>    <dbl>
#> 1     4     75.2
#> 2     5     95.1

mtcars %>% 
  filter(mpg > 30) %>% 
  group_by(gear) %>% 
  summarize(avg_disp = mean(disp))
#> # A tibble: 2 x 2
#>    gear avg_disp
#>   <dbl>    <dbl>
#> 1     4     75.2
#> 2     5     95.1

Created on 2020-04-13 by the reprex package (v0.3.0)

Notice that the way piped code is written mimics the order in which we would expect those operations to happen. You could even read it as "take the mtcars data set, first filter(), then group_by() and then summarize()". The non-piped code takes more effort to understand because it's nested, so you need to read it from the inside out to see what's happening.

Nearly all tidyverse functions take data as their first argument, so they work very naturally with %>%, allowing you to write easy-to-understand code.
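
For functions that do not take data as their first argument, %>% also supports a `.` placeholder that marks where the left-hand side should go. A small sketch using lm(), whose first argument is a formula rather than a data frame (`piped_fit` and `plain_fit` are just illustrative names):

```r
library(dplyr, warn.conflicts = FALSE)

# lm() expects the formula first, so pass the data explicitly via `.`
piped_fit <- mtcars %>% lm(mpg ~ wt, data = .)

# Equivalent call without the pipe
plain_fit <- lm(mpg ~ wt, data = mtcars)

# Both fits should have the same coefficients
all.equal(coef(piped_fit), coef(plain_fit))
```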
