library(tidyverse)
library(tidymodels)
library(dplyr)
library(janitor)
library(patchwork)
library(rpart)
library(parsnip)
library(rpart.plot)
library(vip)

theme_set(theme_light())

# Load the data, clean the column names, and coerce every column to numeric
jumpNgps <- read.csv("JumpGpsWk-2.csv")
jumpNgps <- clean_names(jumpNgps)
jumpNgps <- lapply(jumpNgps, as.numeric)
jumpNgps <- data.frame(jumpNgps)

# Train/test split
set.seed(487)
jump_split <- initial_split(jumpNgps, prop = 3/4, strata = total_metabolic_power)
jump_train <- training(jump_split)
jump_test <- testing(jump_split)

# Pre-processing recipe
jump_recipe <- recipe(flight_time_contraction_time ~ ., data = jump_train) %>%
  step_normalize(all_predictors())

jump_recipe %>%
  prep() %>%
  bake(new_data = NULL)

# Model specification with all three hyperparameters marked for tuning
tree_model <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_params <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

# 5-fold cross-validation on the training set
set.seed(136)
cv_folds <- vfold_cv(jump_train, v = 5)

doParallel::registerDoParallel(cores = 3)

set.seed(5867)
model_fit <- tune_grid(
  tree_model,
  jump_recipe,
  resamples = cv_folds,
  grid = tree_params
)

collect_metrics(model_fit)
autoplot(model_fit)
select_best(model_fit, metric = "rmse")

# Finalize the model with the best parameters and fit on the training set
tree_final <- finalize_model(tree_model, select_best(model_fit, metric = "rmse"))
fit_train <- fit(tree_final, flight_time_contraction_time ~ ., jump_train)

fit_train %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

fit_train %>%
  vip(
    geom = "col",
    num_features = 20L,
    aesthetics = list(color = "black", fill = "palegreen", alpha = 0.5)
  )

# Evaluate on the test set
fit_test <- last_fit(tree_final, flight_time_contraction_time ~ ., jump_split)
fit_test
collect_metrics(fit_test)

fit_test %>%
  collect_predictions() %>%
  ggplot(aes(x = .pred, y = flight_time_contraction_time)) +
  geom_abline(intercept = 0, slope = 1, lty = 2, size = 1.2, color = "red") +
  geom_point(size = 3)
I'm following a TidyX podcast and Julia Silge's videos to learn the basics of machine learning.
Currently I am stuck with my decision tree picking a tree depth of 1.
I used this code on a previous data set and had no issues. I recycled the code and now get a strange set of problems. This data set has 1 variable I'm trying to predict and has already been processed in Tableau Prep so that it lines up with last week's work values, so there are no significant processing steps in the code. (To reiterate: each row holds this week's value alongside last week's work, so I'm asking how last week's work affected this week's value.)
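To illustrate the row layout I mean, here is a minimal sketch of the same kind of offset done in R instead of Tableau Prep. The data frame, the `athlete_id` and `week` columns, and all the values are made up for illustration; only `flight_time_contraction_time` is a column name from my real data.

```r
library(dplyr)

# Hypothetical toy data: two people, four weeks each (values are invented)
weekly <- data.frame(
  athlete_id = rep(c("a", "b"), each = 4),
  week = rep(1:4, times = 2),
  total_distance = c(10, 12, 11, 13, 20, 22, 21, 23),
  flight_time_contraction_time = c(0.70, 0.72, 0.71, 0.73, 0.80, 0.82, 0.81, 0.83)
)

# Put last week's work on the same row as this week's outcome.
# Changing n = 1 to n = 2 would give the two-week offset instead.
lagged <- weekly %>%
  group_by(athlete_id) %>%
  arrange(week, .by_group = TRUE) %>%
  mutate(total_distance_prev = lag(total_distance, n = 1)) %>%
  ungroup() %>%
  # The first week of each person has no previous week, so it drops out
  filter(!is.na(total_distance_prev)) %>%
  select(athlete_id, week, total_distance_prev, flight_time_contraction_time)

lagged
```

Each person loses one row per week of offset, which is why a larger offset leaves fewer records.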
First attempt - the initial dataset, offset by 1 week, returned a tree depth of 1. I then stratified on a Total Distance variable, which produced much better results and gave a tree depth of 4 and a min_n of 8.
Second attempt - same code and the exact same pre-processing script, except that the offset is now 2 weeks instead of 1 (which reduces the data from 904 records to 892). It now consistently returns a tree depth of 1 and a min_n of 2. I tried 4 other variables for stratification, including the variables with the highest variation and spread so that the quantile bins would cover more of the range. I then also added an individual ID# to the records (an ID for each person, because each person has roughly 80 records in this dataset) to stratify by, but every variant produces a tree depth of 1 after I clear and re-run the model.
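For comparison, stratifying by the ID is not the same as keeping each person's records together during resampling. A minimal sketch of the latter, using `group_vfold_cv()` from rsample, is below. I have not run this on my real data; the toy data frame and the `athlete_id` column name are assumptions for illustration.

```r
library(rsample)

# Hypothetical toy data: 10 people with 8 records each (values are random)
set.seed(1)
toy <- data.frame(
  athlete_id = rep(1:10, each = 8),
  x = rnorm(80),
  y = rnorm(80)
)

# Grouped 5-fold CV: all records for a given person land in the same fold
folds <- group_vfold_cv(toy, group = athlete_id, v = 5)

# Check the first split: no person appears on both sides
first_split <- folds$splits[[1]]
shared_ids <- intersect(
  analysis(first_split)$athlete_id,
  assessment(first_split)$athlete_id
)
length(shared_ids)
```

The resulting `folds` object could be passed to `tune_grid()` as `resamples` in place of the `vfold_cv()` folds.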
My final questions then are:
Is there something clearly incorrect in my code or coding logic?
Are there potentially not enough records for a machine learning model yet?
Should I be using a different type of model than a decision tree?
In tidymodels, can you manually set the tree_depth() value?
Any other suggestions for this newbie?
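To make the tree_depth() question concrete, this is roughly what I have in mind: passing fixed values to `decision_tree()` instead of `tune()`, or narrowing the range that the grid searches. It is only a sketch. The parameter values are arbitrary and not tuned, and `mtcars` stands in for my data.

```r
library(tidymodels)

# Fixed hyperparameters instead of tune(); values are arbitrary examples
tree_fixed <- decision_tree(
  cost_complexity = 0.001,
  tree_depth = 4,
  min_n = 8
) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# mtcars is a stand-in data set, used only so the sketch runs
fit_fixed <- fit(tree_fixed, mpg ~ ., data = mtcars)

# Alternative: keep tuning, but restrict the depths the grid may try
depth_grid <- grid_regular(
  cost_complexity(),
  tree_depth(range = c(2L, 8L)),
  min_n(),
  levels = 4
)

fit_fixed
depth_grid
```

Note that `tree_depth` is passed to rpart as a maximum depth, so the fitted tree can still end up shallower than the value given.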