Decision trees using odbc databases

I'm new to R and I've gone through some examples of Decision Trees taken from existing, build in data.
Now I would like to apply that to my data file taken from our SQL database.
My starting point is preparing a file:

abc <- odbcConnect("sqldatabase") <- sqlQuery(abc, "SELECT * FROM sqldatatable") <-[$CountryID == 15, ] <- subset(, InterviewDate > "2018-04-01" & InterviewDate < "2018-09-30")

I can display it without any problems:

$ CountryID      : int  15 15 15 15 15 15 15 15 15 15 ... $ InterviewDate  : POSIXct, format: "2018-04-25 08:12:00" "2018-04-26 13:05:00" "2018-04-04 17:28:00" "2018-04-10 12:12:00" ...
$ A2 : int 9 10 10 8 10 9 10 10 9 10 ...
$ B1 : int 10 10 8 7 10 8 9 10 8 10 ...
$ C1 : int 10 10 9 8 10 9 10 9 9 9 ...

so exists

Now, I would like to:

  1. Recode variable A2 (values from 1 to 10) into A2TB (values 9-10 as 1 and values 1-8 as 2)
  2. Use this recoded A2TB variable as decision tree target (to see proportions 1 to 2) with other variables such as B1 and C1

How can I do that? Do I need to created a data frame first? Examples I have gone through are based on following commands:

Example 1:

iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)

Example 2

High = ifelse(carseats$Sales<=8, "No", "Yes")
carseats = data.frame(carseats, High)
tree.carseats = tree(High~.-Sales, data=carseats)
text(tree.carseats, pretty = 0)

Now, I would like the above (or different commands) to create my own decision trees based on

Can you help?


1 Like

I am still hoping there are some specialists on this forum who could help...

I am not 100% certain I get your question right, and your code is not exactly reproducible.

But in general - when you have as a resource an external, linked table on a SQL server you can make it internal by calling dbplyr::collect(). This will make the result a regular data frame, usable by other R tools - such as decision trees (as implemented by rpart package, or others).

To see it in use see example taken from - it downloads a table of 182 240 tweets about the Czech prime minister, which takes a while - incidentally demonstrating the benefit of filtering the records at server side = before making the collect() call.


myDb <- dbConnect(dbDriver('PostgreSQL'),
                  host     = "",
                  port     = 5432,
                  dbname   = "dbase",
                  user     = "babisobot", # user babisobot má pouze select práva...
                  password = "babisobot") # ... a proto jeho heslo může být na netu

tweet_data <- tbl(myDb, "babisobot") %>%

dbDisconnect(myDb) # clean up & close the door :)

Thank you.
I have added your libraries and run my script:



abc <- odbcConnect("livestats") <- sqlQuery(abc, "SELECT * FROM LIVESTATS.HGSI.RAW_Sales_Flat")


byCountry <- split(,$CountryID)


but I can only see a weird error message:

> collect()
Error in UseMethod("collect") : 
  no applicable method for 'collect' applied to an object of class "NULL"

What is the reason? Maybe collect() should not be used in this case?

You called collect() with empty brackets outside of a pipe; it did not have any arguments and so it failed. In my example it was called with empty brackets from inside a pipe, which is different thing.

Without delving into details of the magrittr pipe operator try this instead: <- collect(

You should end up with a local data frame, which you can then pass on to party or rpart or whatever package for decision trees you prefer (I go by rpart, but I am not a zealot).

Hurray! It's workig!

# Create the input data frame.
input.dat <-[c(1:1000),]

# Give the chart file a name.
png(file = "decision_tree.png")

# Create the tree.
output.tree <- ctree(
  A2 ~ B1 + C1 + D1 + E1 + F1 + G1, 
  data = input.dat)

# Plot the tree.
plot(output.tree, type="simple")

Now I have last two questions. I used this template:

input.dat <-[c(1:1000),]

but how can I use the entire available range?

Also, I used variable A2 (int, responses from 1 to 10) but I would like to recode it into A2TB (values 9-10 as 1 and values 1-8 as 2). How can I do that?

I am glad it does :slight_smile:

As for the whole range: just remove the c(1:1000) from your code (input.dat <-, and let it rest there)

For the recoding the values you likely need dplyr::mutate() which goes a bit out out of scope of the original post.

You might find this cheatsheet helpful; pay special attention to case_when()

1 Like