Decision trees using odbc databases

Hi,
I'm new to R and I've gone through some examples of Decision Trees taken from existing, build in data.
Now I would like to apply that to my data file taken from our SQL database.
My starting point is preparing a file:

library(RODBC)
abc <- odbcConnect("sqldatabase")

my.data <- sqlQuery(abc, "SELECT * FROM sqldatatable")
Belgium.data <- my.data[my.data$CountryID == 15, ]
Belgium.CurrentHY.data <- subset(Belgium.data, InterviewDate > "2018-04-01" & InterviewDate < "2018-09-30")

I can display it without any problems:

str(Belgium.CurrentHY.data)
$ CountryID      : int  15 15 15 15 15 15 15 15 15 15 ... $ InterviewDate  : POSIXct, format: "2018-04-25 08:12:00" "2018-04-26 13:05:00" "2018-04-04 17:28:00" "2018-04-10 12:12:00" ...
$ A2 : int 9 10 10 8 10 9 10 10 9 10 ...
$ B1 : int 10 10 8 7 10 8 9 10 8 10 ...
$ C1 : int 10 10 9 8 10 9 10 9 9 9 ...

so Belgium.CurrentHY.data exists

Now, I would like to:

  1. Recode variable A2 (values from 1 to 10) into A2TB (values 9-10 as 1 and values 1-8 as 2)
  2. Use this recoded A2TB variable as decision tree target (to see proportions 1 to 2) with other variables such as B1 and C1

How can I do that? Do I need to created a data frame first? Examples I have gone through are based on following commands:

Example 1:


library("party")
str(iris)
iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
print(iris_ctree)
plot(iris_ctree)

Example 2

library(ISLR)
data(package="ISLR")
carseats<-Carseats
require(tree)
names(carseats)
hist(carseats$Sales)
High = ifelse(carseats$Sales<=8, "No", "Yes")
carseats = data.frame(carseats, High)
tree.carseats = tree(High~.-Sales, data=carseats)
summary(tree.carseats)
plot(tree.carseats)
text(tree.carseats, pretty = 0)

Now, I would like the above (or different commands) to create my own decision trees based on Belgium.CurrentHY.data.

Can you help?

Slavek

1 Like

I am still hoping there are some specialists on this forum who could help...

I am not 100% certain I get your question right, and your code is not exactly reproducible.

But in general - when you have as a resource an external, linked table on a SQL server you can make it internal by calling dbplyr::collect(). This will make the result a regular data frame, usable by other R tools - such as decision trees (as implemented by rpart package, or others).

To see it in use see example taken from https://github.com/jlacko/babisobot - it downloads a table of 182 240 tweets about the Czech prime minister, which takes a while - incidentally demonstrating the benefit of filtering the records at server side = before making the collect() call.

library(tidyverse)
library(DBI)
library(dbplyr)
library(RPostgreSQL)

myDb <- dbConnect(dbDriver('PostgreSQL'),
                  host     = "db.jla-data.net",
                  port     = 5432,
                  dbname   = "dbase",
                  user     = "babisobot", # user babisobot má pouze select práva...
                  password = "babisobot") # ... a proto jeho heslo může být na netu

tweet_data <- tbl(myDb, "babisobot") %>%
  collect()

dbDisconnect(myDb) # clean up & close the door :)
2 Likes

Thank you.
I have added your libraries and run my script:

library(RODBC)

library(tidyverse)
library(DBI)
library(dbplyr)
library(RPostgreSQL)

abc <- odbcConnect("livestats")

my.data <- sqlQuery(abc, "SELECT * FROM LIVESTATS.HGSI.RAW_Sales_Flat")

str(my.data)
str(my.data$InterviewDate)
summary(my.data$CountryID)

byCountry <- split(my.data,my.data$CountryID)

collect()

but I can only see a weird error message:

> collect()
Error in UseMethod("collect") : 
  no applicable method for 'collect' applied to an object of class "NULL"
> 

What is the reason? Maybe collect() should not be used in this case?

You called collect() with empty brackets outside of a pipe; it did not have any arguments and so it failed. In my example it was called with empty brackets from inside a pipe, which is different thing.

Without delving into details of the magrittr pipe operator try this instead:


my.local.data <- collect(my.data)

You should end up with a local data frame, which you can then pass on to party or rpart or whatever package for decision trees you prefer (I go by rpart, but I am not a zealot).

Hurray! It's workig!


# Create the input data frame.
input.dat <- Belgium.CurrentHY.data[c(1:1000),]

# Give the chart file a name.
png(file = "decision_tree.png")

# Create the tree.
output.tree <- ctree(
  A2 ~ B1 + C1 + D1 + E1 + F1 + G1, 
  data = input.dat)

# Plot the tree.
plot(output.tree, type="simple")

Now I have last two questions. I used this template:

input.dat <- Belgium.CurrentHY.data[c(1:1000),]

but how can I use the entire available range?

Also, I used variable A2 (int, responses from 1 to 10) but I would like to recode it into A2TB (values 9-10 as 1 and values 1-8 as 2). How can I do that?

I am glad it does :slight_smile:

As for the whole range: just remove the c(1:1000) from your code (input.dat <- Belgium.CurrentHY.data, and let it rest there)

For the recoding the values you likely need dplyr::mutate() which goes a bit out out of scope of the original post.

You might find this cheatsheet helpful; pay special attention to case_when()
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

1 Like