Hi,
I'm new to R and I've gone through some examples of decision trees based on existing, built-in data.
Now I would like to apply that to my data file taken from our SQL database.
My starting point is preparing a file:
I am not 100% certain I get your question right, and your code is not exactly reproducible.
But in general - when your source is an external, linked table on a SQL server, you can bring it into R by calling dplyr::collect() (dbplyr supplies the database back end for it). This makes the result a regular data frame, usable by other R tools - such as decision trees (as implemented by the rpart package, or others).
To see it in use, here is an example taken from https://github.com/jlacko/babisobot - it downloads a table of 182 240 tweets about the Czech prime minister, which takes a while - incidentally demonstrating the benefit of filtering the records on the server side, i.e. before making the collect() call.
library(tidyverse)
library(DBI)
library(dbplyr)
library(RPostgreSQL)
myDb <- dbConnect(dbDriver('PostgreSQL'),
                  host = "db.jla-data.net",
                  port = 5432,
                  dbname = "dbase",
                  user = "babisobot", # user babisobot has select rights only...
                  password = "babisobot") # ... so its password can be public
tweet_data <- tbl(myDb, "babisobot") %>%
  collect()
dbDisconnect(myDb) # clean up & close the door :)
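To illustrate the point about filtering on the server side before collect(), here is a minimal sketch. It does not use the tweet database; as an assumption for the sake of a self-contained example it copies the built-in mtcars data into an in-memory SQLite database (this needs the RSQLite package), and the names con, remote_tbl and small_local are made up for illustration.

```r
library(dplyr)
library(dbplyr)

# Assumption: an in-memory SQLite database stands in for the real SQL server
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

# A lazy reference to the remote table; no rows are downloaded yet
remote_tbl <- tbl(con, "mtcars")

# filter() and select() are translated to SQL and run in the database;
# only the reduced result is pulled into R by collect()
small_local <- remote_tbl %>%
  filter(cyl == 4) %>%
  select(mpg, cyl, wt) %>%
  collect()

DBI::dbDisconnect(con) # close the connection once the data is local
```

After this, small_local is an ordinary local data frame (a tibble) that holds only the rows and columns asked for.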
You called collect() with empty brackets outside of a pipe; it received no arguments and so it failed. In my example it was called with empty brackets from inside a pipe, which is a different thing: the pipe passes the object on its left-hand side to the function as its first argument.
Without delving into the details of the magrittr pipe operator, try this instead:
my.local.data <- collect(my.data)
You should end up with a local data frame, which you can then pass on to party or rpart or whatever package for decision trees you prefer (I go by rpart, but I am not a zealot).
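As a rough sketch of that last step, this is how a local data frame can be handed to rpart. The built-in iris data is used here purely as a stand-in for your collected data, so the formula and column names are assumptions you would replace with your own.

```r
library(rpart)

# Assumption: iris stands in for the data frame returned by collect()
fit <- rpart(Species ~ ., data = iris, method = "class")

# Text summary of the fitted splits
print(fit)

# Predicted class for each row of the training data
pred <- predict(fit, iris, type = "class")
```

method = "class" asks for a classification tree; for a numeric response you would use method = "anova" instead.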
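# ctree() comes from the party package (partykit offers one as well)
library(party)
# Create the input data frame.
input.dat <- Belgium.CurrentHY.data[c(1:1000),]
# Give the chart file a name.
png(file = "decision_tree.png")
# Create the tree.
output.tree <- ctree(
  A2 ~ B1 + C1 + D1 + E1 + F1 + G1,
  data = input.dat)
# Plot the tree.
plot(output.tree, type="simple")
# Close the graphics device so that the png file is actually written.
dev.off()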
Now I have two last questions. I used this template:
input.dat <- Belgium.CurrentHY.data[c(1:1000),]
but how can I use the entire available range?
Also, I used variable A2 (int, responses from 1 to 10) but I would like to recode it into A2TB (values 9-10 as 1 and values 1-8 as 2). How can I do that?
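A minimal sketch covering both questions, under stated assumptions: the data frame survey below is a made-up stand-in for Belgium.CurrentHY.data, with only the A2 column filled in. Using the whole range just means dropping the [c(1:1000),] row subset; the recode can be done with ifelse().

```r
# Hypothetical stand-in for Belgium.CurrentHY.data (20 rows, A2 from 1 to 10)
survey <- data.frame(A2 = rep(1:10, times = 2))

# Entire available range: assign the data frame without the row subset
input.dat <- survey

# Recode A2 into A2TB: values 9-10 become 1, values 1-8 become 2.
# Storing it as a factor makes tree functions treat it as a class label.
input.dat$A2TB <- factor(ifelse(input.dat$A2 >= 9, 1, 2))

# Cross-check the recode against the original values
table(input.dat$A2, input.dat$A2TB)
```

With A2TB in place, the formula would start with A2TB ~ instead of A2 ~. Note that ifelse() returns NA for rows where A2 is missing, so those rows stay NA in A2TB.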