In a nutshell, I need to run a document-term matrix (DTM) built from a Twitter dataset through an XGBoost classifier. I have completed the document-term matrix, but I am missing some key part of preparing the DTM and putting it in a format that the model will accept. I know that you have to convert the DTM back to a data frame, and then you have to create the "training" and "testing" partitions. Can someone put me on the right track as far as the code that I am missing?
Here is the code for the Natural Language Processing part:
    setwd('C:/rscripts/random_forest')
    dataset <- read.csv('tweets_all.csv', stringsAsFactors = FALSE)

    library(tm)

    # Build the corpus from the tweet text
    corpus <- iconv(dataset$text, to = "utf-8")
    corpus <- Corpus(VectorSource(corpus))
    inspect(corpus[1:5])

    # Clean the corpus (content_transformer() keeps tm_map from breaking the corpus)
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    cleanset <- tm_map(corpus, removeWords, stopwords('english'))
    #my_custom_stopwords <- c("Ã¢â¬\u009dpotus", "Ã\u009dÃ", "Ã\u009djoebiden", "Ã\u009dand", "Ã\u009dhillary", "Ã¢â¬Å¾Ã")
    #cleanset <- tm_map(corpus, removeWords, my_custom_stopwords)
    removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
    cleanset <- tm_map(cleanset, content_transformer(removeURL))
    cleanset <- tm_map(cleanset, stripWhitespace)
    cleanset <- tm_map(cleanset, removeWords, c('Ã\u009dhillary','Ã¢â¬Å¾Ã','Ã¢â¬Å¡Ã','just','are','all','they'))

    # Term-document matrix: terms are rows, documents are columns
    tdm <- TermDocumentMatrix(cleanset)
    tdm <- as.matrix(tdm)
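One thing to keep in mind: `TermDocumentMatrix` puts terms in rows and documents in columns, so the matrix has to be transposed before each tweet becomes a row of features. A minimal base-R sketch of that transpose-and-partition step, using a small dummy matrix in place of the real `tdm` (the 70/30 split ratio is an assumption, not something from the original post):

```r
# Dummy term-document matrix standing in for the real tdm:
# 3 terms (rows) x 5 documents (columns)
tdm <- matrix(c(1, 0, 2, 0, 1,
                0, 1, 0, 1, 0,
                1, 1, 1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("vote", "debate", "poll"), NULL))

dtm    <- t(tdm)                 # transpose: documents as rows, terms as columns
dtm_df <- as.data.frame(dtm)     # data-frame form, one row per tweet

set.seed(123)                                                  # reproducible split
train_idx <- sample(nrow(dtm_df), floor(0.7 * nrow(dtm_df)))   # 70/30 split (assumed)
train_set <- dtm_df[train_idx, ]
test_set  <- dtm_df[-train_idx, ]

dim(train_set)  # 3 documents x 3 term columns
dim(test_set)   # 2 documents x 3 term columns
```

With the real data, `dtm_df` would have one row per tweet and one column per term, which is the shape the modeling code expects.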
Here is the code that I found for the XGBoost classification model. It is currently written for a different dataset (the mushroom data), but I was going to recycle it for the document-term matrix from my text mining.
    library(caret)
    library(xgboost)
    # install.packages('e1071', dependencies = TRUE)  # run once if not already installed

    mushroom_data$cap.shape <- as.factor(mushroom_data$cap.shape)

    # One-hot encode the categorical predictors
    newvars <- dummyVars(~ cap.shape + cap.surface + cap.color + bruises + odor +
                           gill.attachment + gill.spacing + gill.size + gill.color +
                           stalk.shape + stalk.root + stalk.surface.above.ring +
                           stalk.surface.below.ring + stalk.color.above.ring +
                           stalk.color.below.ring + veil.color + ring.number +
                           ring.type + spore.print.color + population + habitat,
                         data = mushroom_data)
    newvars <- predict(newvars, mushroom_data)

    # Tune hyperparameters with repeated cross-validation
    cv.ctrl <- trainControl(method = "repeatedcv", repeats = 1, number = 4, allowParallel = TRUE)
    xgb.grid <- expand.grid(nrounds = 40, eta = c(0.5, 1), max_depth = c(7, 10),
                            gamma = c(0, 0.2), colsample_bytree = 1,
                            min_child_weight = c(0.1, 0.9))
    xgb_tune <- train(newvars, mushroom_data$class, method = "xgbTree",
                      trControl = cv.ctrl, tuneGrid = xgb.grid)
    pred <- predict(xgb_tune, newvars)

    # Recode the class labels as 0/1 and fit a final model
    mushroom_data$labels[mushroom_data$class == "e"] <- 1
    mushroom_data$labels[mushroom_data$class == "p"] <- 0
    mtrain <- xgb.DMatrix(data = newvars, label = as.matrix(mushroom_data$labels))
    result_model <- xgboost(data = mtrain, max_depth = 7, eta = 1, nthread = 4,
                            nrounds = 40, objective = "reg:logistic", verbose = 1)

    # Predict, threshold at 0.5 (>= so no value is left unclassified), and compare to the labels
    pred <- predict(result_model, mtrain)
    pred[pred < 0.5] <- 0
    pred[pred >= 0.5] <- 1
    joined <- data.frame(mushroom_data$labels, pred)
    joined_diff <- joined$mushroom_data.labels - joined$pred
    sum(joined_diff)
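For what it's worth, here is a hedged sketch of how the transposed DTM could replace the dummy-coded mushroom predictors in that pipeline. It assumes the tweet dataset has some 0/1 outcome column (called `label` here, which is an assumption; the real column name depends on `tweets_all.csv`), that `tdm` is the matrix built above, and a 70/30 split. `xgb.train` with `binary:logistic` is used for the 0/1 outcome; this is a sketch, not the definitive adaptation:

```r
library(xgboost)

# Documents must be rows and terms columns, so transpose the term-document matrix
dtm <- t(tdm)

# Assumed outcome column; replace 'label' with the real 0/1 target in tweets_all.csv
y <- dataset$label

# Train/test partition (70/30 split is an assumption)
set.seed(123)
train_idx <- sample(nrow(dtm), floor(0.7 * nrow(dtm)))

dtrain <- xgb.DMatrix(data = dtm[train_idx, ], label = y[train_idx])
dtest  <- xgb.DMatrix(data = dtm[-train_idx, ], label = y[-train_idx])

model <- xgb.train(params = list(objective = "binary:logistic",
                                 max_depth = 7, eta = 1, nthread = 4),
                   data = dtrain, nrounds = 40, verbose = 0)

# Threshold the predicted probabilities at 0.5 and check test-set accuracy
pred <- as.numeric(predict(model, dtest) >= 0.5)
mean(pred == y[-train_idx])
```

The key point is that `xgb.DMatrix` accepts the numeric DTM directly, so no `dummyVars` step is needed; the counts are already numeric features.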
Can someone please show me how to feed my DTM into the XGBoost code? I can provide the dataset as well.