Hello,
I have built a Naive Bayes text classification model using the SMS ham/spam dataset. Now I want to classify a new text as either spam or ham using the predict() function. But when I run the code, I get the same result every time, even though the model achieved an accuracy of up to 97%. How can I classify new text? My code is shown below; I replicated an example I saw online:
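## Packages the script appears to rely on (assumed; not shown in the original example)
library(tm)        # Corpus, tm_map, DocumentTermMatrix, findFreqTerms
library(SnowballC) # needed by stemDocument
library(magrittr)  # %>%
library(e1071)     # naiveBayes
library(gmodels)   # CrossTable
library(caret)     # confusionMatrix
## Rename columns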
colnames(sms)[c(1:2)] <- c("Type","Text")
## Delete unwanted columns
sms <- sms[,-c(3:5)]
## Convert type to a factor variable
sms$Type <- as.factor(sms$Type)
str(sms)
table(sms$Type)
## Transform text data
sms_corpus <- Corpus(VectorSource(sms$Text))
## Inspect corpus
inspect(sms_corpus[1:3])
as.character(sms_corpus[[3]])
## Clean corpus
mystopwords <- readLines("stopwords.txt")
mystopwords
sms_corpus <- sms_corpus %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(content_transformer(tolower)) %>% # wrap base tolower so the result stays a valid corpus
  tm_map(stripWhitespace) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removeWords, mystopwords) %>%
  tm_map(stemDocument) # Stem document
sms_dtm <- DocumentTermMatrix(sms_corpus) #Not Term Document Matrix
## Split data into train and test set
smsTrain <- sms_dtm[1:4180,]
smsTest <- sms_dtm[4181:5559,]
## Save vector labeling rows
smsTrain_labels <- sms[1:4180,]$Type
smsTest_labels <- sms[4181:5559,]$Type
prop.table(table(smsTrain_labels))
prop.table(table(smsTest_labels))
## Remove words from the matrix that appear less than 5 times
sms_freq_words <- findFreqTerms(smsTrain,5)
str(sms_freq_words)
## Limit matrix to only include words in the frequency vector
smsTrain_freq <- smsTrain[,sms_freq_words]
smsTest_freq <- smsTest[,sms_freq_words]
## Convert matrix to "yes" and "no" categorical variable
convert <- function(x) {
  result <- ifelse(x > 0, "Yes", "No")
  return(result)
}
## Apply to data
sms_train <- apply(smsTrain_freq,2,convert)
sms_test <- apply(smsTest_freq,2,convert)
View(sms_train)
View(sms_test)
## Build model
set.seed(1234)
sms_classifier <- naiveBayes(sms_train,smsTrain_labels)
## Predict with model
sms_pred <- predict(sms_classifier,sms_test)
## Evaluate prediction
CrossTable(sms_pred,smsTest_labels,prop.chisq = FALSE,
prop.t = FALSE,dnn = c("Predicted","Actual"))
confusionMatrix(table(sms_pred,smsTest_labels))
However, when I try to predict on new data, I always get "ham", even when the text is obviously spam:
newdf <- data.frame(
  Text = c("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.")
)
newpred <- predict(sms_classifier,newdf)
Do I need to preprocess the new text as well before the model can correctly classify it? Thanks in anticipation of your advice.
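On the preprocessing question: the classifier was trained on one Yes/No column per term in sms_freq_words, so a data frame with a single Text column contains none of the columns it learned from. The sketch below shows one way new text might be turned into matching features. It is only an illustration under assumptions: text_to_features, toy_vocab and clean_text are hypothetical names, the text is assumed to have already gone through the same cleaning steps as the training corpus (lowercasing, stop-word removal, stemming), and the factor levels c("No", "Yes") are assumed to match what the model saw during training.

```r
# Hypothetical helper (not part of the original script): builds one Yes/No
# factor column per training term for each already-cleaned text.
text_to_features <- function(texts, vocab) {
  # Split each cleaned text into its terms on whitespace
  tokens <- strsplit(trimws(texts), "[[:space:]]+")
  cols <- lapply(vocab, function(term) {
    # TRUE where the term occurs in the corresponding text
    has_term <- vapply(tokens, function(tok) term %in% tok, logical(1))
    # Fixed levels so that even a single row encodes "No"/"Yes" consistently
    factor(ifelse(has_term, "Yes", "No"), levels = c("No", "Yes"))
  })
  names(cols) <- vocab
  # check.names = FALSE keeps the column names identical to the terms
  data.frame(cols, check.names = FALSE)
}

# Toy stand-in for sms_freq_words (assumed stemmed training terms)
toy_vocab <- c("call", "claim", "prize", "meet")
# Assumed output of the cleaning pipeline for two new messages
clean_text <- c("winner valu network custom select prize reward claim call",
                "meet later")

new_features <- text_to_features(clean_text, toy_vocab)
print(new_features)
```

With the real model, the call would presumably be predict(sms_classifier, text_to_features(clean_text, sms_freq_words)), where clean_text comes from running the new message through the same tm_map steps as the training corpus.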