Using a Naive Bayes Classifier to Classify New Text


I have built a Naive Bayes text classification model using the SMS ham/spam dataset. Now I want to classify a new text as either spam or ham using the predict() function. But when I run the code, I get the same result every time, even though the model achieved an accuracy of up to 97% on the test set. How can I classify new text? My code is shown below; I replicated an example I saw online:

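## Load required packages
library(tm)        # Corpus, tm_map, DocumentTermMatrix, findFreqTerms
library(SnowballC) # stemming backend used by stemDocument
library(e1071)     # naiveBayes
library(gmodels)   # CrossTable
library(magrittr)  # %>% pipe

## Rename columns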
colnames(sms)[c(1:2)] <- c("Type","Text")

## Delete unwanted columns
sms <- sms[,-c(3:5)]

## Convert type to a factor variable
sms$Type <- as.factor(sms$Type)


## Transform text data
sms_corpus <- Corpus(VectorSource(sms$Text))

## Inspect corpus

## Clean corpus
mystopwords <- readLines("stopwords.txt")

sms_corpus <- sms_corpus %>% 
  tm_map(removeNumbers) %>% 
  tm_map(removePunctuation) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(stripWhitespace) %>% 
  tm_map(removeWords,stopwords("english")) %>% 
  tm_map(removeWords,mystopwords) %>% 
  tm_map(stemDocument) # Stem document

sms_dtm <- DocumentTermMatrix(sms_corpus) #Not Term Document Matrix

## Split data into train and test set
smsTrain <- sms_dtm[1:4180,]
smsTest <- sms_dtm[4181:5559,]

## Save vector labeling rows
smsTrain_labels <- sms[1:4180,]$Type
smsTest_labels <- sms[4181:5559,]$Type


## Find words that appear at least 5 times in the training set
sms_freq_words <- findFreqTerms(smsTrain,5)

## Limit matrix to only include words in the frequency vector
smsTrain_freq <- smsTrain[,sms_freq_words]
smsTest_freq <- smsTest[,sms_freq_words]

## Convert matrix to "yes" and "no" categorical variable
convert <- function(x){
  result <- ifelse(x > 0,"Yes","No")
  return(result)
}

## Apply to data
sms_train <- apply(smsTrain_freq,2,convert)
sms_test <- apply(smsTest_freq,2,convert)


## Build model
sms_classifier <- naiveBayes(sms_train,smsTrain_labels)

## Predict with model
sms_pred <- predict(sms_classifier,sms_test)

## Evaluate prediction
CrossTable(sms_pred,smsTest_labels,prop.chisq = FALSE,
           prop.t = FALSE,dnn = c("Predicted","Actual"))


Now when I try to predict with new data, I get the same outcome "ham", even when the text is obviously "spam".

newdf <- data.frame(
  Text = c("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."),
  stringsAsFactors = FALSE
)

newpred <- predict(sms_classifier,newdf)

Do I need to preprocess the new text as well before the model can correctly classify it? Thanks in anticipation of your advice.
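
For reference, here is a minimal sketch of what preprocessing the new text in the same way might look like. It assumes the classifier expects one "Yes"/"No" column per training term, as in the code above. The helper name text_to_features and its arguments are hypothetical and not part of tm or e1071. It repeats the cleaning steps from training, lines the result up against the training vocabulary, and returns factors with both levels fixed so that a single message cannot lose the "No"/"Yes" coding.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package installed

## Hypothetical helper (an illustration, not part of any package):
## turn raw text into Yes/No features over a given set of training terms.
text_to_features <- function(text, terms, extra_stopwords = character(0)) {
  ## Same cleaning steps, in the same order, as used for the training corpus
  corpus <- VCorpus(VectorSource(text))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  if (length(extra_stopwords) > 0) {
    corpus <- tm_map(corpus, removeWords, extra_stopwords)
  }
  corpus <- tm_map(corpus, stemDocument)

  ## Term counts for the new text only
  counts <- as.matrix(DocumentTermMatrix(corpus))

  ## One column per training term; terms absent from the new text stay at 0
  full <- matrix(0, nrow = nrow(counts), ncol = length(terms),
                 dimnames = list(NULL, terms))
  shared <- intersect(colnames(counts), terms)
  if (length(shared) > 0) {
    full[, shared] <- counts[, shared, drop = FALSE]
  }

  ## Convert counts to factors with both levels always present
  features <- as.data.frame(full)
  features[] <- lapply(features, function(x) {
    factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))
  })
  features
}
```

With the objects from the code above, the call would presumably be predict(sms_classifier, text_to_features(as.character(newdf$Text), sms_freq_words, mystopwords)). The point of the design is that the new data must carry the same column names as the training features. A data frame with only a Text column shares no features with the model, which is likely why the prediction never changes.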

