Using a Naive Bayes Classifier to Classify New Text


I have built a Naive Bayes text classification model using the SMS ham/spam dataset. Now I want to classify a new text as either spam or ham with the predict() function. But when I run the code, I get the same result every time, even though the model reached an accuracy of up to 97%. How can I classify new text? My code is shown below; I replicated an example I saw online:

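## Load the packages used below
library(tm)        # Corpus, tm_map, DocumentTermMatrix, findFreqTerms
library(SnowballC) # needed by stemDocument
library(dplyr)     # %>% pipe
library(e1071)     # naiveBayes
library(gmodels)   # CrossTable

## Rename columns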
colnames(sms)[c(1:2)] <- c("Type","Text")

## Delete unwanted columns
sms <- sms[,-c(3:5)]

## Convert type to a factor variable
sms$Type <- as.factor(sms$Type)


## Transform text data
sms_corpus <- Corpus(VectorSource(sms$Text))

## Inspect corpus

## Clean corpus
mystopwords <- readLines("stopwords.txt")

sms_corpus <- sms_corpus %>% 
  tm_map(removeNumbers) %>% 
  tm_map(removePunctuation) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(stripWhitespace) %>% 
  tm_map(removeWords,stopwords("english")) %>% 
  tm_map(removeWords,mystopwords) %>% 
  tm_map(stemDocument) # Stem document

sms_dtm <- DocumentTermMatrix(sms_corpus) #Not Term Document Matrix

## Split data into train and test set
smsTrain <- sms_dtm[1:4180,]
smsTest <- sms_dtm[4181:5559,]

## Save vector labeling rows
smsTrain_labels <- sms[1:4180,]$Type
smsTest_labels <- sms[4181:5559,]$Type


## Remove words from the matrix that appear less than 5 times
sms_freq_words <- findFreqTerms(smsTrain,5)

## Limit matrix to only include words in the frequency vector
smsTrain_freq <- smsTrain[,sms_freq_words]
smsTest_freq <- smsTest[,sms_freq_words]

## Convert matrix to "yes" and "no" categorical variable
convert <- function(x){
  ifelse(x > 0,"Yes","No")
}

## Apply to data
sms_train <- apply(smsTrain_freq,2,convert)
sms_test <- apply(smsTest_freq,2,convert)


## Build model
sms_classifier <- naiveBayes(sms_train,smsTrain_labels)

## Predict with model
sms_pred <- predict(sms_classifier,sms_test)

## Evaluate prediction
CrossTable(sms_pred,smsTest_labels,prop.chisq = FALSE,
           prop.t = FALSE,dnn = c("Predicted","Actual"))


Now when I try to predict with new data, I get the same outcome "ham", even when the text is obviously "spam".

newdf <- data.frame(
  Text = c("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.")
)

newpred <- predict(sms_classifier,newdf)

Do I need to preprocess the new text as well before the model can correctly classify it? Thanks in anticipation of your advice.
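
For what it's worth, here is a minimal sketch of what preprocessing the new text could look like. The idea is that the model was trained on "Yes"/"No" word columns, not on a raw Text column, so new text has to be turned into those same columns first; a data frame holding only Text gives predict() nothing it can match against. Everything here is an assumption made for illustration: the six training messages are toy data (not the SMS dataset), the new message is abridged, clean_corpus() and to_yes_no() are hypothetical helper names, stemming and the custom stopword file are left out to keep the dependencies small, and laplace = 1 is my addition. In the real pipeline the new text should go through exactly the same cleaning steps as the training text.

```r
library(tm)     # VCorpus, tm_map, DocumentTermMatrix
library(e1071)  # naiveBayes

# Hypothetical helper: one cleaning routine shared by training and new text
clean_corpus <- function(text) {
  corpus <- VCorpus(VectorSource(text))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  tm_map(corpus, removeWords, stopwords("english"))
}

# Hypothetical helper: turn counts into factors with FIXED levels, so a
# single new row is still coded as "No"/"Yes" the same way as in training
to_yes_no <- function(dtm) {
  m <- as.matrix(dtm)
  cols <- lapply(seq_len(ncol(m)), function(j)
    factor(ifelse(m[, j] > 0, "Yes", "No"), levels = c("No", "Yes")))
  names(cols) <- colnames(m)
  as.data.frame(cols, check.names = FALSE)
}

# Toy training data (assumption: stands in for the real SMS data)
train_text <- c(
  "WINNER claim your free prize now",
  "free prize call now to claim reward",
  "urgent claim cash prize call today",
  "are we meeting for lunch today",
  "see you at home tonight",
  "can you call me after the meeting"
)
train_labels <- factor(c("spam", "spam", "spam", "ham", "ham", "ham"))

train_dtm   <- DocumentTermMatrix(clean_corpus(train_text))
train_terms <- Terms(train_dtm)           # vocabulary the model knows
train_x     <- to_yes_no(train_dtm)

clf <- naiveBayes(train_x, train_labels, laplace = 1)

# New text: same cleaning, and the DTM is restricted to the training
# vocabulary so its columns line up with the model's tables
new_text <- c(
  "WINNER!! As a valued network customer you have been selected to receive a prize reward! To claim call 09061701461.",
  "are you coming home for lunch"
)
new_dtm <- DocumentTermMatrix(clean_corpus(new_text),
                              control = list(dictionary = train_terms))
new_x   <- to_yes_no(new_dtm)[, colnames(train_x), drop = FALSE]

new_pred <- predict(clf, new_x)
print(new_pred)
```

In the code from the question, the equivalent would presumably be to pass sms_freq_words as the dictionary, so the new matrix has the same columns as smsTrain_freq, and then apply the same "Yes"/"No" conversion before calling predict().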
