Using a Naive Bayes Classifier to Classify New Text

iFeanyi · August 17, 2021, 2:57pm

Hello,

I have built a Naive Bayes text classification model using the SMS ham/spam dataset. Now I want to classify a new text as either spam or ham with the predict() function. But when I run the code, I get the same result every time, even though the model reaches an accuracy of up to 97% on the test set. How can I classify new text? My code is shown below; I replicated an example I saw online:

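## Load packages (assumed from the functions used below; the original post does not show this step)
library(tidyverse) # %>% and view()
library(tm)        # Corpus, tm_map, DocumentTermMatrix (stemDocument also needs SnowballC installed)
library(e1071)     # naiveBayes
library(gmodels)   # CrossTable
library(caret)     # confusionMatrix

## Rename columns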
colnames(sms)[c(1:2)] <- c("Type","Text")

## Delete unwanted columns
sms <- sms[,-c(3:5)]

## Convert type to a factor variable
sms$Type <- as.factor(sms$Type)

str(sms)
table(sms$Type)

## Transform text data
sms_corpus <- Corpus(VectorSource(sms$Text))

## Inspect corpus
inspect(sms_corpus[1:3])
as.character(sms_corpus[[3]])

## Clean corpus
mystopwords <- readLines("stopwords.txt")
mystopwords

sms_corpus <- sms_corpus %>% 
  tm_map(removeNumbers) %>% 
  tm_map(removePunctuation) %>% 
  tm_map(content_transformer(tolower)) %>% 
  tm_map(stripWhitespace) %>% 
  tm_map(removeWords,stopwords("english")) %>% 
  tm_map(removeWords,mystopwords) %>% 
  tm_map(stemDocument) # Stem document

sms_dtm <- DocumentTermMatrix(sms_corpus) #Not Term Document Matrix

## Split data into train and test set
smsTrain <- sms_dtm[1:4180,]
smsTest <- sms_dtm[4181:5559,]

## Save vector labeling rows
smsTrain_labels <- sms[1:4180,]$Type
smsTest_labels <- sms[4181:5559,]$Type

prop.table(table(smsTrain_labels))
prop.table(table(smsTest_labels))

## Remove words from the matrix that appear less than 5 times
sms_freq_words <- findFreqTerms(smsTrain,5)
str(sms_freq_words)

## Limit matrix to only include words in the frequency vector
smsTrain_freq <- smsTrain[,sms_freq_words]
smsTest_freq <- smsTest[,sms_freq_words]

## Convert matrix to "yes" and "no" categorical variable
convert <- function(x){
  result <- ifelse(x > 0,"Yes","No")
  return(result)
}

## Apply to data
sms_train <- apply(smsTrain_freq,2,convert)
sms_test <- apply(smsTest_freq,2,convert)

View(sms_train)
View(sms_test)

## Build model
set.seed(1234)
sms_classifier <- naiveBayes(sms_train,smsTrain_labels)

## Predict with model
sms_pred <- predict(sms_classifier,sms_test)

## Evaluate prediction
CrossTable(sms_pred,smsTest_labels,prop.chisq = FALSE,
           prop.t = FALSE,dnn = c("Predicted","Actual"))

confusionMatrix(table(sms_pred,smsTest_labels))

Now, when I try to predict with new data, I always get the same outcome, "ham", even when the text is obviously spam.

newdf <- data.frame(
  Text = c("WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.")
)

newpred <- predict(sms_classifier,newdf)

Do I need to preprocess the new text as well before the model can correctly classify it? Thanks in anticipation of your advice.
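
A minimal sketch of what preprocessing the new text could look like, assuming the same tm cleaning steps as in the code above. As far as I can tell, predict() for a naiveBayes model matches the columns of the new data to the training features by name and skips any feature it cannot find, so a data frame with a single raw Text column leaves only the class prior, which favours "ham". The helper name prepare_new_text and its arguments are hypothetical and not part of tm or e1071; vocab stands for sms_freq_words and extra_stopwords for mystopwords. The factor levels are fixed to "No" and "Yes" so that a single message is coded the same way as the training data.

```r
library(tm) # stemDocument() also needs the SnowballC package installed

# Hypothetical helper (not part of tm or e1071): turn raw text into the
# Yes/No feature columns the classifier was trained on.
#   text            character vector of new messages
#   vocab           training vocabulary (sms_freq_words in the post)
#   extra_stopwords custom stop words (mystopwords in the post)
prepare_new_text <- function(text, vocab, extra_stopwords = character(0)) {
  # Same cleaning steps, in the same order, as for the training corpus
  corpus <- Corpus(VectorSource(text))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  if (length(extra_stopwords) > 0) {
    corpus <- tm_map(corpus, removeWords, extra_stopwords)
  }
  corpus <- tm_map(corpus, stemDocument)

  counts <- as.matrix(DocumentTermMatrix(corpus))

  # One column per training term; terms missing from the new text stay FALSE,
  # and words the model never saw are dropped
  present <- matrix(FALSE, nrow = nrow(counts), ncol = length(vocab),
                    dimnames = list(NULL, vocab))
  shared <- intersect(colnames(counts), vocab)
  if (length(shared) > 0) {
    present[, shared] <- counts[, shared, drop = FALSE] > 0
  }

  # Explicit factor levels so "No"/"Yes" get the same codes as in training,
  # even when a column contains only one of the two values
  cols <- lapply(vocab, function(term) {
    factor(ifelse(present[, term], "Yes", "No"), levels = c("No", "Yes"))
  })
  names(cols) <- vocab
  data.frame(cols, check.names = FALSE)
}

# Toy call with a made-up three-word vocabulary, just to show the output shape
toy_features <- prepare_new_text("WINNER claim your prize, call now",
                                 vocab = c("claim", "call", "meet"))
str(toy_features)
```

With the objects from the code above, the call would then be something like newpred <- predict(sms_classifier, prepare_new_text(newdf$Text, sms_freq_words, mystopwords)). This is untested against the actual dataset, so treat it as a starting point.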
