Naive Bayes exercise not working as expected

Hi,

I'm playing around with a dummy dataset containing locations along with time-of-day and day-type information. I am reverse engineering an exercise from a website, but I cannot reproduce the expected results on my local machine. I keep getting the following warnings and error:

Warning: predict.naive_bayes(): only 1 feature(s) out of 2 defined in the naive_bayes object "locmodel" are used for prediction.
Warning: predict.naive_bayes(): more features in the newdata are provided as there are probability tables in the object. Calculation is performed based on features to be found in the tables.
Error: predict.naive_bayes():
1 feature is discrete, and compared to the corresponding probability table it misses some levels or has more levels.
Other possibility: there is type mismatch between training data and newdata (for instance, some variable should be numeric but is character/factor).

I did not know the best way to share the data, so I uploaded it to Kaggle. I wanted the code to download the dataset directly from Kaggle, but I was unsure how to do that, so I'll just share the link in the hope that you can download it.

Locations_Dummy_NaiveBayes | Kaggle

library(naivebayes)
library(tidyverse)

locations <- read_csv("locations.csv")

# The exercise relies on two objects that were preloaded in the
# environment, so I had to reverse engineer both the exercise
# and these objects.
# The objects in question are:
# weekend_evening & weekend_afternoon

# To determine what these objects were, I inspected their classes in the
# online R console and printed them to see what came up.
# Below is the online R console output:

# > weekend_afternoon
#    daytype  hourtype location
# 85 weekend afternoon     home

# > class(weekend_afternoon)
# [1] "data.frame"

# > weekend_evening
#    daytype hourtype location
# 91 weekend  evening     home

# > class(weekend_evening)
# [1] "data.frame"

# Based on the console output I deduced that these were "simple" data frames, 
# which I could quickly build as seen below:

weekday_afternoon <- tibble(
  datatype = "weekday",
  hourtype = "aternoon",
  location = "office"
) %>% mutate(datatype = as.factor(datatype),
             hourtype = as.factor(hourtype),
             location = as.factor(location))

weekday_evening <- tibble(
  datatype = "weekday",
  hourtype = "evening",
  location = "home"
) %>% mutate(datatype = as.factor(datatype),
             hourtype = as.factor(hourtype),
             location = as.factor(location))


# Build a NB model of location
locmodel <- naive_bayes(location ~ daytype + hourtype, data = locations)

# Predict my location on a weekday afternoon
predict(locmodel, weekday_afternoon)

# Predict my location on a weekday evening
predict(locmodel, weekday_evening)

I believe I have successfully recreated the exercise's setup; nonetheless, I keep getting the error described above.
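
One thing worth checking here, offered as a hedged observation rather than a confirmed diagnosis: the first warning says only 1 of the 2 model features is used, and the hand-built data frames name their first column datatype while the model formula uses daytype. A minimal base-R sketch of that check follows; the names model_features, newdata, missing_features and extra_columns are illustrative and not part of the exercise.

```r
# Feature names taken from the model formula: location ~ daytype + hourtype
model_features <- c("daytype", "hourtype")

# Same column names and values as the hand-built weekday_afternoon data frame
newdata <- data.frame(
  datatype = "weekday",
  hourtype = "aternoon",
  location = "office"
)

# Features the model expects but newdata does not provide
missing_features <- setdiff(model_features, names(newdata))
print(missing_features)

# Columns in newdata that are neither a model feature nor the response
extra_columns <- setdiff(names(newdata), c(model_features, "location"))
print(extra_columns)
```

The value "aternoon" also looks like a misspelling of "afternoon", so it would presumably not match any hourtype value seen during training.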

What I find intriguing is that when weekend_afternoon and weekend_evening are printed, each appears to hold a single observation with 3 variables, yet their str() output (shown below) reports factors with more levels than that one observation contains. How can a data frame with a single observation have more factor levels than observed values? Maybe, for this to work, I need to set these factor levels on my data frames as well; if so, how? I thought factor levels were derived from the observations present for each variable. Can I specify all the factor levels regardless of whether a matching observation exists in the data frame?
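
On the last question: in base R, factor() accepts a levels argument, so a factor can declare levels that never occur among its values. A minimal sketch, using only the two daytype levels reported by str() on the exercise objects; the names daytype_levels and one_row are illustrative.

```r
# Levels for daytype as reported by str() on the exercise objects
daytype_levels <- c("weekday", "weekend")

# A single observation whose factor still declares both levels
one_row <- data.frame(
  daytype = factor("weekend", levels = daytype_levels)
)

str(one_row)
print(levels(one_row$daytype))   # levels declared on the factor
print(nlevels(one_row$daytype))  # number of declared levels
```

In the real script the levels could presumably be taken from the training data instead of being typed by hand, for example unique(locations$hourtype) if that column is read as character. Whether that alone satisfies predict.naive_bayes() is an assumption I have not verified.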

# Don't run this part since this output comes from the online console.
str(weekend_evening)
'data.frame':	1 obs. of  3 variables:
 $ daytype : Factor w/ 2 levels "weekday","weekend": 2
 $ hourtype: Factor w/ 4 levels "afternoon","evening",..: 2
 $ location: Factor w/ 7 levels "appointment",..: 3

# Calling weekend_evening outputs only this:
> weekend_evening
   daytype hourtype location
91 weekend  evening     home

str(weekend_afternoon)
'data.frame':	1 obs. of  3 variables:
 $ daytype : Factor w/ 2 levels "weekday","weekend": 2
 $ hourtype: Factor w/ 4 levels "afternoon","evening",..: 1
 $ location: Factor w/ 7 levels "appointment",..: 3

# Calling weekend_afternoon outputs only this:
> weekend_afternoon
   daytype  hourtype location
85 weekend afternoon     home

How can weekend_evening and weekend_afternoon be data frames with a single observation and still have many factor levels? I believe this might be the issue; otherwise, I have no clue why I get this error.
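
A possible explanation, stated as an assumption since the exercise's setup code is not visible: the row names 85 and 91 suggest each object is a single row taken from a larger data frame, and subsetting a data frame in base R keeps each factor's full set of levels. A sketch with made-up toy data (toy, one_obs and trimmed are hypothetical names, and the values are not the exercise's dataset):

```r
# Toy data, invented for illustration only
toy <- data.frame(
  daytype  = factor(c("weekday", "weekday", "weekend")),
  location = factor(c("office", "home", "home"))
)

# Take a single row; the original row name and all factor levels are kept
one_obs <- toy[3, ]
print(one_obs)
str(one_obs)

# droplevels() discards the levels that no longer occur in the data
trimmed <- droplevels(one_obs)
print(levels(one_obs$daytype))
print(levels(trimmed$daytype))
```

This would account for a one-row data frame printing a single observation while str() still lists every level from the source data.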

Thanks for your time.
