I'm having difficulty putting an idea into practice using classifiers. I am working with variables related to an infectious disease: I estimate the probability of the disease occurring in a given location and would like to improve my current accuracy, precision, and F1 results. I obtained data for this disease from one city, where the cases in each neighborhood were reported monthly from 2000 to 2003. I also use climatic variables (vegetation index, precipitation, and temperature). My dataset has the format (Neighborhood, Month/Year, Cases, EVI, Precipitation, Temperature). I converted the variable Cases into a categorical variable, where “Yes” marks the neighborhoods that had at least one reported case and “No” marks those with no reported case.
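A minimal sketch of that conversion, assuming the raw counts sit in a numeric vector (the toy values below are invented; `CasesBin2` is the name used in the code further down). Setting the factor levels explicitly keeps both classes present even if one of them happens to be absent from a subset.

```r
# Toy monthly case counts (invented for illustration).
cases <- c(0, 3, 0, 1, 0)

# "Yes" if at least one case was reported, "No" otherwise.
# Explicit levels give valid class names for caret's classProbs = TRUE.
CasesBin2 <- factor(ifelse(cases > 0, "Yes", "No"), levels = c("No", "Yes"))

# Inspect the class balance before training.
table(CasesBin2)
```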
In my R implementation, I apply five classifiers (Random Forest, LDA, Decision Tree, Bayesian generalized linear model, and Naive Bayes). My dependent variable is Cases and the others are independent variables. However, I get very low precision, recall, and F1 results (below 30%).
Here is the relevant part of the R code:
library(caret)  # createDataPartition, trainControl, train, confusionMatrix
data <- dplyr::select(original_data, EVI, Precip, Temperature, Humidity, CasesBin2)
set.seed(123)
train <- createDataPartition(data$CasesBin2,
                             p = 0.85,   # proportion of data going to training
                             times = 1,
                             list = FALSE)
train.orig <- data[ train, ]
test <- data[-train, ]
#A. Global options used in all trained models
ctrl <- trainControl(method = "cv",
                     number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
# Naive Bayes: original data
nb_orig_start <- Sys.time()
nb_orig <- train(CasesBin2 ~ .,
                 data = train.orig,
                 method = "naive_bayes",
                 trControl = ctrl,
                 metric = "ROC")
nb_orig_end <- Sys.time()
nb_orig_runtime <- nb_orig_end - nb_orig_start
nb_orig_runtime
nb_orig_train_pred <- predict(nb_orig,train.orig,type = "prob")
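# Fix the levels so the factor keeps both classes even if no probability exceeds 0.8
nb_orig_train <- factor(ifelse(nb_orig_train_pred$Yes > 0.8, "Yes", "No"), levels = c("No", "Yes"))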
confusionMatrix(nb_orig_train, train.orig$CasesBin2, positive = "Yes")
#F. Random Forest: original data
rf_orig_start <- Sys.time()
rf_orig <- train(CasesBin2 ~ .,
                 data = train.orig,
                 method = "rf",
                 trControl = ctrl,
                 metric = "ROC")
rf_orig_end <- Sys.time()
rf_orig_runtime <- rf_orig_end - rf_orig_start
rf_orig_runtime
rf_orig_train_pred <- predict(rf_orig,train.orig,type = "prob")
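# Fix the levels so the factor keeps both classes even if no probability exceeds 0.8
rf_orig_train <- factor(ifelse(rf_orig_train_pred$Yes > 0.8, "Yes", "No"), levels = c("No", "Yes"))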
confusionMatrix(rf_orig_train, train.orig$CasesBin2, positive = "Yes")
#################################################
#Naive Bayes Model - Test on original dataset#
#################################################
#A. NB Model predictions
nb_orig_pred_start <- Sys.time()
nb_orig_pred <- predict(nb_orig,test,type = "prob")
#B. NB - Assign class to probabilities
nb_orig_test <- factor(ifelse(nb_orig_pred$Yes > 0.8, "Yes", "No"), levels = c("No", "Yes"))
nb_orig_pred_end <- Sys.time()
nb_orig_pred_runtime <- nb_orig_pred_end - nb_orig_pred_start
nb_orig_pred_runtime
confusionMatrix(nb_orig_test, test$CasesBin2, positive = "Yes")
#C. NB Save Precision/Recall/F1
precision_nbOrig <- posPredValue(nb_orig_test, test$CasesBin2, positive = "Yes")
recall_nbOrig <- sensitivity(nb_orig_test, test$CasesBin2, positive = "Yes")
F1_nbOrig <- (2 * precision_nbOrig * recall_nbOrig) / (recall_nbOrig + precision_nbOrig)
#########################################
#Random Forest Model - Test on original dataset#
#########################################
#A. RF Model predictions
rf_orig_pred_start <- Sys.time()
rf_orig_pred <- predict(rf_orig,test,type = "prob")
#B. RF - Assign class to probabilities
rf_orig_test <- factor(ifelse(rf_orig_pred$Yes > 0.8, "Yes", "No"), levels = c("No", "Yes"))
rf_orig_pred_end <- Sys.time()
rf_orig_pred_runtime <- rf_orig_pred_end - rf_orig_pred_start
rf_orig_pred_runtime
confusionMatrix(rf_orig_test, test$CasesBin2, positive = "Yes")
#C. RF Save Precision/Recall/F1
precision_rfOrig <- posPredValue(rf_orig_test, test$CasesBin2, positive = "Yes")
recall_rfOrig <- sensitivity(rf_orig_test, test$CasesBin2, positive = "Yes")
F1_rfOrig <- (2 * precision_rfOrig * recall_rfOrig) / (recall_rfOrig + precision_rfOrig)
To try to improve the results, I would like to address two issues:
First, I think spatial relationships between neighborhoods would make a difference in my metrics. It would be interesting to define a proximity measure between neighborhoods (Euclidean distance between their centers, a binary variable that is 1 for adjacent neighborhoods and 0 otherwise, or some other metric) and to give the model the fact that a neighboring neighborhood did or did not have dengue in the previous time step. However, I don't know how to do this in my classification model.
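To make the spatial idea concrete, here is a rough sketch in base R of one possible encoding: a binary adjacency matrix multiplied by a case-indicator matrix, then shifted by one month. Everything in it is hypothetical (the three neighborhoods, the matrix `W`, the toy case values, and the feature name `NeighCase_lag1`); real adjacency would have to come from the city's neighborhood boundaries, and inverse-distance weights between centers could replace the 0/1 entries of `W`.

```r
# Hypothetical adjacency: W[i, j] = 1 if neighborhoods i and j share a border.
hoods <- c("A", "B", "C")
W <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0),
            nrow = 3, byrow = TRUE, dimnames = list(hoods, hoods))

# Toy case indicator (1 = at least one case): rows are neighborhoods, columns are months.
cases <- matrix(c(1, 0, 0,
                  0, 0, 1,
                  0, 1, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(hoods, paste0("m", 1:3)))

# Number of adjacent neighborhoods with a case, per neighborhood and month.
neigh_count <- W %*% cases

# Binary version: did any adjacent neighborhood have a case?
neigh_any <- (neigh_count > 0) * 1

# Shift by one month so that month t only sees the neighbors' status at t - 1.
neigh_any_lag1 <- cbind(NA, neigh_any[, -ncol(neigh_any), drop = FALSE])
colnames(neigh_any_lag1) <- colnames(cases)

# Long format (one row per neighborhood and month), ready to merge with the model data.
spatial_feat <- data.frame(
  Neighborhood   = rep(rownames(neigh_any_lag1), times = ncol(neigh_any_lag1)),
  Month          = rep(colnames(neigh_any_lag1), each = nrow(neigh_any_lag1)),
  NeighCase_lag1 = as.vector(neigh_any_lag1)
)
```

The resulting column could then be joined to the training data by neighborhood and month and used as one more predictor. The first month has no previous time step, so its value is NA and those rows would need to be dropped or imputed.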
Second, I'm only considering static relationships in the data: the values of the input variables in a given month are used to predict the cases in that same month. I would also like to explore dynamic temporal relationships. Most likely, the number of cases at a given time depends on past climatic variables, given the disease cycle. I thought of including past weather information in the model as additional input variables, but I don't know how to do that either.
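As an illustration of the temporal idea, the following base R sketch builds lagged copies of the climate variables within each neighborhood. The toy data frame, the helper `lag_k`, and the lag lengths of one and two months are assumptions made for the example; which lags matter would depend on the disease cycle, and the real Month/Year column would first have to be converted to something sortable.

```r
# Toy monthly panel; column names mirror the post, values are invented.
toy <- data.frame(
  Neighborhood = rep(c("A", "B"), each = 4),
  Month        = rep(1:4, times = 2),
  Precip       = c(10, 20, 30, 40, 5, 15, 25, 35),
  Temperature  = c(25, 26, 27, 28, 24, 25, 26, 27)
)

# Hypothetical helper: shift a vector k steps back in time, padding the start with NA.
lag_k <- function(x, k) c(rep(NA, k), x)[seq_along(x)]

# Rows must be in time order within each neighborhood before lagging.
toy <- toy[order(toy$Neighborhood, toy$Month), ]

# ave() applies the lag separately inside each neighborhood,
# so values never leak from one neighborhood into another.
toy$Precip_lag1 <- ave(toy$Precip, toy$Neighborhood, FUN = function(x) lag_k(x, 1))
toy$Precip_lag2 <- ave(toy$Precip, toy$Neighborhood, FUN = function(x) lag_k(x, 2))
toy$Temp_lag1   <- ave(toy$Temperature, toy$Neighborhood, FUN = function(x) lag_k(x, 1))

# The first months of each neighborhood have no history, so drop them before training.
toy_lagged <- toy[complete.cases(toy), ]
```

The lagged columns can be added to the `select()` call and passed to `train()` like any other predictor. Note that a random split across rows mixes past and future months, so a time-ordered split may be worth considering once lagged features are in the model.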