Help! I have been going around in circles trying to get this confusion matrix to work. I think the problem has to do with the partitioning. I have tried multiple ways to partition the data. With some of them the factor levels get messed up, or every observation of a particular factor level ends up in the test set. I am at a complete loss. I am fairly new to R. Please explain what I am doing wrong so I can fix it, or suggest a way to fix it.
> library(leaps)
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
RStudio Community is a great place to get help: https://forum.posit.co/c/tidyverse.
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> studentreport<-read.csv("C:\\Users\\Joseph\\Downloads\\studentreport dataset full imp.csv",header=T,sep=",")
> studentreport<-data.frame(studentreport)
>
> set.seed(123)
> smp_size = 7239
> training<- sample_n(studentreport,smp_size)
> testing<- setdiff(studentreport,training_data)
Error in setdiff_data_frame(x, y) : object 'training_data' not found
> testing<- setdiff(studentreport,training)
> str(training)
'data.frame': 7239 obs. of 13 variables:
 $ Enrolling: logi FALSE TRUE TRUE FALSE FALSE FALSE ...
 $ School   : Factor w/ 2480 levels "A C Flora High School",..: 953 1191 1951 354 2159 32 677 8 870 1986 ...
 $ State    : Factor w/ 49 levels "AE","AL","AR",..: 40 40 28 34 38 40 39 40 31 40 ...
 $ age      : int 17 18 19 18 18 18 18 18 18 18 ...
 $ Gender   : Factor w/ 4 levels "Female","Male",..: 1 1 1 2 2 2 1 2 2 1 ...
 $ Race     : Factor w/ 7 levels "A","B","C","D",..: 1 1 1 7 6 4 7 1 1 1 ...
 $ Major    : Factor w/ 62 levels "Accounting","African American Studies",..: 10 11 23 60 38 50 20 55 1 60 ...
 $ ACT      : int 25 21 28 25 25 18 25 25 25 16 ...
 $ SAT      : num 1810 910 1625 1625 1790 ...
 $ Rank     : num 8 132 60 60 60 57 26 60 60 130 ...
 $ CSize    : int 329 397 337 337 337 270 131 337 337 430 ...
 $ GPA      : num 4.88 4.08 4.88 2.87 3.2 ...
 $ GPAType  : Factor w/ 3 levels "not known","Unweighted",..: 3 3 3 3 3 3 3 3 3 3 ...
> str(testing)
'data.frame': 2414 obs. of 13 variables:
 $ Enrolling: logi TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ School   : Factor w/ 2480 levels "A C Flora High School",..: 350 1962 281 2317 423 2013 518 1767 1614 1613 ...
 $ State    : Factor w/ 49 levels "AE","AL","AR",..: 44 34 20 20 20 20 23 31 5 9 ...
 $ age      : int 18 18 18 19 18 18 18 18 19 19 ...
 $ Gender   : Factor w/ 4 levels "Female","Male",..: 1 2 1 1 1 1 2 1 1 1 ...
 $ Race     : Factor w/ 7 levels "A","B","C","D",..: 7 1 1 7 7 1 6 7 1 7 ...
 $ Major    : Factor w/ 62 levels "Accounting","African American Studies",..: 23 10 19 24 10 60 11 60 14 20 ...
 $ ACT      : int 22 25 25 25 25 22 25 25 27 25 ...
 $ SAT      : num 1390 1540 1570 1430 1590 ...
 $ Rank     : num 60 60 60 60 60 60 60 60 60 60 ...
 $ CSize    : int 337 337 337 337 337 337 337 337 337 337 ...
 $ GPA      : num 3.8 3.22 3.4 3.39 3.4 ...
 $ GPAType  : Factor w/ 3 levels "not known","Unweighted",..: 3 2 3 3 3 2 3 3 2 3 ...
> fitreport<-glm(Enrolling~.,train,family="binomial")
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> itstart=glm(Enrolling~1,data=training,family="binomial")
> Fitstart=glm(Enrolling~1,data=training,family="binomial")
>
> Report<-step(Fitstart,direction="forward",scope=formula(fitreport))
Start: AIC=7463.71
Enrolling ~ 1

             Df Deviance     AIC
+ State      48   7186.8  7284.8
+ ACT         1   7362.0  7366.0
+ Rank        1   7419.7  7423.7
+ GPA         1   7443.7  7447.7
+ CSize       1   7457.4  7461.4
+ GPAType     1   7457.9  7461.9
<none>            7461.7  7463.7
+ Gender      3   7455.8  7463.8
+ age         1   7460.1  7464.1
+ SAT         1   7460.2  7464.2
+ Race        6   7452.6  7466.6
+ Major      61   7363.5  7487.5
+ School   2150   5074.8  9376.8

Step: AIC=7284.83
Enrolling ~ State

             Df Deviance     AIC
+ Rank        1   7149.0  7249.0
+ ACT         1   7149.2  7249.2
+ GPA         1   7167.3  7267.3
+ CSize       1   7182.6  7282.6
+ age         1   7183.4  7283.4
<none>            7186.8  7284.8
+ SAT         1   7185.4  7285.4
+ Gender      3   7181.4  7285.4
+ GPAType     1   7186.4  7286.4
+ Race        6   7176.9  7286.9
+ Major      61   7089.7  7309.7
+ School   2141   5300.4  9680.4

Step: AIC=7248.99
Enrolling ~ State + Rank

             Df Deviance     AIC
+ ACT         1   7117.9  7219.9
+ GPA         1   7143.7  7245.7
+ CSize       1   7144.9  7246.9
+ age         1   7145.2  7247.2
<none>            7149.0  7249.0
+ SAT         1   7147.5  7249.5
+ GPAType     1   7148.5  7250.5
+ Gender      3   7145.1  7251.1
+ Race        6   7140.2  7252.2
+ Major      61   7058.0  7280.0
+ School   2142   5152.9  9536.9

Step: AIC=7219.89
Enrolling ~ State + Rank + ACT

             Df Deviance     AIC
+ age         1   7114.4  7218.4
<none>            7117.9  7219.9
+ CSize       1   7116.3  7220.3
+ SAT         1   7116.4  7220.4
+ GPA         1   7116.9  7220.9
+ Gender      3   7113.3  7221.3
+ GPAType     1   7117.3  7221.3
+ Race        6   7108.2  7222.2
+ Major      61   7022.6  7246.6
+ School   2141   6205.7 10589.7

Step: AIC=7218.37
Enrolling ~ State + Rank + ACT + age

             Df Deviance     AIC
<none>            7114.4  7218.4
+ CSize       1   7112.7  7218.7
+ SAT         1   7112.9  7218.9
+ GPA         1   7113.6  7219.6
+ GPAType     1   7113.8  7219.8
+ Gender      3   7110.2  7220.2
+ Race        6   7104.7  7220.7
+ Major      61   7019.2  7245.2
+ School   2142   8281.6 12669.6
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: algorithm did not converge
4: glm.fit: fitted probabilities numerically 0 or 1 occurred
5: glm.fit: algorithm did not converge
6: glm.fit: fitted probabilities numerically 0 or 1 occurred
7: glm.fit: algorithm did not converge
8: glm.fit: fitted probabilities numerically 0 or 1 occurred
> Modelout<-predict(Report,newdata=testing,type="response")
> formula(Report)
Enrolling ~ State + Rank + ACT + age
> confusionMatrix(Modelout,testing$Enrolling,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> confusionMatrix(Modelout,testing,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> str(Modelout)
 Named num [1:2414] 0.186 0.138 0.17 0.185 0.17 ...
 - attr(*, "names")= chr [1:2414] "1" "2" "3" "4" ...
> testresults<- ifelse(Modelout> 0.5,TRUE,FALSE)
> confusionMatrix(testresults,testing,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> confusionMatrix(testresults,testing$Enrolling,positive=1)
Error: `data` and `reference` should be factors with the same levels.
> confusionMatrix(testresults,testing$Enrolling)
Error: `data` and `reference` should be factors with the same levels.
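
The error text suggests the partitioning may not be the immediate culprit: confusionMatrix() asks for two factors with the same levels, while Modelout is numeric and both testresults and testing$Enrolling are logical. Below is a minimal sketch of one possible way around that, not a definitive fix. It runs on simulated data because the original CSV is not available, so the `toy` data frame, its values, the 75/25 split and the 0.5 cutoff are all assumptions made for illustration. It also splits by row index instead of setdiff(), since dplyr's setdiff() removes duplicate rows and may therefore return fewer test rows than expected.

```r
# Simulated stand-in for studentreport: a few column names mirror the post,
# but every value here is made up purely for illustration.
set.seed(123)
n <- 1000
toy <- data.frame(
  ACT  = round(rnorm(n, mean = 24, sd = 4)),
  Rank = round(runif(n, min = 1, max = 300)),
  age  = sample(17:19, n, replace = TRUE)
)
toy$Enrolling <- runif(n) < plogis(0.4 * (toy$ACT - 24))  # logical outcome

# Split by row index rather than setdiff(): every row lands in exactly one
# of the two sets, even when the data contain duplicated rows.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
training  <- toy[train_idx, ]
testing   <- toy[-train_idx, ]

fit  <- glm(Enrolling ~ ACT + Rank + age, data = training, family = "binomial")
prob <- predict(fit, newdata = testing, type = "response")

# Convert both the predictions and the reference to factors that share
# the same levels, in the same order.
lv        <- c("FALSE", "TRUE")
predicted <- factor(unname(prob > 0.5), levels = lv)  # 0.5 cutoff is an assumption
actual    <- factor(testing$Enrolling, levels = lv)

# A base R cross-tabulation needs no extra packages.
cm_table <- table(Predicted = predicted, Actual = actual)
print(cm_table)

# caret's version, run only if the package is installed. Note that
# `positive` takes a level name (a character string), not the number 1.
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(data = predicted, reference = actual,
                               positive = "TRUE"))
}
```

On the real data the same three steps would presumably apply: threshold Modelout, wrap both the thresholded predictions and testing$Enrolling in factor() with identical levels, and pass positive="TRUE". If a factor level such as a rare State still ends up only in the test set, caret's createDataPartition() (stratified on the outcome) or dropping very sparse predictors are options worth trying, though neither is verified against this dataset.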