Keep getting the error "contrasts can be applied only to factors with 2 or more levels", even though every factor has at least 2 levels and no NA values

Hi,
I am attempting to find the 95% CI of the c-statistic for ROC curves using bootstrapping. However, I keep getting the following error when I run this code:

FUN = function(Excluded_data, i){
  fit = glm(LOS_quartiles ~ PH_Z + AGE_SURGERY + SEX + RACE + ETHNICITY + MARITAL_STATUS + INSURANCE, data = Excluded_data[i,], family = "binomial")
  DescTools::Cstat(fit)
}

res = boot(Excluded_data, FUN, R=999)
boot.ci(boot.out = res, type = "perc")

Error: contrasts can be applied only to factors with 2 or more levels.

However, all of the factors and continuous variables (AGE_SURGERY and PH_Z) have more than 1 unique value and I have filtered out NA values. How can I resolve this issue?

Any suggestions would be much appreciated. Thanks in advance.

Please provide reproducible code.
Some bootstrap samples may contain only one level of a factor unless you stratify or guard against it.
Also, how big is your sample size?
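
To illustrate how a resample can lose a factor level, here is a minimal base R sketch. The toy data frame, its columns (y, x, grp) and the helper names has_two_levels and guarded_stat are all made up for illustration; none of them come from the original post, and the statistic returned is a plain coefficient rather than the c-statistic.

```r
# Hypothetical toy data: level "b" of 'grp' is rare, so some resamples will miss it.
set.seed(1)
toy <- data.frame(
  y   = rbinom(30, 1, 0.5),
  x   = rnorm(30),
  grp = factor(c(rep("a", 28), rep("b", 2)))
)

# TRUE if every factor/character column in `d` still has at least 2 observed values.
has_two_levels <- function(d) {
  is_cat <- vapply(d, function(col) is.factor(col) || is.character(col), logical(1))
  all(vapply(d[is_cat],
             function(col) length(unique(col[!is.na(col)])) >= 2,
             logical(1)))
}

# Count how many of 200 resamples contain only one level of 'grp'.
n_bad <- sum(replicate(200, !has_two_levels(toy[sample(nrow(toy), replace = TRUE), ])))
n_bad

# Statistic with the (data, i) signature that boot() expects.
# It returns NA instead of failing when a resample is degenerate.
guarded_stat <- function(data, i) {
  d <- data[i, ]
  if (!has_two_levels(d) || length(unique(d$y)) < 2) return(NA_real_)
  fit <- glm(y ~ x + grp, data = d, family = "binomial")
  unname(coef(fit)["x"])
}

# Rows 1..28 are all level "a", so the guard returns NA here.
guarded_stat(toy, 1:28)
```

A function like guarded_stat could be passed to boot() in place of FUN, but any NA replicates would then need to be dealt with before calling boot.ci(). The strata argument of boot() is another possible route when a single grouping factor is the problem.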

Hi, thanks for the reply. I have updated the post with formatted code. The 'Excluded_data' variable is the dataset, and the variables in the glm command are either continuous (PH_Z, AGE_SURGERY) or categorical (all others). They all have more than 1 unique value when I check with the 'unique' function. Some variables have 2 levels, being 0 or 1.

The sample size is 150.

Thanks again
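
Checking with unique() only shows that a level exists somewhere in the full data, not how often it occurs. A small sketch of counting observations per level follows, using a made-up data frame d as a stand-in for Excluded_data (the values are invented for illustration). Levels with very few observations are the ones a bootstrap resample is most likely to miss.

```r
# Hypothetical stand-in for Excluded_data; the values are invented.
d <- data.frame(
  SEX  = factor(c("F", "M", "F", "M", "F", "F")),
  RACE = factor(c("A", "A", "A", "A", "A", "B")),
  AGE  = c(34, 51, 47, 62, 29, 55)
)

# Tabulate each factor column and report the size of its smallest level.
is_cat       <- vapply(d, is.factor, logical(1))
level_counts <- lapply(d[is_cat], table)
min_count    <- vapply(level_counts, min, numeric(1))
min_count
```

In this toy example RACE has a level with a single observation, so many resamples of d would contain only one RACE level.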

I cannot help if I cannot reproduce the problem on my end, so you should at a minimum provide a minimal dataset that reproduces your error, or use a built-in R dataset that shows the problem.
See here. It might be NAs or something else going on in your fitting.

Hi, I have uploaded the required data variables to github. Please find it in the following link: GitHub - renren17/dataset

Any further help would be much appreciated. Thanks in advance.

You seem to be sharing two files, both with unknown data types.

Apologies, I have used the 'write.csv' function to convert the data into a vector format - the first row represents the variable names and each column of values represents the data for that variable. The 'Excluded_data_variable' includes all the required variables for the above code apart from the 'PHZ variable' which I have uploaded separately. That is why there are 2 files.

If the PH_Z referenced in your glm comes from a different dataset than the rest of the variables (SEX, RACE, etc.), i.e. you have some method of combining the data in your two files into a single Excluded_data, which seems to be the sole input to the code you want help with, then it would make more sense to simply use saveRDS() on your Excluded_data and share that one dataset.
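
A minimal sketch of the saveRDS() round trip, using a made-up two-row data frame and a temporary file path (both are assumptions for illustration). Unlike a CSV export, an RDS file keeps column types such as factors, which may matter for reproducing a contrasts error.

```r
# Hypothetical two-row stand-in for Excluded_data.
d <- data.frame(
  PH_Z = c(-0.5, 1.2),
  SEX  = factor(c("F", "M"))
)

# Save the object to a temporary file and read it back.
path <- file.path(tempdir(), "Excluded_data.rds")
saveRDS(d, path)
restored <- readRDS(path)

# The restored object keeps the factor column as a factor.
is.factor(restored$SEX)
identical(d, restored)
```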

I have now updated the 'Excluded_data' set to include the PH_Z variable. Thanks in advance.

LOS_quartiles is not a named column in 'Excluded_data' but seems to be required?

I've updated it now to include 'LOS_quartiles'. Thanks in advance.

I ran your code, and did not see your error.

library(tidyverse)
library(DescTools)
library(boot)
Excluded_data <- read_csv("Excluded_data.csv")

FUN = function(Excluded_data, i){
  fit = glm(LOS_quartiles ~ PH_Z + AGE_SURGERY + SEX + RACE + ETHNICITY + MARITAL_STATUS +INSURANCE, data = Excluded_data[i,], family = "binomial")
  DescTools::Cstat(fit)
}

res = boot(Excluded_data, FUN, R=999)
boot.ci(boot.out = res, type = "perc")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates

CALL : 
boot.ci(boot.out = res, type = "perc")

Intervals : 
Level     Percentile     
95%   ( 0.6293,  0.8220 )  
Calculations and Intervals on Original Scale