Error Message I don't Understand.

I ran a GAM with on categorical variable and one continuous variable. I am getting this error when trying to graph the function:

Error in model.frame.default(Terms[[i]], data, xlev = object$xlevels) :
factor factor(Type) has new level 1

I do not understand what it means even though I have looked at other comments is various communities.

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

I am working on a project that is using 5 hour ms time periods (continuous predictor) to determine the H2S concentration (continuous response). I want to determine if the type of experiment (with a sponge or without) Has an effect on the uptake rate of H2S concentration. I attempted to run an ANCOVA. But my data was not normal. I have attached my code.

>class(Average5HourDropData$Type)
>attach (Average5HourDropData)
>boxplot (Concentration ~ Type)

#Does not look normal. even after transformations. 

#Now doing GAM 
>library('mgcv')
>gam1<-gam(Concentration~s(Time)+factor(Type),data = Average5HourDropData)
>gam2 <- gam (Concentration~s(Time)+ s(Time, by=factor(Type)) + factor(Type), data=Average5HourDropData)

>AIC(gam1,gam2)

#df      AIC
#gam1 11.92811 204265.6
#gam2 20.99418 116060.9

>summary(gam2)

#Family: gaussian 
#Link function: identity 
#Formula:
#Concentration ~ s(Time) + s(Time, by = factor(Type)) + factor(Type)
#Parametric coefficients:
#Estimate Std. Error t value Pr(>|t|)    
#(Intercept)        43.52397    0.01777    2449   <2e-16 ***
#factor(Type)Sponge 38.27730    0.02513    1523   <2e-16 ***
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Approximate significance of smooth terms:
#edf Ref.df     F p-value    
#s(Time)                     0.6667 0.6667  1471  <2e-16 ***
#s(Time):factor(Type)Control 8.6618 8.6667  1253  <2e-16 ***
#s(Time):factor(Type)Sponge  8.6657 8.6667 11594  <2e-16 ***
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Rank: 28/29
#R-sq.(adj) =  0.994   Deviance explained = 99.4%
#GCV = 4.2744  Scale est. = 4.2713    n = 27050

>anova (gam2)

#Family: gaussian 
#Link function: identity 
#Formula:
#Concentration ~ s(Time) + s(Time, by = factor(Type)) + factor(Type)
#Parametric Terms:
#df       F p-value
#factor(Type)  1 2319697  <2e-16
#Approximate significance of smooth terms:
#edf Ref.df     F p-value
#s(Time)                     0.6667 0.6667  1471  <2e-16
#s(Time):factor(Type)Control 8.6618 8.6667  1253  <2e-16
#s(Time):factor(Type)Sponge  8.6657 8.6667 11594  <2e-16

>gam.check(gam2)

#Method: GCV   Optimizer: magic
#Smoothing parameter selection converged after 20 iterations.
#The RMS GCV score gradient at convergence was 4.213637e-07 .
#The Hessian was positive definite.
#Model rank =  28 / 29 
#Basis dimension (k) checking results. Low p-value (k-index<1) may
#indicate that k is too low, especially if edf is close to k'.
#k'   edf k-index p-value    
#s(Time)                     9.000 0.667     0.5  <2e-16 ***
#s(Time):factor(Type)Control 9.000 8.662     0.5  <2e-16 ***
#s(Time):factor(Type)Sponge  9.000 8.666     0.5  <2e-16 ***
#Signif. codes:  0 ‘***’0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

>plot(gam2)

#Summary plot
>new.data.IV1 <- rep(seq(min(Average5HourDropData$Time),max(Average5HourDropData$Time)), length(unique(factor(Average5HourDropData$Type)))[1])
>new.data.IV2 <- c(rep("1", length(new.data.IV1)/2), rep("2", length(new.data.IV1)/2))
>new.data <- as.data.frame(cbind(new.data.IV1, new.data.IV2))

>names(new.data) <- c("Time", "Type")

>new.data$Time<- as.numeric(as.character(new.data$Time))

#Error is here
>model.predict <- predict(gam2, newdata = new.data,type="response", se.fit=TRUE)

Error in model.frame.default(Terms[[i]], data, xlev = object$xlevels) : 
  factor factor(Type) has new level 1

Hello, and thanks again for providing code, however this is not a reproducible example because without data the code will not reproduce the error you experienced.

That said, my guess is that your use of factor(Type) throughout the model fit code, has no correlate with the later dataset on which you want to predict. You should predict on data that has the same form as that on which your model was trained. if you want Type to be a factor, then make it a factor everywhere, rather than inline/on the fly during fitting.

I don't think I understand your assumption. Because there is a correlation between type and concentration measurement. This is seen in the significance of the model. Type is a factor throughout the GAM.

Well let's say
Average5HourDropData is this

Time Concentration Type
600 120.3155574 Sponge
601 120.2848155 Sponge
602 120.2538643 Sponge
603 120.2235755 Sponge
604 120.1931335 Sponge
605 120.1661757 Sponge
606 120.1422358 Sponge
607 120.1132464 Sponge
608 120.0870242 Sponge
609 120.0613516 Sponge
610 120.0466342 Sponge
611 120.0327227 Sponge
612 120.0201408 Sponge
613 120.0020943 Sponge
614 119.989341 Sponge
615 119.9760124 Sponge
616 119.9590184 Sponge
617 119.945784 Sponge
618 119.9372138 Sponge
619 119.9233473 Sponge
620 119.9054254 Sponge
621 119.88972 Sponge
622 119.8812268 Sponge
623 119.8618457 Sponge
624 119.8420157 Sponge
600 55.81117364 Control
601 55.87698653 Control
602 55.94301309 Control
603 56.00449191 Control
604 56.03258702 Control
605 56.0574053 Control
606 56.08215191 Control
607 56.11047429 Control
608 56.14380632 Control
609 56.18873294 Control
610 56.24251382 Control
611 56.29417378 Control
612 56.34920071 Control
613 56.4034161 Control
614 56.47717362 Control
615 56.55403724 Control
616 56.6249944 Control
617 56.6987154 Control
618 56.76957097 Control
619 56.84017442 Control
620 56.90509887 Control
621 56.96636615 Control
622 57.0188798 Control
623 57.06590791 Control
624 57.1107242 Control