I have a sample of items that I've tracked to see how many times each was mentioned in the mass news media, so my dependent variable is a count (MenCount). The main independent variable I want to examine, OAStatus, is categorical with three unordered levels: gold, green, and pink. I also have at least one continuous independent variable I want to include, JJIF. I have about 626,000 records, 85 percent of which have no news mentions (MenCount = 0), but each color level still contains plenty of items that do have mentions. The median count is 3, but there are some large outliers with counts of 3,000+, so my data appear to be heavily overdispersed. Given the large number of zeroes and the overdispersion, it seems I should use a zero-inflated negative binomial model.
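The two properties described above (a large share of zeroes and overdispersion) can be checked with a couple of lines of base R. This is only a sketch on simulated data: `men_count` is a hypothetical stand-in for MenCount, and the simulation parameters are arbitrary assumptions, not estimates from the real data.

```r
# Toy data only: simulated counts standing in for MenCount (not the real data).
set.seed(1)
n <- 10000
structural_zero <- rbinom(n, 1, 0.6)        # assumed share of "always zero" items
counts <- rnbinom(n, size = 0.2, mu = 5)    # overdispersed negative binomial counts
men_count <- ifelse(structural_zero == 1, 0, counts)

# Proportion of records with no mentions.
zero_share <- mean(men_count == 0)

# Variance-to-mean ratio: a Poisson variable has a ratio of about 1,
# so a much larger value points to overdispersion.
dispersion_ratio <- var(men_count) / mean(men_count)

print(c(zero_share = zero_share, dispersion_ratio = dispersion_ratio))
```

On the real data the same two summaries would be computed on `AllItems_TotalCount$MenCount` directly.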
When I first started running this in R with different combinations of variables, I had no problem with my two main independent variables, OAStatus and JJIF. However, after trying other combinations and then going back to just these two, I started getting warnings and NAs in my results. I've narrowed the problem down to releveling OAStatus. If I let R order the levels by default (alphabetically), gold is the reference level and I get no NAs or warnings. However, I need pink to be the reference level, and when I relevel and run the code, I get this:
    Call:
    zeroinfl(formula = MenCount ~ OAStatus | OAStatus, data = AllItems_TotalCount, dist = "negbin")

    Pearson residuals:
         Min       1Q   Median       3Q      Max
     -0.2530  -0.2474  -0.1850  -0.1850 691.1543

    Count model coefficients (negbin with log link):
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)    0.118191   0.010052   11.76   <2e-16 ***
    OAStatusGold   0.293901   0.014946   19.66   <2e-16 ***
    OAStatusGreen  1.613422   0.017901   90.13   <2e-16 ***
    Log(theta)    -2.737584   0.005584 -490.27   <2e-16 ***

    Zero-inflation model coefficients (binomial with logit link):
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept)   -12.4955         NA      NA       NA
    OAStatusGold   12.2293         NA      NA       NA
    OAStatusGreen  -0.6785     3.8898  -0.174    0.862
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Theta = 0.0647
    Number of iterations in BFGS optimization: 24
    Log-likelihood: -5.035e+05 on 7 Df

    Warning message:
    In sqrt(diag(object$vcov)) : NaNs produced
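For reference, the zero-inflation coefficients in that output are on the logit scale, so they can be converted to implied probabilities of a structural zero with base R's `plogis()`. The sketch below only redoes arithmetic on the numbers printed above; the names `zi_coef`, `eta`, and `p_zero` are illustrative, and reading the result as a hint about the NAs is an assumption, not something the output confirms.

```r
# Zero-inflation coefficients copied from the output above (logit scale).
zi_coef <- c(intercept = -12.4955, gold = 12.2293, green = -0.6785)

# Linear predictor per level: the reference level is the intercept alone,
# the other levels add their own coefficient to it.
eta <- c(
  reference = zi_coef[["intercept"]],
  Gold      = zi_coef[["intercept"]] + zi_coef[["gold"]],
  Green     = zi_coef[["intercept"]] + zi_coef[["green"]]
)

# plogis() is the inverse logit: the implied probability of a structural zero.
p_zero <- plogis(eta)
print(signif(p_zero, 3))
```

The implied probability is roughly 0.43 for Gold and very close to zero for the reference level and for Green, so two of the three fitted probabilities sit near the boundary of the parameter space.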
I'm only using the one independent variable here, so I don't see how collinearity could be the problem. I've also tried releveling so that green is the reference level and got even more NAs. The same happens if I add my other independent variable, JJIF. Any ideas why the model would work with one particular reference level for OAStatus but not with any other? And why would it have worked previously but no longer?
I should also note that I've tried several ways of releveling:
AllItems_TotalCount$OAStatus <- relevel(AllItems_TotalCount$OAStatus, ref = "Green")
AllItems_TotalCount$OAStatus <- factor(AllItems_TotalCount$OAStatus, levels = c("Paywalled", "Gold", "Green"))
I even tried to cheat and just change the value "pink" to "blue" so it would automatically come first, but I still had the problem of the NAs.
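A small base R sketch of how the releveling itself can be sanity-checked, separately from the model fit. The data here are made up: `oa_status` and `men_count` are hypothetical stand-ins for the OAStatus and MenCount columns, with level names taken from the `factor()` call above.

```r
# Toy factor standing in for OAStatus; counts are made up, not real data.
oa_status <- factor(c("Gold", "Green", "Paywalled", "Gold", "Paywalled", "Green"))
men_count <- c(0, 4, 0, 2, 0, 0)

# Default level order is alphabetical, so the first level is the reference.
print(levels(oa_status))

# relevel() moves the chosen level to the front without dropping the others.
oa_releveled <- relevel(oa_status, ref = "Paywalled")
print(levels(oa_releveled))

# The underlying values should be unchanged; only the level order differs.
stopifnot(identical(as.character(oa_status), as.character(oa_releveled)))

# factor(..., levels = ...) silently turns values missing from `levels` into NA,
# so it is worth confirming that no NAs were introduced.
oa_refactored <- factor(as.character(oa_status),
                        levels = c("Paywalled", "Gold", "Green"))
stopifnot(!anyNA(oa_refactored))

# Zeros vs. non-zeros per level, as a quick look at how the zeros are spread.
zero_table <- table(oa_releveled, is_zero = men_count == 0)
print(zero_table)
```

If these checks pass on the real data, the releveling step is doing what it should, and the difference between the fits would have to come from the estimation rather than from the factor coding.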