Error in rmultinom(1, size = 1, prob = prob) : NA in probability vector

Beryl1 · December 26, 2019, 10:08pm

Hello, I'm trying to impute missing values for both continuous and categorical covariates in cox regression, but i keep on getting this error; Error in rmultinom(1, size = 1, prob = prob) : NA in probability vector. Can somebody help me with this. Thank you.

library(survival)
library(reprex)
library(simsurv)

##generating survival data
set.seed(123)
covs <- data.frame(id = 1:10, trt = stats::rbinom(10, 1L, 0.5))
s1 <- simsurv(lambdas = 0.1, gammas = 1.5, betas = c(trt = -0.5),
              x = covs, maxt = 5)

###generating missing data
require("mice")
#> Loading required package: mice
#> Loading required package: lattice
#> 
#> Attaching package: 'mice'
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
set.seed(2016)
testdata <- MASS::mvrnorm(n = 10, mu = c(10, 5, 0), Sigma = matrix(data = c(1.0, 0.2, 0.2, 0.2, 1.0, 0.2, 0.2, 0.2, 1.0), nrow = 3, byrow = T))
testdata <- as.data.frame(testdata)
result <- ampute(testdata)


fac<-c("a","a","NA","NA","b","c","NA","b","d","c")
fac<-ifelse(fac=="NA",NA, fac)
fac<-as.factor(fac)

###Combining the data
data<-data.frame(result$amp,s1,fac)
head(data)
#>          V1       V2         V3 id eventtime status  fac
#> 1 10.309509 6.047666  0.5174884  1 5.0000000      0    a
#> 2  9.110356 4.832951         NA  2 2.9999360      1    a
#> 3 11.016491 4.755048         NA  3 0.9009437      1 <NA>
#> 4  9.136003       NA  0.5045639  4 0.5164432      1 <NA>
#> 5        NA 6.210910  1.4382012  5 5.0000000      0    b
#> 6 10.436947 5.705906 -0.5634077  6 3.9970411      1    c

################
###Multiple imputation
library(smcfcs)

#imputation method
method=as.matrix(rep("",dim(data)[2]))
rownames(method)=names(data)
method[c("fac"),]="mlogit" 
method[c("V1","V2","V3"),]="norm" 

# impute the data
mi=smcfcs(originaldata=data,smtype="coxph",smformula="Surv(eventtime,status) ~  V1+V2+V3",
              method=method,m=10,numit=10)
#> [1] "Outcome variable(s): eventtime,status"
#> [1] "Passive variables: "
#> [1] "Partially obs. variables: V1,V2,V3,fac"
#> [1] "Fully obs. substantive model variables: "
#> [1] "Imputation  1"
#> [1] "Imputing:  V1  using  V2,V3,fac  plus outcome"
#> [1] "Imputing:  V2  using  V1,V3,fac  plus outcome"
#> [1] "Imputing:  V3  using  V1,V2,fac  plus outcome"
#> [1] "Imputing:  fac  using  V1,V2,V3  plus outcome"
#> Error in rmultinom(1, size = 1, prob = prob): NA in probability vector .In addition: There were 50 or more warnings (use warnings() to see the first 50)

^{Created on 2019-12-26 by the reprex package (v0.3.0)}

technocrat · December 27, 2019, 12:15am

That is a fabulous reprex. Thank you. I'm analyzing the smcfcs code and trying to work backwards from the error in rmultinom. It's a bit tricky because smcfcs is just a wrapper for smcfcs.core which is where rmultinom and where, obviously prob is the problem. Now, I'm looking at the multiple places prob is defined. One of them is in this block

            else if ((smtype=="coxph") | (smtype=="casecohort")) {
              outmodxb <-  model.matrix(as.formula(smformula),tempData)
              outmodxb <- as.matrix(outmodxb[,2:dim(outmodxb)[2]]) %*% as.matrix(outcomeModBeta)
              s_t = exp(-H0[i]* exp(outmodxb))
              prob = exp(1 + outmodxb - (H0[i]* exp(outmodxb)) ) * H0[i]
              prob = d[i]*prob + (1-d[i])*s_t
              reject = 1*(uDraw > prob )
            }

I'm now trying to figure out how the input data is setting prob to create NAs.

As Stapleton said, I may be some time. Hopefully, someone more expert will chime in while I'm trying.

technocrat · December 27, 2019, 1:12am

Could you look at survival::lung to see if there are non-quantitative variables suitable for further testing?

# simplified to test with actual data quantitative variables
suppressPackageStartupMessages(library(dplyr)) 
library(survival)
library(reprex)
library(simsurv)
library(smcfcs)

# substitute lung data set from survival package
data <- lung
# reduce to simpler form
d2 <- data %>% dplyr::select(time,status,meal.cal,wt.loss)

# impute the data
mi=smcfcs(originaldata=d2,smtype="coxph",smformula="Surv(time,status) ~  meal.cal + wt.loss", method=c("","","norm","norm"))
#> [1] "Outcome variable(s): time,status"
#> [1] "Passive variables: "
#> [1] "Partially obs. variables: meal.cal,wt.loss"
#> [1] "Fully obs. substantive model variables: "
#> [1] "Imputation  1"
#> [1] "Imputing:  meal.cal  using  wt.loss  plus outcome"
#> [1] "Imputing:  wt.loss  using  meal.cal  plus outcome"
#> [1] "Imputation  2"
#> [1] "Imputation  3"
#> [1] "Imputation  4"
#> [1] "Imputation  5"
#> Warning in smcfcs.core(originaldata, smtype, smformula, method,
#> predictorMatrix, : Rejection sampling failed 347 times (across all variables,
#> iterations, and imputations). You may want to increase the rejection sampling
#> limit.
# output of mi exceeds allowable

^{Created on 2019-12-26 by the reprex package (v0.3.0)}

Beryl1 · December 27, 2019, 3:31pm

Hi @technocrat. Thank you.
I think that code got me more confused. So where exactly is the error coming from? Is it from the data or the methods used?

Beryl1 · December 27, 2019, 3:33pm

let me try the survival::lung.

technocrat · December 27, 2019, 6:02pm

Hi, @Beryl1

Sorry to be unhelpful. Here's my thought process

Even well-constructed test data may produce anomalous results, so actual data that's been used to illustrate functions involved is preferable.
We're working with a smtype="coxph" to smcfcs.
The source code for smcfcs indicates that it calls smcfcs.core which assigns prob for smtype="coxph, and that is the argument that throws the error rmultinom(1, size = 1, prob = prob): NA in probability vector
The only code in smcfcs.core that invokes rmultinom is catdraw. It's signature is

catdraw <- function(prob) {
  (1:length(prob))[rmultinom(1,size=1,prob=prob)==1]
}

The prob argument to rmultinom is in the form c(0.1,0.2,0.8)
catdraw is called by a branch in

if (method[targetCol]=="logreg") {
            directImpProbs = directImpProbs[,2]
            if (is.factor(imputations[[imp]][,targetCol])==TRUE) {
              imputations[[imp]][imputationNeeded,targetCol] <- levels(imputations[[imp]][,targetCol])[1]
              imputations[[imp]][imputationNeeded,targetCol][rbinom(length(imputationNeeded),1,directImpProbs)==1] <- levels(imputations[[imp]][,targetCol])[2]
            }
            else {
              imputations[[imp]][imputationNeeded,targetCol] <- rbinom(length(imputationNeeded),1,directImpProbs)
            }
          }
          else {
            imputations[[imp]][imputationNeeded,targetCol] <- levels(imputations[[imp]][,targetCol])[apply(directImpProbs, 1, catdraw)]
          }

The branch with catdraw is the fall through when neither logreg nor rbinom imputations are needed.
The imputations needed ultimately derive from the positional arguments to

method=c("","","norm","norm"))

in the arguments to smcfcs
8. In my example, the first two positional arguments are "" because they represent the model terms; the second are norm because I wanted to test imputation of continuous data.
9. The continuous data had to travel through the same path as any other arguments that could be given if a larger data frame than the subset of lungs that I used.
11. c("V1","V2","V3") were specified as norm.
12. That would seem to assume that the non-NA distribution of those vectors is normal. If they are not, they would be passed through by catpath to trip by rmultnorm.
13. The method that is not norm, mlogit for fac, may also produce a vector of probabilities that includes NA which would cause the same problems.
14. Because lung with two norm methods did not throw the error, the data in the variables I selected must not produce NA in the arguments passed to catdraw.
15. ∴ If we added an mlogit variable to the subset of lung we could test for the presence of NA.
16. If NA is not present, the rmultnorm error will not occur.
17. That means your reprex fac variable is the root cause.

Does that make sense?

Beryl1 · December 27, 2019, 10:12pm

Hi, @technocrat,

It makes a lot of sense. Looks like the problem is with the factor variables. But for the survival::lung data, I have made a bit of changes in your example. I have added one more variable (factor) to d2 and it seems to be working. So I don't quit understand why it doesn't work with my example and the actual data that i'm working on.

# simplified to test with actual data quantitative variables
suppressPackageStartupMessages(library(dplyr)) 
library(survival)
library(reprex)
library(simsurv)
library(smcfcs)

# substitute lung data set from survival package
data <- lung
# reduce to simpler form
d2 <- data %>% dplyr::select(time,status,meal.cal,wt.loss,ph.ecog)
d2$ph.ecog<-as.factor(d2$ph.ecog)

# impute the data
mi=smcfcs(originaldata=d2,smtype="coxph",smformula="Surv(time,status) ~  meal.cal + wt.loss+ph.ecog", method=c("","","norm","norm","mlogit"))
#> [1] "Outcome variable(s): time,status"
#> [1] "Passive variables: "
#> [1] "Partially obs. variables: meal.cal,wt.loss,ph.ecog"
#> [1] "Fully obs. substantive model variables: "
#> [1] "Imputation  1"
#> [1] "Imputing:  meal.cal  using  wt.loss,ph.ecog  plus outcome"
#> [1] "Imputing:  wt.loss  using  meal.cal,ph.ecog  plus outcome"
#> [1] "Imputing:  ph.ecog  using  meal.cal,wt.loss  plus outcome"
#> [1] "Imputation  2"
#> [1] "Imputation  3"
#> [1] "Imputation  4"
#> [1] "Imputation  5"
#> Warning in smcfcs.core(originaldata, smtype, smformula, method,
#> predictorMatrix, : Rejection sampling failed 294 times (across all
#> variables, iterations, and imputations). You may want to increase the
#> rejection sampling limit.

^{Created on 2019-12-27 by the reprex package (v0.3.0)}

technocrat · December 28, 2019, 1:43am

Sounds like progress. It's a valuable discovery that ph.ecog, a factor, works with method mlogit. So, the question is why the code that produces a vector of probabilities should include one or more NAs.

I thought at first it might be because the original data was to small, so I tried increasing nrow to 100, from 10, but no difference.

The next possibility that occurs to me is that perhaps there were two few factors. Your lungs adaptation, however, I think also had only three factors?

What's left, I wonder, may be that the factors are random. Could you test that on your data?

Beryl1 · December 28, 2019, 2:01am

Thank you @technocrat for your help. for my data, I have three factor variables each with at least three levels. How do you mean the factors are random? and how can i test that? If they are random, what should I do with them?

technocrat · December 28, 2019, 5:11am

Now the mystery is how your factor distributions differ from those you constructed for lung. I'm still trying to figure out what accounts for the difference. I'm now quite a bit further over my head than when we started.

system · January 18, 2020, 5:11am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.