gmnl package for discrete choice experiment

Hello

I try to perform a latent class analysis on my data from a discrete choice experiment. The respondents needed to chose between 2 options with as attributes: the number of children they prefer, and the educational level they prefer for their children (stated as a mixture of the number of children). The first rows of my data look like this:

Respondent Block Choice card Chosen FNoPrimary  FPrimary FSecondary FTertiary MNoPrimary
 1          1           1      1    0.0000000  0.0000000   0.00     0.0000000  0.0000000
 1          1           1      0    0.3333333  0.6666667   0.00     0.0000000  0.0000000
 1          2          12      0    0.3333333  0.3333333   0.00     0.0000000  0.0000000
 1          2          12      1    0.1666667  0.0000000   0.00     0.3333333  0.1666667
 1          3           2      0    0.0000000  0.0000000   1.00     0.0000000  0.0000000
 1          3           2      1    0.0000000  0.0000000   0.25     0.0000000  0.0000000
  MPrimary MSecondary MTertiary NChildren Age District   Religion Indigenous Ethnic group    Sex
1        0       1.00 0.0000000         1  18        0 Protestant          0      Wolaita Female
2        0       0.00 0.0000000         3  18        0 Protestant          0      Wolaita Female
3        0       0.00 0.3333333         9  18        0 Protestant          0      Wolaita Female
4        0       0.00 0.3333333        12  18        0 Protestant          0      Wolaita Female
5        0       0.00 0.0000000         1  18        0 Protestant          0      Wolaita Female
6        0       0.25 0.5000000         4  18        0 Protestant          0      Wolaita Female
       Educational level Studentornot Farmerornot Marital status Having children Ever used contraception
1 High school - grade 10            1           0              0               0                       0
2 High school - grade 10            1           0              0               0                       0
3 High school - grade 10            1           0              0               0                       0
4 High school - grade 10            1           0              0               0                       0
5 High school - grade 10            1           0              0               0                       0
6 High school - grade 10            1           0              0               0                       0
  Alternative
1           1
2           2
3           1
4           2
5           1
6           2

I looked at all the packages available in R and I think that only the gmnl package can handle my type of data and is able to add covariates. However, if I compare the output of my latent class analysis for a simple linear model with only 2 covariates (age and district) (as stated below), I become a totally different output then when I perform the same analysis with Stata (see code below).

in R:

defining_data <- mlogit.data(final_data_alternativeadded, id.var = "Respondent", choice = "Chosen", alt.var = "Alternative", chid.var="Choice.card", group.var = "Block", varying = 7:15, shape = "long")
mnl <- gmnl(Chosen ~ 1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + z | 0 | 0 | 0 | Age + District, data = defining_data, model = 'lc', Q = 3) 
summary(mnl)

in Stata:

ssc install lclogit
ssc install fmlogit
lclogit chosen fprimary fsecondary ftertiary mnoprimary mprimary msecondary mtertiary block nchildren, group(choicecard) id(respondent) nclasses(3) membership(age district)

I tried to make all my variables numeric, to multiply the mixture proportions by the number of children to get values which are closer to each other, to order my dataset based on the value of the number of choice cards per respondents... but I always get other values for the latent class probabilities. Does someone know why?

Thank you very much in advance

Kind regards

Eva

Hi, @Eva_Boonaert,

I lack Strata, but I do have a suggestion, which is to run the mlogit help page examples against strata and see if you come up with the same disconnect. If so, my guess would be that there are different default arguments to be run down.

Good luck, and please report back for the benefit of future pilgrims! Thanks.

Hey, thank you for your answer!

I tried it with an existing open dataset:

Stata:

use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
ssc install lclogit
ssc install fmlogit
lclogit y price contract local wknown tod seasonal, group(gid) id(pid) nclasses(3) seed(12345)

R:

getwd()
traindata <- read.xlsx("traindata.xlsx", 1, header = TRUE)
library(mlogit)
library(gmnl)
traindata[1:nrow(traindata),11] <- seq(1,4) #to add a column with the alternatives per choice numbered from 1 to 4
names(traindata) <- c('y', 'price', 'contract', 'local', 'wknown', 'tod', 'seasonal', 'gid', 'pid', 'X_xi', 'Alternative')
TM <- mlogit.data(traindata, choice = "y", id.var = "pid", alt.var = "Alternative", chid.var = "gid", shape = "long")
mnl <- gmnl(y ~ price + contract + local + wknown + tod + seasonal | 0 | 0 | 0 | 1, data = TM, model = 'lc', Q = 3)
summary(mnl)

However, Stata states that they become a loglikelihood of -1117.9987 while R becomes -1329.5 and both have totally different parameter estimations.

Does someone know why this is the case? The lclogit function in Stata uses the Expectation-Maximization algorithm while gmnl uses a Maximum Likelihood approach but I don't think that the outcomes can differ that much because of this. Does someone know what I do wrong with the code?

Thank you very much in advance

Kind regards

Eva

1 Like

I think it would help to start with a shared-in-common example from the gmnl help page:

library(gmnl)
#> Loading required package: maxLik
#> Loading required package: miscTools
#> 
#> Please cite the 'maxLik' package as:
#> Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
#> 
#> If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
#> https://r-forge.r-project.org/projects/maxlik/
#> Loading required package: Formula
data("TravelMode", package = "AER")
library(mlogit)
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
#> Loading required package: lmtest
TM <- mlogit.data(TravelMode, choice = "choice", shape = "long", 
                 alt.levels = c("air", "train", "bus", "car"), chid.var = "individual")
## Not run: 
## S-MNL model, ASCs not scaled
smnl <- gmnl(choice ~ wait + vcost + travel + gcost| 1, data = TM, 
             model = "smnl", R = 100, 
             notscale = c(1, 1, 1, rep(0, 4)))
#> 
#> The following variables are not scaled:
#> [1] "train:(intercept)" "bus:(intercept)"   "car:(intercept)"  
#> Estimating SMNL model
summary(smnl)
#> 
#> Model estimated on: Tue Oct 29 11:01:49 2019 
#> 
#> Call:
#> gmnl(formula = choice ~ wait + vcost + travel + gcost | 1, data = TM, 
#>     model = "smnl", R = 100, notscale = c(1, 1, 1, rep(0, 4)), 
#>     method = "bfgs")
#> 
#> Frequencies of categories:
#> 
#>     air   train     bus     car 
#> 0.27619 0.30000 0.14286 0.28095 
#> 
#> The estimation took: 0h:0m:5s 
#> 
#> Coefficients:
#>                     Estimate Std. Error z-value  Pr(>|z|)    
#> train:(intercept) -1.1613676  0.5883254 -1.9740 0.0483792 *  
#> bus:(intercept)   -1.9137960  0.7029972 -2.7223 0.0064822 ** 
#> car:(intercept)   -7.3090849  1.2810533 -5.7055 1.160e-08 ***
#> wait              -0.1391785  0.0208491 -6.6755 2.463e-11 ***
#> vcost             -0.1255688  0.0342655 -3.6646 0.0002477 ***
#> travel            -0.0184355  0.0042045 -4.3847 1.162e-05 ***
#> gcost              0.0992449  0.0285916  3.4711 0.0005183 ***
#> tau                0.4645904  0.1229279  3.7794 0.0001572 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Optimization of log-likelihood by BFGS maximization
#> Log Likelihood: -178.16
#> Number of observations: 210
#> Number of iterations: 61
#> Exit of MLE: successful convergence 
#> Simulation based on 100 draws

Created on 2019-10-29 by the reprex package (v0.3.0)

and to compare the loglikelihood result of -178.16 against what you would get in Strata using the lc model argument,

a string indicating which model is estimated. The options are "mnl" for the Multinomial Logit Model, "mixl" for the Mixed Logit Model, "smnl" for the Scaled Multinomial Logit Model, "gmnl" for the Generalized Multinomial Logit Model, "lc" for the Latent Class Multinomial Logit Model, and "mm" for the Mixed-Mixed Multinomial Logit Model.

with the same number of classes, Q.

If you get identical results in R and Strata then your disconnect can probably be solved by lining up corresponding models and tuning parameters.

Thank you very much for your response!

I tried it with your example as follows:

data("TravelMode", package = "AER")
#add column with the index of the choices counting up which is needed in Stata
TravelMode[1:nrow(TravelMode),10] <- rep(1:(nrow(TravelMode)/4), each=4) 
#make the column with choices numeric which is needed for Stata (called v10)
r <- c() 
for(i in 1:nrow(TravelMode)){
  if (TravelMode$choice[i]=="yes") {
    r[i] <- as.numeric(1)
  } else {
    r[i] <- as.numeric(0)
  }
}
TravelMode$choice <- r
#define the data
TM <- mlogit.data(TravelMode, choice = "choice", shape = "long", 
                  alt.levels = c("air", "train", "bus", "car"), chid.var = "individual")
#estimate the model via latent class analysis
mnl_TM <- gmnl(choice ~ wait + vcost + travel + gcost | 0 | 0 | 0 | 1, data = TM, model = 'lc', Q = 3) 
summary(mnl_TM)

in Stata:

ssc install lclogit
ssc install fmlogit
lclogit choice wait vcost travel gcost, group(v10) id(individual) nclasses(3) 

Now I become the same loglikelihood (-210.44). However, the model in R generates NAs (Warning message: In sqrt(diag(vcov(object))) : NaNs produced) for the estimate of class 2 and 3. So, I think the problem is related to the use of a different estimation procedure (Maximum Likelihood in gmnl in R vs Expectation Maximization in lclogit in Stata).

1 Like

Hi,

I'm getting beyond my comfort level based solely on reading the docs, but consider this example with slightly different parameters, which doesn't produce NaNs or NAs

library(mlogit)
#> Loading required package: Formula
#> Loading required package: zoo
#> 
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#> 
#>     as.Date, as.Date.numeric
#> Loading required package: lmtest
library(gmnl)
#> Loading required package: maxLik
#> Loading required package: miscTools
#> 
#> Please cite the 'maxLik' package as:
#> Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
#> 
#> If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
#> https://r-forge.r-project.org/projects/maxlik/
data("TravelMode", package = "AER")
TM <- mlogit.data(TravelMode, 
                           choice = "choice", 
                           shape = "long", 
                           alt.levels = c("air", "train", "bus", "car"), 
                           chid.var = "individual")
lc2 <- gmnl(choice ~ wait + vcost | 0 | 0 | 0 | income + size, 
                 data = TM, 
                 model = "lc",
                 Q = 2, 
                 method = 'bfgs')
#> Estimating LC model
summary(lc2)
#> 
#> Model estimated on: Wed Oct 30 21:02:24 2019 
#> 
#> Call:
#> gmnl(formula = choice ~ wait + vcost | 0 | 0 | 0 | income + size, 
#>     data = TM, model = "lc", Q = 2, method = "bfgs")
#> 
#> Frequencies of categories:
#> 
#>     air   train     bus     car 
#> 0.27619 0.30000 0.14286 0.28095 
#> 
#> The estimation took: 0h:0m:0s 
#> 
#> Coefficients:
#>                Estimate Std. Error z-value  Pr(>|z|)    
#> class.1.wait  -0.085539   0.016936 -5.0507 4.403e-07 ***
#> class.1.vcost  0.064883   0.013916  4.6624 3.126e-06 ***
#> class.2.wait   0.032227   0.014533  2.2174  0.026593 *  
#> class.2.vcost -0.050561   0.016343 -3.0936  0.001977 ** 
#> (class)2       1.276352   0.692142  1.8441  0.065174 .  
#> income:class2 -0.030496   0.013062 -2.3348  0.019555 *  
#> class2:size   -0.741815   0.370211 -2.0038  0.045096 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Optimization of log-likelihood by BFGS maximization
#> Log Likelihood: -237.5
#> Number of observations: 210
#> Number of iterations: 93
#> Exit of MLE: successful convergence

Created on 2019-10-30 by the reprex package (v0.3.0)

Thank you for your reply.

With Stata I got the same loglikelihood of -237.5 but the parameter estimates differ:

Iteration 52:  log likelihood = -237.50393

Latent class model with 2 latent classes

Choice model parameters and average classs shares
--------------------------------
    Variable |  Class1   Class2 
-------------+------------------
        wait |   0.032   -0.086 
       vcost |  -0.050    0.065 
-------------+------------------
 Class Share |   0.298    0.702 
--------------------------------

Class membership model parameters : Class2 = Reference class
--------------------------------
    Variable |  Class1   Class2 
-------------+------------------
      income |  -0.030    0.000 
        size |  -0.739    0.000 
       _cons |   1.276    0.000 
--------------------------------

Note: Model estimated via EM algorithm

So, I think the problem will be related to the algorithm itself?

This is my code in Stata:

ssc install lclogit
ssc install fmlogit
lclogit choce wait vcost, group (v10) id(individual) nclasses(3) membership(income size)

Eva

Stata lclogit uses the expectation-maximization algorithm, while gmnl with model = "lc" uses maximum simulated likelihood. MSL in Stata might provide an alternative way of synchronizing with R.