Logistic Regression using glmnet(): accuracy measure from mean() returns 0

blackish952 · June 19, 2018, 2:58am

Hello,
I am building a logistic regression model using the glmnet package:

    > # Prep Training and Test data.
    > trainDataIndex <- sample(1:nrow(df), 0.7*nrow(df))  # 70% training data
    > trainData <- df[trainDataIndex, ]
    > testData <- df[-trainDataIndex, ]
    > set.seed(100)
    > trainData <- 
    +   trainData %>%
    +   dplyr::mutate(CUST_REGION_DESCR = 
    +                   forcats::fct_relabel(CUST_REGION_DESCR, ~ trimws(.x)))
    > testData <- 
    +   testData %>%
    +   dplyr::mutate(CUST_REGION_DESCR = 
    +                   forcats::fct_relabel(CUST_REGION_DESCR, ~ trimws(.x)))
    > str(trainData)
    'data.frame':	693843 obs. of  4 variables:
     $ cust_prog_level  : Factor w/ 14 levels "B","C","D","E",..: 9 7 10 9 10 9 10 5 10 5 ...
     $ CUST_REGION_DESCR: Factor w/ 8 levels "CORPORATE REGION",..: 2 6 7 6 8 8 4 7 7 6 ...
     $ Sales            : num  92.7 2356 39 239.6 26 ...
     $ New_Product_Type : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
    > str(testData)
    'data.frame':	297362 obs. of  4 variables:
     $ cust_prog_level  : Factor w/ 14 levels "B","C","D","E",..: 9 5 9 9 9 9 3 3 5 3 ...
     $ CUST_REGION_DESCR: Factor w/ 8 levels "CORPORATE REGION",..: 3 3 6 6 7 6 7 2 2 4 ...
     $ Sales            : num  150.2 68.5 68.1 72.1 60.1 ...
     $ New_Product_Type : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
    
    > x = model.matrix(New_Product_Type ~.,data=trainData)
    
    > cvfit = cv.glmnet(x, y=as.factor(trainData$New_Product_Type), alpha=1, family="binomial",type.measure = "mse")
    
    > lambda_1se <- cvfit$lambda.1se
    
    > coef(cvfit,s=lambda_1se)
    23 x 1 sparse Matrix of class "dgCMatrix"
                                                    1
    (Intercept)                            0.02946581
    (Intercept)                            .         
    cust_prog_levelC                       0.14012975
    cust_prog_levelD                       .         
    cust_prog_levelE                       0.13339906
    cust_prog_levelG                      -0.05325043
    cust_prog_levelI                       0.21440592
    cust_prog_levelL                       0.26273503
    cust_prog_levelM                       .         
    cust_prog_levelN                       0.26620261
    cust_prog_levelP                      -0.05166799
    cust_prog_levelR                      -0.33054803
    cust_prog_levelS                       .         
    cust_prog_levelX                       0.57508875
    cust_prog_levelZ                       1.20748454
    CUST_REGION_DESCRMOUNTAIN WEST REGION -0.20993854
    CUST_REGION_DESCRNORTH CENTRAL REGION -0.04035331
    CUST_REGION_DESCRNORTH EAST REGION     0.01082858
    CUST_REGION_DESCROHIO VALLEY REGION    0.03077584
    CUST_REGION_DESCRSOUTH CENTRAL REGION  .         
    CUST_REGION_DESCRSOUTH EAST REGION     0.10606213
    CUST_REGION_DESCRWESTERN REGION       -0.17587036
    Sales                                 -0.01223843
    
    > #get test data
    > x_test <- model.matrix(New_Product_Type~.,data = testData)
    > #predict New_Product_Type, type="New_Product_Type"
    > lasso_prob <- predict(cvfit,newx = x_test,s=lambda_1se,type="response")
    
    > #translate probabilities to predictions
    > lasso_predict <- rep("neg",nrow(testData))
    > lasso_predict[lasso_prob>.5] <- "pos"
    > #confusion matrix
    > table(pred=lasso_predict,true=testData$New_Product_Type)
         true
    pred       0      1
      neg 207840  60865
      pos   8697  19960
    > #accuracy
    
    > lasso_predict[lasso_prob>.8] <- "pos"
    > #confusion matrix
    > table(pred=lasso_predict,true=testData$New_Product_Type)
         true
    pred       0      1
      neg 207840  60865
      pos   8697  19960

When I test the accuracy, the return value is 0:

    > #accuracy
    > mean(lasso_predict==testData$New_Product_Type)
    [1] 0

So does this mean my model has zero accuracy?

joels · June 19, 2018, 5:52am

Your example isn't reproducible, but it looks like your code is analogous to the example below. The outcome New_Product_Type has values of "1" or "0". But you're setting lasso_predict to have values of "pos" or "neg". Since the labels of the actual and predicted values never match, the number "correct" is always zero, even if the predictions are perfect (as they are in the example below).

    # Actual outcomes
    New_Product_Type = c("1","0","0","1","1","0")

    # Predicted outcomes
    lasso_predict = c("pos","neg","neg","pos","pos","neg")

    New_Product_Type == lasso_predict
    [1] FALSE FALSE FALSE FALSE FALSE FALSE

    mean(New_Product_Type == lasso_predict)
    [1] 0
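
One way to make the comparison meaningful is to code the predictions with the same labels as the outcome. The sketch below uses toy data and made-up probabilities (`prob` is hypothetical, not output from the original model), so treat it as an illustration rather than a drop-in fix:

```r
# Toy outcome mirroring the example above (not the original poster's data)
actual <- factor(c("1", "0", "0", "1", "1", "0"), levels = c("0", "1"))

# Hypothetical predicted probabilities, invented for illustration
prob <- c(0.91, 0.12, 0.35, 0.77, 0.64, 0.08)

# Label predictions "0"/"1" so they can match the outcome's labels
predicted <- factor(ifelse(prob > 0.5, "1", "0"), levels = levels(actual))

# Proportion of predictions that equal the actual outcome
accuracy <- mean(predicted == actual)
accuracy
```

With matching labels, `mean()` returns the share of correct predictions instead of 0.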

A couple of other things:

First, the following line in your code determines the predicted classes.

    lasso_predict[lasso_prob>.5] <- "pos"

After creating a confusion matrix for the predictions, you then run:

    lasso_predict[lasso_prob>.8] <- "pos"

This doesn't change lasso_predict, because predictions with probability greater than 0.5 were already all set to "pos". That's why both confusion matrices are the same. Reinitialize lasso_predict or create a new prediction vector to get a confusion matrix for the second case (or reverse the order of the code to set the lasso_predict values).
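
A minimal sketch of the "create a new prediction vector" option, assuming a plain vector of probabilities. The values in `lasso_prob` and `actual` are invented for illustration, and `pred_50` and `pred_80` are hypothetical names:

```r
# Hypothetical predicted probabilities and outcomes (toy data)
lasso_prob <- c(0.95, 0.62, 0.40, 0.85, 0.55, 0.10)
actual     <- c("1", "0", "0", "1", "1", "0")

# Build each prediction vector from scratch so one threshold
# cannot carry over into the other
pred_50 <- ifelse(lasso_prob > 0.5, "1", "0")
pred_80 <- ifelse(lasso_prob > 0.8, "1", "0")

# The two confusion matrices now differ
table(pred = pred_50, true = actual)
table(pred = pred_80, true = actual)
```

Because each vector is derived directly from the probabilities, raising the threshold to 0.8 actually reduces the number of positive predictions.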

Using the confusion matrix, the accuracy is the sum of the diagonal divided by the sum of all four values (although accuracy isn't necessarily a particularly good measure of model performance; see, for example, here and here).
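
As a sketch, here is that calculation applied to the counts from the confusion matrix in the question. The matrix is typed in by hand, and the diagonal only counts correct predictions because the rows and columns are in matching order (neg/0 first, pos/1 second):

```r
# Counts copied from the confusion matrix in the question
cm <- matrix(c(207840, 8697, 60865, 19960), nrow = 2,
             dimnames = list(pred = c("neg", "pos"), true = c("0", "1")))

# Accuracy: correct predictions (diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 3)

# For comparison: the accuracy of always predicting "0"
baseline <- sum(cm[, "0"]) / sum(cm)
round(baseline, 3)
```

This gives an accuracy of about 0.766, while always predicting "0" would already score about 0.728, which is one reason accuracy alone can be a weak measure here.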

It might be easier to keep track of the various predictions by adding them as columns to the test data frame, rather than generating lots of stand-alone vector objects.
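
A rough sketch of the data-frame approach, using a toy data frame. `test_df`, its column names, and the probabilities are all hypothetical stand-ins for the real test data and model output:

```r
# Toy stand-in for the test data, with hypothetical predicted probabilities
test_df <- data.frame(
  New_Product_Type = factor(c("1", "0", "0", "1", "1", "0"), levels = c("0", "1")),
  lasso_prob       = c(0.95, 0.62, 0.40, 0.85, 0.55, 0.10)
)

# Store the predictions for each threshold next to the actual outcome
test_df$pred_50 <- factor(ifelse(test_df$lasso_prob > 0.5, "1", "0"), levels = c("0", "1"))
test_df$pred_80 <- factor(ifelse(test_df$lasso_prob > 0.8, "1", "0"), levels = c("0", "1"))

# Accuracy for each threshold, computed from the columns
acc_50 <- mean(test_df$pred_50 == test_df$New_Product_Type)
acc_80 <- mean(test_df$pred_80 == test_df$New_Product_Type)
```

Keeping everything in one data frame means each prediction stays aligned with its row, and a confusion matrix for any threshold is a single `table()` call on two columns.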

Second, for a classification model, type.measure="auc", "class", or "deviance" is a better choice of cross-validation loss than type.measure="mse".
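
For illustration, a small sketch of cross-validating on AUC instead of MSE. It assumes the glmnet package is installed and uses simulated data, so the predictors, the outcome, and the name `cvfit_auc` are made up rather than taken from the original model:

```r
library(glmnet)

set.seed(100)  # set the seed before any random step

# Simulated predictors and a binary outcome (not the original data)
n <- 200
x <- matrix(rnorm(n * 5), nrow = n)
y <- factor(rbinom(n, 1, plogis(x[, 1] - 0.5 * x[, 2])))

# Lasso logistic regression, choosing lambda by cross-validated AUC
cvfit_auc <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                       type.measure = "auc")

# Lambda chosen by the one-standard-error rule
cvfit_auc$lambda.1se
```

Swapping in `type.measure = "class"` or `"deviance"` works the same way; only the criterion used to pick lambda changes.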