What does nnet minimise


#1

Our professor has asked us to implement a neural network classifier with a single hidden layer on the Wine Quality data. This is divided in two parts, so I will implement two models and these models have ordered classes. I tried using the nnet function in the nnet package for this assignment.

I've noticed that the process does not always converge, and when it does, different runs may converge to quite different values. I expected some differences, but those values are really far apart. I wanted to investigate, so I went through the documentation looking for the convergence criterion, but I couldn't find it mentioned anywhere. So my question is: does nnet minimise the training set error, or does it split the dataset into training and validation sets (though I couldn't find any such option) and minimise the validation error? Whatever the criterion is, the values are in the thousands initially and then in the hundreds, so I'm utterly confused.

Any help will be appreciated.


#2

I can offer only a semi-informed response, which is that neural networks are notorious for their opacity. It's like talking to a gifted analyst who can come up with useful answers based on specific training data sets that work on their corresponding training sets but refuses to divulge their assumptions or methods.


#3

Happy Holidays :smile:

I understand what you mean, but it would be nice to know. It'll help me to understand whether the model is actually doing right or not.

I was looking through the documentation of keras, and there I can choose any of the loss functions available here [most probably I can define my own loss functions too, but currently, that's not the issue]. Such an option will really be preferable.


#4

Happy New Year.

Luckily, your data set is large enough to partition into a training set and a test set. That allows you to train your model and then gauge its predictive power on a new set of data, which, when you think about it, is the whole point of any modeling technique.

help(nnet) will give you the function signature and the outputs. The example at the bottom will show you one technique to divide the data into training and test sets.

When I was learning R I was constantly muttering to myself that help needed a help page. Well, it has one:

help(help)

But the key to really understanding what you see as a result of help(nnet) is something you learned before college: f(x) = y or f(a,b,c,d) = y. That's right, functions. Although R has some procedural programming features, such as for loops, like those that dominate C, C++, Java, Python, etc., the majority of the time you are using R it's a matter of understanding which arguments to a function are mandatory, which are optional, and the class of each. And then understanding what the function returns.

Take a careful read of help(nnet). It has a ton of available arguments, but I doubt that your professor intends students to use them all.

Next, look at the section Values. You will learn that when you invoke nnet(arguments) you get back (surprise!) an nnet object. So, for example,

my_net <- nnet(my_arguments)
str(my_net)

will show you what's included.
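As a rough illustration of that inspection step, here is a minimal sketch on the built-in iris data. The seed, size and maxit values are arbitrary choices for the sketch, not recommendations for the wine data.

```r
library(nnet)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Small single-hidden-layer net on iris; size and maxit are illustrative choices
my_net <- nnet(Species ~ ., data = iris, size = 2, maxit = 50, trace = FALSE)

class(my_net)        # the class vector includes "nnet"
names(my_net)        # components stored in the fitted object
my_net$value         # final value of the fitting criterion (plus decay term)
my_net$convergence   # 0 if the optimiser stopped by itself, 1 if maxit was reached
```

The value component is the same number that trace = TRUE prints as the final value, so it is the natural place to start when asking what is being minimised.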

As to your basic question, my (again) semi-informed guess is that it is minimizing residuals: it optimizes the value of the fitting criterion plus the weight decay term, and stops either when that converges or when the maximum number of iterations is reached.

To help more, and to draw others into the conversation, a reproducible example, called a reprex, would be extremely useful.


#5

This is divided in two parts, so I will implement two models and these models have ordered classes.

You'll want to use a single model here. Two models is not the same as a model with a single hidden layer. (Let me know if you have more questions about this and I'll elaborate.)

If you want to train a neural net with a single hidden layer, you can do that with nnet, although I would recommend using keras, as the resources will be much more up to date. Nonetheless, an nnet example looks like:

library(nnet)

net <- nnet(
  Species ~ .,
  data = iris,
  size = 50,    # number of nodes in hidden layer
  rang = 0.1,   # initial weights uniformly from [-0.1, 0.1]
  decay = 5e-4  # decrease learning rate over time
)

In general, training a neural net is not a deterministic process, so there's no reason to expect the weights to be the same. You might expect the predictions to be somewhat similar, but even those can vary from net to net.


#6

Thanks for the reply, but I don't think I've made my question clear. My problem is not the implementation. I want to know what is the metric being optimised by the nnet function. I don't think a reproducible example is relevant for that question. I can't find anything other than the following in the documentation, which does not really answer my question:

Optimization is done via the BFGS method of optim.

But still, if it helps, here's what I have done:

# loading package
library(package = "nnet")

# loading dataset
red_wine <- read.csv2(file = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
                      header = TRUE)

# modifying dataset to avoid very low class proportions
red_wine$quality <- sapply(X = red_wine[, 12],
                           FUN = function(x)
                           {
                             if((x == 3) | (x == 4))
                             {
                               x <- "low"
                             } else if(x == 5)
                             {
                               x <- "lower_middle"
                             } else if(x == 6)
                             {
                               x <- "higher_middle"
                             } else
                             {
                               x <- "high"
                             }
                           })
red_wine$quality <- factor(x = red_wine[, 12],
                           levels = c("low",
                                      "lower_middle",
                                      "higher_middle",
                                      "high"),
                           ordered = TRUE)

# splitting train and test subsets
red_indices <- sample(x = c(TRUE, FALSE),
                      size = nrow(red_wine),
                      replace = TRUE,
                      prob = c(0.8, 0.2))
red_train <- red_wine[red_indices,]
red_test <- red_wine[!red_indices,]

# implementing single hidden layer neural network
no_hidden_nodes <- 30
max_iterations <- 500
red_nn_model <- nnet::nnet(x = red_train[, -12],
                           y = class.ind(red_train[, 12]),
                           size = no_hidden_nodes,
                           softmax = TRUE,
                           maxit = max_iterations,
                           trace = TRUE)
#> # weights:  484
#> initial  value 1830.332882 
#> iter  10 value 1320.948431
#> iter  20 value 1282.400645
#> iter  30 value 1215.595921
#> iter  40 value 1146.536261
#> iter  50 value 1093.389122
#> iter  60 value 1048.528644
#> iter  70 value 1017.228076
#> iter  80 value 992.588107
#> iter  90 value 982.810268
#> iter 100 value 978.270736
#> iter 110 value 971.337690
#> iter 120 value 954.402500
#> iter 130 value 928.415571
#> iter 140 value 900.070623
#> iter 150 value 879.767641
#> iter 160 value 858.583582
#> iter 170 value 840.634227
#> iter 180 value 828.451394
#> iter 190 value 827.021680
#> iter 200 value 824.994217
#> iter 210 value 823.199409
#> iter 220 value 819.632886
#> iter 230 value 815.776615
#> iter 240 value 810.148442
#> iter 250 value 804.609398
#> iter 260 value 799.187227
#> iter 270 value 794.894583
#> iter 280 value 791.952878
#> iter 290 value 791.093384
#> iter 300 value 790.699234
#> iter 310 value 790.200431
#> iter 320 value 787.894134
#> iter 330 value 784.905971
#> iter 340 value 783.498939
#> iter 350 value 781.796986
#> iter 360 value 780.267908
#> iter 370 value 778.546393
#> iter 380 value 775.098411
#> iter 390 value 772.903257
#> iter 400 value 770.701749
#> iter 410 value 769.321650
#> iter 420 value 768.203662
#> iter 430 value 767.204172
#> iter 440 value 766.122717
#> iter 450 value 765.488524
#> iter 460 value 764.656615
#> iter 470 value 764.062411
#> iter 480 value 763.643528
#> iter 490 value 763.381490
#> iter 500 value 763.266544
#> final  value 763.266544 
#> stopped after 500 iterations

# checking performance
predictions <- factor(x = predict(object = red_nn_model,
                                  newdata = red_test[, -12],
                                  type = "class"),
                      levels = c("low",
                                 "lower_middle",
                                 "higher_middle",
                                 "high"),
                      ordered = TRUE)
(confusion_matrix <- table(Predicted = predictions,
                           Actual = red_test[, 12]))
#>                Actual
#> Predicted       low lower_middle higher_middle high
#>   low             3            2             2    0
#>   lower_middle    8          102            50    2
#>   higher_middle   5           45            84   17
#>   high            0            2            18   21

Created on 2018-12-28 by the reprex package (v0.2.1)

As you can see, there's a lot of misclassification. I know that I'll have to do some trial and error with the number of hidden nodes. But still, I wouldn't expect so many misclassifications between the 2nd and 3rd classes, as there's a lot of data on those two classes.
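To put a number on it, here is a small sketch that re-enters the confusion matrix printed above by hand and computes the overall accuracy and the per-class recall. Only the matrix values come from the run above; the variable names are just for illustration.

```r
classes <- c("low", "lower_middle", "higher_middle", "high")

# Confusion matrix copied from the output above (rows = predicted, columns = actual)
confusion <- matrix(
  c(3,   2,  2,  0,
    8, 102, 50,  2,
    5,  45, 84, 17,
    0,   2, 18, 21),
  nrow = 4, byrow = TRUE,
  dimnames = list(Predicted = classes, Actual = classes)
)

# Overall accuracy: correctly classified test rows over all test rows (210 / 361)
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class recall: share of each actual class that was predicted correctly
recall <- diag(confusion) / colSums(confusion)

round(accuracy, 3)
round(recall, 3)
```

That is 210 correct out of 361 test rows, so roughly 58% accuracy for this particular split.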


#7

Thanks for the response.

I am not really comparing or combining the two models for red and white wines. I am considering the two data sets separately.

I understand. But as you will note from my example above, a lot of observations from the 2nd class are misclassified into the 3rd class, and vice versa. Each of these classes has approximately 40% of the data set, so this is very surprising to me. I would have expected many more misclassifications for the 1st and 4th classes.

That being said, I repeated this around 20 times, and I got perfect classification twice. Otherwise, the misclassification patterns were more or less the same.


#8

The reprex is extremely helpful; it gives us something concrete to talk about.

When I ran your code I got two different results:

+                            Actual = red_test[, 12]))
               Actual
Predicted       low lower_middle higher_middle high
  low             1            1             0    0
  lower_middle   10           96            48    4
  higher_middle   3           34            69   18
  high            0            1             9   18

               Actual
Predicted       low lower_middle higher_middle high
  low             2            5             0    0
  lower_middle    7           97            40    2
  higher_middle   2           25            66   23
  high            0            2            13   24

Nothing in the arguments or data changed. Just to be sure, is that what you mean by 'convergence'? (BTW: I'm a newbie in NN, but I'm an old hand at troubleshooting once I understand the question, so bear with me.)


#9

decay is the weight decay, not the learning rate decay. nnet doesn't use SGD to train the net (it uses BFGS), so learning rate decay doesn't make sense in that context. Also, in your example there's no split between a training set and a test set, so you risk overfitting your model to the training set and, perhaps more worryingly, you won't be able to compute a reliable estimate of the generalization gap. You could use the subset argument or, as shown in the help, preprocess the dataset and split it into training and test sets. However, I think it should be clear by now that the nnet API is quite awkward. I strongly suggest using keras or h2o instead.

Actually, it ought to be completely deterministic/reproducible, once all seeds are correctly specified. From the nnet documentation, it's not immediately clear which seeds have to be set where, but knowing Venables & Ripley, I'm fairly confident that just adding a simple line such as

set.seed(1)

at the beginning of your script will make your training process completely deterministic. Again, I suggest using keras or h2o rather than nnet.
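A quick way to check this claim is to fit the same net twice with the same seed and compare the weights. This is only a sketch on iris; fit_once is a hypothetical helper, and the seeds, size and maxit are arbitrary.

```r
library(nnet)

# Hypothetical helper: reset the seed, then fit a small net on iris
fit_once <- function(seed) {
  set.seed(seed)  # the random starting weights come from R's random number generator
  nnet(Species ~ ., data = iris, size = 3, maxit = 50, trace = FALSE)
}

a     <- fit_once(1)
b     <- fit_once(1)
c_fit <- fit_once(2)

identical(a$wts, b$wts)               # same seed: the fitted weights match exactly
isTRUE(all.equal(a$wts, c_fit$wts))   # different seed: typically FALSE
```

If the first comparison is TRUE on your machine, the run-to-run variation in the weights comes from the random starting point, not from the optimiser.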


#10

Of course, the documentation contains the fitting criterion, but as could be expected of Brian Ripley & William Venables, it's a bit terse. Also, it has to be noted that the vast majority of the packages developed by these two geniuses, to whom we all owe a great deal, were meant to be used with this book by your side:

http://www.stats.ox.ac.uk/pub/MASS4/

So, that also contributes to the terseness of the documentation (users of their packages are expected to be familiar with the book's contents, and with their documentation style). Here's the fitting criterion:

linout	
      switch for linear output units. Default logistic output units.

entropy	
      switch for entropy (= maximum conditional likelihood) fitting. Default by least-squares.

softmax	
      switch for softmax (log-linear model) and maximum conditional likelihood fitting. linout, entropy, softmax and censored are mutually exclusive.

censored	
       A variant on softmax, in which non-zero targets mean possible classes. Thus for softmax a row of (0, 1, 1) means one example each of classes 2 and 3, but for censored it means one example whose class is only known to be 2 or 3.

What these lines are saying is that the parameters linout, entropy, softmax and censored are mutually exclusive, and that they define the loss function. In particular, according to this documentation I would think that for classification the best setting would be softmax = TRUE:

softmax	
      switch for softmax (log-linear model) and maximum conditional likelihood fitting. linout, entropy, softmax and censored are mutually exclusive.

However, what puzzles me is that the example in the documentation, which precisely illustrates how to use nnet in order to perform classification on the iris dataset (thus, a problem fairly similar to yours) doesn't set any of these four parameters to TRUE. According to my interpretation of the documentation, this means that, in this example, Venables & Ripley are effectively using least squares as a fitting criterion:

entropy	
      switch for entropy (= maximum conditional likelihood) fitting. Default by least-squares.

As we can read, the default (entropy=FALSE) should correspond to least-squares fitting.
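One way to test this reading without digging through the source is to recompute the reported value by hand. The sketch below assumes (my interpretation, consistent with the documentation quoted above, not something the help page states in these words) that with softmax = TRUE the reported value is the negative log-likelihood summed over the rows passed to nnet, plus decay times the sum of squared weights. The iris data, seed, size and decay are arbitrary choices.

```r
library(nnet)

set.seed(1)                          # arbitrary; fixes the random starting weights
x       <- as.matrix(iris[, 1:4])
targets <- class.ind(iris$Species)   # one 0/1 indicator column per class
decay   <- 0.01                      # illustrative weight decay

fit <- nnet(x, targets, size = 3, softmax = TRUE,
            decay = decay, maxit = 100, trace = FALSE)

p <- fit$fitted.values               # fitted class probabilities for the training rows

# Assumed criterion: summed negative log-likelihood plus the weight decay penalty
neg_log_lik <- -sum(log(p[targets == 1]))
penalty     <- decay * sum(fit$wts^2)

c(reported = fit$value, recomputed = neg_log_lik + penalty)
```

If the two numbers agree, the criterion is computed only on the data handed to nnet (no internal validation split), and it is a sum over rows rather than a mean. That would also explain why the trace starts in the thousands: with four classes and near-uniform starting probabilities, each training row contributes about log(4), or roughly 1.39.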

However, rather than trying to perform the exegesis (a term that Bill Venables used to love :slightly_smiling_face:) of the text, why don't you use keras or h2o? I'm not sure I understand why you have to use nnet.


PS if you're absolutely determined to use nnet, and you want to be 100% sure which loss function is being used, you may try to ask a question here

https://stat.ethz.ch/mailman/listinfo/r-help

Be warned that the bar for posting on the R-help mailing list is set relatively high, so I strongly recommend against posting there before you've become very familiar with this document

https://www.r-project.org/posting-guide.html

Among other things, you'll need to be familiar with the documentation of nnet before posting, include a reproducible example in your post (like the one you posted here), and be as specific and clear as possible about your question.

Finally, another tool which can help you use nnet is

https://cran.r-project.org/web/packages/validann/

Maybe you could even try to contact Greer Humphrey by mail and ask her directly which loss function is being used (I believe she's very familiar with nnet), though she may of course redirect you to either the R-help mailing list or Stack Overflow. On Stack Overflow the current policy is to close posts which ask for details about specific packages, so you might not be able to ask there.

I say, save yourself the hassle and use one of the other two packages, but of course the choice is yours. Best of luck!


#11

Correct me if I'm wrong, but I think you obtained the two results by running the complete code twice.

In that case, the train and test subsets will differ: note that the number of actual observations in each class varies between the two cases. So I don't think these two are comparable.

I didn't really mean the convergence of the confusion matrices. You'll note that nnet tries to minimise a certain quantity, which is possibly the least squares error. The iterations stop once the decrease in this quantity falls below some cutoff. I wanted to know what this specific quantity is.
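A minimal sketch of how the split could be held fixed between runs, so that the confusion matrices become comparable. split_rows is a hypothetical helper, the seed is arbitrary, and 1599 is the number of rows in the red wine data.

```r
n <- 1599  # rows in the red wine data

# Hypothetical helper: same sampling call as in the reprex, but with a fixed seed
split_rows <- function(seed) {
  set.seed(seed)
  sample(x = c(TRUE, FALSE), size = n, replace = TRUE, prob = c(0.8, 0.2))
}

first  <- split_rows(42)
second <- split_rows(42)

identical(first, second)   # same seed: the same rows land in the training set
mean(first)                # share of rows assigned to training, close to 0.8
```

Note that sample() with prob assigns each row independently, so the training share is only approximately 80% and changes from seed to seed.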

Thanks


#12

Thanks for the nice response.

I've no compulsion for using nnet. It's just that I first tried with it, and wanted to know what's going on inside.

I'll use keras or h2o from now on, and I'll certainly try validann.

P.S. Greer Humphrey will be she, right?


#13

Whoops, you're right. I knew about a couple celebrities with a Greer in their name (not sure how famous the second one is in the US, but he played in my hometown team, so I had heard about him :slightly_smiling_face:), thus I inadvertently assumed it was a boy's name. However it's actually a girl's name, and for boys it's only used as a middle name. I corrected my post accordingly.


#14

You probably never saw the actor Greer Garson, best remembered for her WWII-era Mrs. Miniver.


#15

Aaaand here is your fully reproducible solution, in an RStudio Cloud project:

https://rstudio.cloud/project/160813

There are two scripts:

  1. First, preprocess the data by running the script 1_preprocess_wine_data.R;
  2. Then, define, train and evaluate your model with 2_train_and_evaluate_model.R. As you can see, I tested different values for the number of units in the hidden layer and for the dropout rate: the best combination, as far as validation accuracy is concerned, was the baseline one.

Note the simplicity of the keras API:

  • you need just one function (use_session_with_seed()) to make your whole analysis reproducible, though note some caveats
  • adding modern regularization techniques such as dropout or batch norm is immediate
  • the number of layers and hidden units/layer, the loss function and the type of optimizer used are easy to identify & modify as desired
  • plotting the training history, as well as making inference on the test set, is also very easy.

Have fun trying different hyperparameters, if you want, but I don't think you can do much better than this, unless you add more layers.


#16

Thanks a lot.

Though this does not answer my original question regarding nnet, I'm marking this as the correct answer (this one provides good references), since it gives a nice solution to my project, and it's really really helpful.


closed #17

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.


#18

Just a closing note: it looks like you can't get a high classification accuracy on the red wine classification problem. If you look at the famous Self-Normalizing Neural Networks paper, you'll see that even after trying much more varied architectures than those we were constrained to use (2-layer NNs), they could only get a top accuracy of 0.63:

dataset           N     M   SNN     MS      HW      ResNet  BN      WN      LN
wine-quality-red  1599  12  0.6300  0.6250  0.5625  0.6150  0.5450  0.5575  0.6100