How do I build a model using GloVe word embeddings and predict on test data using text2vec in R?


#1

I am building a model that classifies text into two categories (each comment is either actionable or not) using GloVe word embeddings. My data has two columns: one with the text (comments) and one with a binary target variable (whether a comment is actionable). I was able to generate GloVe word embeddings for the text using the following code from the text2vec documentation.

glove_model <- GlobalVectors$new(word_vectors_size = 50, vocabulary = glove_pruned_vocab, x_max = 20L)
# fit model and get word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm, n_iter = 20, convergence_tol = -1)
word_vectors_context <- glove_model$components
# sum the main and context vectors to get the final embeddings
word_vectors <- word_vectors_main + t(word_vectors_context)

How do I build a model using these word embeddings and generate predictions on test data?


#2

The keras example “pretrained_word_embeddings” explains how to do this.

(This assumes you want to use keras to train a neural network that uses your embedding as an input layer.)

In a nutshell, you include the embedding as a frozen layer, i.e. explicitly tell the network not to update the weights in your embedding layer.

The essential code snippet from that page is below. Note the trainable = FALSE:

embedding_layer <- layer_embedding(
  input_dim = num_words,
  output_dim = EMBEDDING_DIM,
  weights = list(embedding_matrix),
  input_length = MAX_SEQUENCE_LENGTH,
  trainable = FALSE
)
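
The embedding_matrix passed as weights has to be built from your own word vectors. Below is a minimal sketch of one way to do that in base R, assuming word_vectors has one row per term with the terms as row names (as text2vec returns it). The toy word_vectors and the word_index mapping are stand-ins I made up for illustration. Index 0 is reserved for padding, so the matrix gets one extra all-zero row.

```r
# Toy stand-in for the GloVe output: one row per term, rownames are the terms.
set.seed(1)
EMBEDDING_DIM <- 4
word_vectors <- matrix(rnorm(3 * EMBEDDING_DIM), nrow = 3,
                       dimnames = list(c("good", "bad", "service"), NULL))

# Hypothetical index: term i maps to integer i; 0 is reserved for padding.
word_index <- setNames(seq_len(nrow(word_vectors)), rownames(word_vectors))

# Row 1 of the matrix corresponds to index 0 (padding) and stays all zeros,
# so the matrix has one more row than there are terms.
num_words <- length(word_index) + 1
embedding_matrix <- matrix(0, nrow = num_words, ncol = EMBEDDING_DIM)
embedding_matrix[word_index + 1, ] <- word_vectors[names(word_index), ]
```

With real data, EMBEDDING_DIM would be ncol(word_vectors), which is 50 in the question's code.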

Then you use this frozen layer in your model:

preds <- sequence_input %>%
  embedding_layer %>% 
  layer_conv_1d(filters = 128, kernel_size = 5, activation = 'relu') %>% 
  layer_max_pooling_1d(pool_size = 5) %>% 
  layer_conv_1d(filters = 128, kernel_size = 5, activation = 'relu') %>% 
  layer_max_pooling_1d(pool_size = 5) %>% 
  layer_conv_1d(filters = 128, kernel_size = 5, activation = 'relu') %>% 
  layer_max_pooling_1d(pool_size = 35) %>% 
  layer_flatten() %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dense(units = length(labels_index), activation = 'softmax')
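
To predict on test data, the test comments have to be turned into integer sequences with the same word-to-index mapping that was used for training. Here is a rough sketch in base R. The encode_texts helper, the toy word_index and the simple regex tokenizer are my own assumptions, not part of text2vec or keras. It left-pads with 0 and keeps the last tokens of long texts, which as far as I know is what keras's pad_sequences() does by default.

```r
# Hypothetical index built from the training vocabulary; 0 is reserved for padding.
word_index <- c(good = 1L, bad = 2L, service = 3L)
MAX_SEQUENCE_LENGTH <- 5

# Map each text to a fixed-length integer vector. Unknown words are dropped,
# short sequences are left-padded with 0, long ones keep their last tokens.
encode_texts <- function(texts, word_index, maxlen) {
  tokens <- strsplit(tolower(texts), "[^a-z0-9]+")
  t(vapply(tokens, function(tok) {
    ids <- unname(word_index[tok])
    ids <- ids[!is.na(ids)]
    ids <- tail(ids, maxlen)
    c(integer(maxlen - length(ids)), ids)
  }, integer(maxlen)))
}

test_comments <- c("Bad service", "good good good good good good", "unknown words only")
x_test <- encode_texts(test_comments, word_index, MAX_SEQUENCE_LENGTH)
```

Once the network is compiled and fitted, x_test can be passed to predict() on the fitted model object (for example predict(model, x_test), where model is whatever you named it), provided the training sequences were encoded the same way.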

#3

Thanks a lot, Andrie. It worked and drastically improved my model’s cv_accuracy.