NLP in R: I can't have any new words in test/production data?

I've got a continuous response and 3 "comment" field features. I've found by the most naive/clumsy approach below, 1), and from people telling me, that you can't do any NLP in R where your fitted model will see new (unseen) words in the test/production data, because when you make a document matrix of words, they are columns, and R can't predict on new columns/missing old columns. One person went so far as to tell me that this applies to all NLP in general, and that things like Laplace smoothing are just "faking it." Reddit is great :/.

I can get more specific about code breaking, but this is much more about general advice/strategy, than it is about getting something to work without error.

I've been trying 3 approaches:

  1. tm and randomForest, a la:
    This is the most naive approach; I came to it as a last resort. This is where I hit the unseen-words-in-prediction-data problem.

    I ran into some issues here, that could be caused by any combination of the following:
    - I read in and encoded the data in dplyr, and it uses data.table.
    - With cv.glmnet, I first tried binning my continuous response variable into classes. When it folds a subset of the data, some of my classes have only 2-6 data points (out of 75k! :/), so I guess some folds ended up with 1 or 0 of them.
    - I then tried just changing the "family" argument to gaussian and keeping my response continuous. This appeared to work, but when running his predict code I got 438 predictions (instead of 22k).
    - I'm still not sure whether the central issue would reappear; that is, that I can't use any NLP model in R on new data if the new data has new words.

  3. There's tidytext, which lets me build a document matrix just fine, with tf-idf or plain counts, but I'm not sure how that output format works as an input feature for ANY model that won't run into the same main problem with new words.

To reiterate, my question is: is it possible, in R, to use a model trained on a text feature (a document matrix feature) on new test or production data that contains new/unseen words? Is this possible in NLP in general? (I think this is probably a yes, but the guy on reddit was so confident.)

I know this is not really a typically well-structured question, so please let me know if I can improve it.



Hmmm, I think I understand what you're referring to, but without an example it's hard to be sure. Could you please add a reproducible example? At the very least, describe in detail 1) your training set, i.e., your inputs and your output, and 2) your objective. Thanks!

I understand how to "write a good dev question." This question is abstracted. It is package or function agnostic. In other words, it is a general question about whether this is possible in R, in any way. I'd be happy with any blog or github repo showing a model trained in R (any package! I just want 1!) that doesn't break when used on new data with unseen words.

See how this is easily achieved in sklearn @ 45:00-47:00: See v basic laplace in sklearn: I can't believe that we can't do this in R (this is what I was told on reddit).
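For reference, add-one (Laplace) smoothing itself is only a few lines, and nothing about it is tied to one language. Below is a minimal pure-Python sketch, not sklearn's implementation: the function name `laplace_word_probs`, the `alpha` parameter and the toy sentence are made up for illustration. The idea is to reserve one extra vocabulary slot so that any unseen word gets a small non-zero probability instead of breaking the model.

```python
from collections import Counter


def laplace_word_probs(train_tokens, alpha=1.0):
    """Return a function giving the smoothed probability of a word.

    Hypothetical helper for illustration only.
    """
    counts = Counter(train_tokens)
    # One extra slot in the vocabulary stands in for every unseen word.
    vocab_size = len(counts) + 1
    total = sum(counts.values())

    def prob(word):
        # Unseen words have count 0 but still receive alpha in the numerator.
        return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

    return prob


prob = laplace_word_probs("the cat sat on the mat".split())
p_seen = prob("the")    # word that occurs in the training text
p_unseen = prob("dog")  # word never seen in training
print(p_seen, p_unseen)
```

The unseen word gets a lower probability than any seen word, but the computation never fails for lack of a matching entry.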

I'm not sure why people don't like being asked to provide a reproducible example. A question which starts with

is not very abstracted (but that's actually a good thing! See below). Also, expecting potential answerers to go through a multi-page link, and then to guess what exactly you have tried and where you got stuck, is far less likely to result in any kind of answer than showing us actual code which we can reproduce, and explaining point by point which part worked and which one didn't.

I don't think your question is very clear, but if I interpreted you correctly, then you asked "what's the general R solution to the age-old problem of Out-Of-Vocabulary (OOV) words in NLP?". This is a bit like asking "What can I do to reduce the generalization gap of a neural network?": very abstract, and thus very hard (maybe even impossible) to answer. There isn't a single solution to OOV in NLP which works for all kinds of NLP tasks (reading comprehension; natural language inference; machine translation; named entity recognition; constituency parsing; language modeling; sentiment analysis; skip-thoughts; autoencoding; etc.). But there could be a good answer to a more specific question about how to deal with OOV for a particular task and on a specific data set, in the same way as the very specific question "What can I do to reduce the generalization gap of architecture so and so, on train/test set so and so, with loss function so and so, after having tried X and Y..." is much more likely to get an answer than its more abstract cousin.
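For a plain bag-of-words model like the one in the question, the simplest common treatment is to fix the vocabulary at training time and build every later matrix against it: unseen words are skipped, and training words that are absent become zero columns. Here is a minimal sketch in pure Python, since the idea does not depend on the language; `fit_vocabulary`, `transform` and the toy comments are hypothetical names and data for illustration, not any package's API.

```python
from collections import Counter


def fit_vocabulary(train_docs):
    """Map each word seen in training to a fixed column index."""
    words = sorted({w for doc in train_docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def transform(docs, vocab):
    """Count words into the training columns; words outside vocab are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)  # same width as the training matrix, always
        for word, n in Counter(doc.lower().split()).items():
            if word in vocab:   # "terrible" and "parking" fall through here
                row[vocab[word]] = n
        rows.append(row)
    return rows


train_docs = ["good service", "bad service bad food"]
test_docs = ["good food terrible parking"]

vocab = fit_vocabulary(train_docs)
X_train = transform(train_docs, vocab)
X_test = transform(test_docs, vocab)
print(sorted(vocab), X_train, X_test)
```

Because the test matrix has exactly the training columns, a model fitted on `X_train` can predict on `X_test`. The cost is that the information carried by the unseen words is simply discarded.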

A sane approach

Anyway, I'll make an attempt at answering your generic question. If the task you're attempting to solve can benefit from word embeddings (e.g., language modeling), then the industry-standard solution is to just download Facebook's fastText pretrained word embeddings. You can find them here:
At 1 million words for the English model trained on Wikipedia, or 2 million for the one trained on Common Crawl, I dare you to find any OOV words at test time :grinning:

Of course, if you are inventing words on the spot (e.g., gearshift), then you won't find them in a pretrained fasttext model. However, you can still use fasttext to infer a word embedding even for an OOV word: in that case you will need one of the models trained with subword information, such as this one. See these examples on how to do that in practice:

and read this paper if you're interested in the theory behind the approach.
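To give a feel for why subword information helps with OOV words, here is a rough pure-Python sketch of the general idea: break a word into character n-grams, look each n-gram up in a table of vectors, and average them. This is not fastText's code. The n-gram range (3 to 5), the CRC32 hashing, the table size and the random vectors are all assumptions made for illustration; in a real model the n-gram vectors are learned during training.

```python
import random
import zlib


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


def make_bucket_table(n_buckets=1000, dim=8, seed=0):
    """Stand-in for learned n-gram vectors: fixed pseudo-random rows."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_buckets)]


def embed(word, table):
    """Average the bucket vectors of the word's n-grams."""
    grams = char_ngrams(word)
    dim = len(table[0])
    if not grams:  # e.g. the empty string has no n-grams
        return [0.0] * dim
    total = [0.0] * dim
    for gram in grams:
        # A deterministic hash sends every n-gram, seen or not, to some bucket.
        row = table[zlib.crc32(gram.encode("utf-8")) % len(table)]
        total = [t + r for t, r in zip(total, row)]
    return [t / len(grams) for t in total]


table = make_bucket_table()
vec = embed("gearshift", table)  # works even though no word list contains it
print(char_ngrams("cat"), len(vec))
```

Since every string maps to some set of n-gram buckets, there is no word for which the lookup fails, which is the property that matters for OOV handling.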

Note that fasttext is not a Python module (though a Python wrapper exists: see below), but it's rather a library which you install & build under Linux or OSX. You can easily use it from a bash script, but if you're scared of scripts :slight_smile: you'll have to use reticulate and install the gensim Python library, since gensim includes a Python wrapper for fasttext. I think this may be more complicated and error-prone than just running fasttext as a system command, but if you really want to risk your sanity :slightly_smiling_face: here are two SO posts to help you:

Sadly, I don't think there are R packages which have the same OOV handling capabilities as gensim, but maybe you could try quanteda: Ken Benoit is a nice bloke and if you ask a question on SO with the tags text-mining and quanteda, he may answer it himself (you may even try to drop him a line, if it's just to ask about quanteda's OOV capabilities). One thing you'll like for sure about quanteda is how fast it is! Another guy who may be willing to help is Sebastian Ruder: AFAIK, he's mostly a Python user, but his knowledge of modern NLP is so vast that he may still be able to help.

An insane approach

If even fasttext OOV capabilities are not good enough for you, you may try to implement your own Deep Learning model in keras or tensorflow (both available in R) to:

  • learn word embeddings on the fly
  • mimic word embeddings using subword RNNs: note that there's actually code for MIMICK, but that will be useless to you because it's in Python and it uses the NN library dynet: you may risk running it under R using reticulate, but I don't foresee that ending well
  • implement Hybrid Word-Character Models to achieve Open Vocabulary modeling: this has been developed for the Neural Machine Translation task, rather than for language modeling, and again there's code for it, but it's in Matlab, so you'd have to reimplement it in keras or TF
  • (insanity level: :exploding_head::exploding_head::exploding_head::exploding_head::exploding_head:): you could actually use BERT to learn embeddings for OOV words based on context. BERT is written in Tensorflow, thus you may be able to run it with the tensorflow package, and since it's the most advanced language understanding model available, I'm willing to bet that it won't have any issues with your OOV words. However, this is the NLP equivalent of killing a fly with the Death Star!