NLP packages for specific measurements

Hello, for my MSc thesis I am trying to find methods to assess text quality. I am fairly new to text mining with R. I do not currently need any programming help; I am looking for package information.

I have found some text quality measurements in R packages (quanteda, tidytext, openNLP, and qdap), but there are others that I cannot find, which leads me here.

I cannot seem to find an R package that calculates (1) the percentage of grammatical errors or (2) the percentage of abbreviations (specifically in a medical context).

Any help would be appreciated.

For abbreviations, I searched rseek and haven't found an R package. There are, of course, many glossaries from which you could prepare a list to match your corpus against.
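
As a minimal sketch of that glossary idea in base R: the `glossary` vector, the `abbreviation_pct` helper, and the sample notes below are all made up for illustration, and a real medical glossary would need to be prepared from a published list. It assumes plain lower-cased token matching is good enough, which will miscount abbreviations that are also ordinary words.

```r
# Hypothetical glossary; in practice, build this from a published
# medical abbreviation list.
glossary <- c("bp", "hr", "mi", "copd", "prn")

# Percentage of tokens in `text` that match a glossary entry.
abbreviation_pct <- function(text, glossary) {
  # Lower-case, then split on anything that is not a letter or digit.
  tokens <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
  # Drop empty strings left by leading separators.
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(NA_real_)
  100 * mean(tokens %in% tolower(glossary))
}

# Made-up example documents.
notes <- c(
  "Pt with COPD, BP stable, HR 82.",
  "No history of MI. Continue inhaler PRN."
)

# One percentage per document.
sapply(notes, abbreviation_pct, glossary = glossary, USE.NAMES = FALSE)
```

The denominator here is all tokens, including numbers; if you want the percentage relative to words only, filter the tokens before taking the mean.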

Grammatical errors may be harder. You should look at the packages in the CRAN NLP Task View. However, most of what I've seen in the way of grammatical error detection seems to be done in Java or Python.

Text quality (like any quality, as opposed to quantity) can be devilishly difficult to assess programmatically.

I have two suggestions:

  • have a look at the Text Mining with R book (if you haven't already); you will find a lot of inspiration there: https://www.tidytextmining.com/
  • also consider the udpipe package, especially if English is not your primary language of interest. It has many language-agnostic functions and a powerful lemmatizer.

I did not know about the CRAN NLP Task View. Thank you, this is excellent!

Thank you kindly for your help. I know about tidytext but will try to explore further options.