Best practices for generating parsers in R


Hi all—

I am hoping to get some advice/assurance on parsing some text data in R.


To the best of my Googling, there’s no obvious way to generate a parser in R based on an (E)BNF grammar of some sort (happy to be corrected about this :tada:). I recently came across ropenscilabs/gramr, and saw that the package was using a JavaScript package, write-good, to do the heavy lifting (nifty!). So, I thought I’d try out a similar thing using Nearley, a JavaScript parser generator toolkit.

Reprex: checking a French-to-English dictionary

The data that I work with are dictionaries formatted as backslash-coded lines, which is a relatively common format within [endangered] language documentation work (see a longer example here). Below, I’ve made a toy French-to-English dictionary:


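library(tidyverse) # read_lines(), tibble(), extract(), mutate(), %>%
library(zoo)       # na.locf()

# Toy lexicon: one backslash-coded field per line, blank line between entries
lexicon <-
'\\lx rouge
\\ps adjective
\\de red
\\xv La chaise est rouge
\\xe The chair is red

\\lx bonjour
\\de hello
\\ps exclamation

\\lx parler
\\ps verb
\\de speak
\\xv Parlez-vous français?'

# One row per line, split into the backslash code and its value
lexicon_df <-
    read_lines(file = lexicon) %>%
    tibble(line = seq_along(.), data = .) %>%
    extract(col = data,
            regex = "\\\\([a-z]+)\\s(.*)",
            into = c("code", "value"),
            remove = FALSE) %>%
    # Carry the line number of each \lx line forward as an entry id
    mutate(lx_id = ifelse(code == "lx", line, NA) %>% na.locf(na.rm = FALSE))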

I’ve found the tidyverse a great way to work with many aspects of the data, so much of my workflow consists of working on a data frame that looks like:

line  data                     code  value                lx_id
1     \lx rouge                lx    rouge                1
2     \ps adjective            ps    adjective            1
3     \de red                  de    red                  1
4     \xv La chaise est rouge  xv    La chaise est rouge  1
5     \xe The chair is red     xe    The chair is red     1
6                              NA    NA                   1

For example, I can use assertr::verify to make sure all the part-of-speech values (adjective, noun, etc.) in the ps lines are valid. Besides validating values, it is also important to validate the order of the codes within each entry, and this is the part I haven’t quite worked out how to do [well] in R.
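
To make the value check concrete, here is a minimal sketch in base R only, so that it runs on its own. The `toy_df` data frame and the `valid_ps` vector are made up for illustration and are not part of the real lexicon.

```r
# Hypothetical toy data in the same shape as lexicon_df (invented for this sketch)
toy_df <- data.frame(
  line  = 1:4,
  code  = c("lx", "ps", "de", NA),
  value = c("rouge", "adjective", "red", NA),
  stringsAsFactors = FALSE
)

# Assumed list of allowed parts of speech; not an exhaustive inventory
valid_ps <- c("adjective", "exclamation", "noun", "verb")

# A row passes if it is blank, is not a ps line, or is a ps line with an allowed value
ps_ok <- is.na(toy_df$code) | toy_df$code != "ps" | toy_df$value %in% valid_ps

# Line numbers that fail the check (empty for this toy data)
bad_lines <- toy_df$line[!ps_ok]

stopifnot(all(ps_ok))
```

With assertr installed, the same condition should be usable as `assertr::verify(toy_df, is.na(code) | code != "ps" | value %in% valid_ps)`, which fits more naturally into a pipeline.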

Question/code review: how can the following be done better?

Following Jeroen Ooms’s ‘Using NPM packages in V8’ vignette, I experimented writing a compile_grammar R function (GitHub gist here). The function takes a Nearley grammar, such as lexicon_grammar below, and uses V8 and Nearley to compile the grammar into “R code”:

lexicon_grammar <- '
entry    -> "lx" _ "ps" _ "de" _ examples:?

examples -> ("xv" _ "xe" _):+

_        -> " " | null
'

source("") # source compile_grammar function from GitHub gist
parser <- compile_grammar(lexicon_grammar)

To check whether our dictionary entries are valid, we can use the generated parser function within a mutate call:

lexicon_df %>%
    filter(!is.na(code)) %>%
    group_by(lx_id) %>%
    summarise(code_sequence = paste0(code, collapse = " ")) %>%
    rowwise() %>%
    mutate(
        parsed_sequence  = parser(code_sequence, stop_on_error = FALSE),
        valid_sequence   = is.list(parsed_sequence)
    )

lx_id  code_sequence   parsed_sequence                                                                         valid_sequence
1      lx ps de xv xe  list("lx", " ", "ps", " ", "de", " ", list(list(list("xv", " ", "xe", character(0)))))  TRUE
7      lx de ps        Error: invalid syntax at line 1 col 4: lx de ps                                         FALSE
                                                                 ^ Unexpected "d"
11     lx ps de xv     Error: Parse incomplete, expecting more text at end of string: 'lx ps de xv'            FALSE

As we can see, only our \lx rouge ... entry block is valid within the grammar. The 2nd item, \lx bonjour ... has its ps and de lines inverted, and the third is missing a required English sentence xe for its example sentence, \xv Parlez-vous français?.
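
As a cross-check on the validity column, this particular toy grammar happens to be regular, so a plain regular expression can reproduce the same TRUE/FALSE result without V8. The sketch below assumes the space-separated code sequences shown above. The names `code_sequences` and `entry_pattern` are invented here, and the approach would not extend to grammars with real nesting or recursion.

```r
# Code sequences in the form produced by the summarise() step
code_sequences <- c("lx ps de xv xe", "lx de ps", "lx ps de xv")

# Regular-expression counterpart of the toy grammar (an assumption that only
# holds because the grammar has no nesting):
#   entry: lx, ps, de, then zero or more (xv, xe) pairs, with optional spaces
entry_pattern <- "^lx ?ps ?de ?(xv ?xe ?)*$"

valid_sequence <- grepl(entry_pattern, code_sequences)
valid_sequence
# → TRUE FALSE FALSE
```

Unlike the generated parser, this only says whether a sequence is valid; it does not report where the sequence went wrong.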

I was wondering if anyone knew a more robust/R-native way to do the same/a similar thing. One issue I’ve already encountered with the V8 package is that it uses an older version of the V8 engine, so this method isn’t quite able to take full advantage of the Nearley parsing toolkit. Compiling not-so-toy-example grammars is also quite frustrating :weary:.

Thanks for reading!