I am hoping to get some advice/assurance on parsing some text data in R.
To the best of my Googling, there’s no obvious way to generate a parser in R from an (E)BNF grammar of some sort (happy to be corrected about this). I recently came across
Reprex: checking a French-to-English dictionary
The data that I work with are dictionaries formatted as backslash-coded lines, which is a relatively common format within [endangered] language documentation work (see a longer example here). Below, I’ve made a toy French-to-English dictionary:
library(tidyverse)
library(zoo)
library(V8)

lexicon <- '\\lx rouge
\\ps adjective
\\de red
\\xv La chaise est rouge
\\xe The chair is red

\\lx bonjour
\\de hello
\\ps exclamation

\\lx parler
\\ps verb
\\de speak
\\xv Parlez-vous français?
'

lexicon_df <- read_lines(file = lexicon) %>%
  tibble(line = 1:length(.), data = .) %>%
  extract(col = data, regex = "\\\\([a-z]+)\\s(.*)", into = c("code", "value"), remove = F) %>%
  mutate(lx_id = ifelse(code == "lx", line, NA) %>% na.locf(na.rm = F))
I’ve found tidyverse a great way to work with a lot of aspects of the data, so a lot of my workflow consists of working on a data frame that looks like:
|line||data||code||value||lx_id|
|4||\xv La chaise est rouge||xv||La chaise est rouge||1|
|5||\xe The chair is red||xe||The chair is red||1|
For example, I can use assertr::verify to make sure all the part-of-speech values (adjective, noun, etc.) in the ps codes are valid. Besides validating values, checking the order of the code column is also important, and this is the part I haven’t quite worked out how to do [well] in R.
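The value check described above might look something like the following minimal sketch. It rebuilds only the ps rows of the toy lexicon by hand, and the valid_ps vector is a hypothetical whitelist made up for illustration, not a list from any real project.

```r
library(dplyr)
library(assertr)

# Hand-built stand-in for the `ps` rows of lexicon_df (toy data from the reprex)
ps_rows <- data.frame(
  line  = c(2, 9, 12),
  code  = c("ps", "ps", "ps"),
  value = c("adjective", "exclamation", "verb"),
  stringsAsFactors = FALSE
)

# Hypothetical whitelist of accepted part-of-speech values (an assumption)
valid_ps <- c("adjective", "exclamation", "noun", "verb")

# verify() returns the data unchanged if the expression is TRUE for every row,
# and raises an error otherwise
checked <- ps_rows %>%
  filter(code == "ps") %>%
  verify(value %in% valid_ps)
```

Because verify() passes the data through on success, a check like this can sit in the middle of a longer pipeline.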
Question/code review: how can the following be done better?
Following Jeroen Ooms’s ‘Using NPM packages in V8’ vignette, I experimented with writing a compile_grammar R function (GitHub gist here). The function takes a Nearley grammar, such as lexicon_grammar below, and uses V8 and Nearley to compile the grammar into “R code”:
lexicon_grammar <- '
entry    -> "lx" _ "ps" _ "de" _ examples:?
examples -> ("xv" _ "xe" _):+
_        -> " " | null
'

source("https://git.io/vAFux") # source compile_grammar function from GitHub gist
parser <- compile_grammar(lexicon_grammar)
To check whether our dictionary entries are valid, we can use the generated parser function within a pipeline:
lexicon_df %>%
  filter(!is.na(code)) %>%
  group_by(lx_id) %>%
  summarise(code_sequence = paste0(code, collapse = " ")) %>%
  rowwise() %>%
  mutate(
    parsed_sequence = parser(code_sequence, stop_on_error = F),
    valid_sequence = is.list(parsed_sequence)
  )
|lx_id||code_sequence||parsed_sequence||valid_sequence|
|1||lx ps de xv xe||list("lx", " ", "ps", " ", "de", " ", list(list(list("xv", " ", "xe", character(0)))))||TRUE|
|7||lx de ps||Error: invalid syntax at line 1 col 4: lx de ps ^ Unexpected "d"||FALSE|
|11||lx ps de xv||Error: Parse incomplete, expecting more text at end of string: 'lx ps de xv'||FALSE|
As we can see, only our \lx rouge ... entry block is valid within the grammar. The second item, \lx bonjour ..., has its ps and de lines inverted, and the third is missing the required English sentence xe for its example sentence, \xv Parlez-vous français?.
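To make the three outcomes concrete, here is a rough base-R sketch that approximates this particular toy grammar with a regular expression. This is a swapped-in technique, not what compile_grammar or Nearley does internally. It assumes codes are joined by single spaces (as paste0(code, collapse = " ") produces), and a regex like this would not scale to grammars with nesting or recursion.

```r
# Code sequences for the three toy entries, keyed by lx_id
code_sequences <- c(
  "1"  = "lx ps de xv xe",
  "7"  = "lx de ps",
  "11" = "lx ps de xv"
)

# Regex stand-in for the toy grammar: "lx ps de", then zero or more "xv xe" pairs
entry_pattern <- "^lx ps de( xv xe)*$"

# TRUE where the whole sequence matches the pattern
valid_sequence <- grepl(entry_pattern, code_sequences)
names(valid_sequence) <- names(code_sequences)
valid_sequence
```

Unlike the Nearley parser, this only says whether a sequence matches; it gives no position or reason for a failure.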
I was wondering if anyone knew a more robust or R-native way to do the same (or a similar) thing. One issue I’ve already encountered with using V8 is that the package uses an older version of the V8 engine, so this method isn’t quite able to take full advantage of the Nearley parsing toolkit. Compiling not-so-toy-example grammars is also quite frustrating.
Thanks for reading!