Parse all tables in a .docx file using Pandoc's grid_tables extension.


I'm working on a project where I receive .docx files which I parse and then reconstruct in a parameterised .Rmd file. The .docx files contain many tables, and the contents of these tables vary considerably between the template files. This makes parsing them difficult because in some reports the Pandoc simple_tables extension is used, whereas in others grid_tables is required.

The question

There's a fully working repo here that demonstrates the issue, but I'll also include a reprex directly in this question below.

Let's download the .docx file to our working directory with {here}



The screenshot below shows there are two tables in the document

I use rmarkdown::pandoc_convert() to convert the .docx to a markdown file

library(here)
rmarkdown::pandoc_convert(
  input = here("word-doc.docx"),
  to = "markdown",
  output = here("Example Word")
)

But as you can see, each table is in a different format:

readLines(here("Example Word"))
#>  [1] "# Example Word Report"                                                
#>  [2] ""                                                                     
#>  [3] "The table below will be parsed using Pandoc's simple_tables extension"
#>  [4] ""                                                                     
#>  [5] "  State        Times visited"                                         
#>  [6] "  ------------ ---------------"                                       
#>  [7] "  California   5"                                                     
#>  [8] "  Nevada       2"                                                     
#>  [9] ""                                                                     
#> [10] "The table below will be parsed using Pandoc's grid_tables extension"  
#> [11] ""                                                                     
#> [12] "+------------+--------------------------+"                            
#> [13] "| States     | Things done in the state |"                            
#> [14] "+============+==========================+"                            
#> [15] "| California | -   Disney Land          |"                            
#> [16] "|            |                          |"                            
#> [17] "|            | -   Universal Studios    |"                            
#> [18] "+------------+--------------------------+"                            
#> [19] "| Nevada     | -   Excalibur            |"                            
#> [20] "|            |                          |"                            
#> [21] "|            | -   Flamingos            |"                            
#> [22] "+------------+--------------------------+"

How can I tell rmarkdown::pandoc_convert() to always use the grid_tables extension, or what alternative approaches are there to end up with a .md file containing only grid tables?

Thanks a lot for the reproducible example: clear and simple! It really helps!

I think you just need to tell pandoc that you want a markdown format that does not support the tables you don't want. By table format, I mean the tables extensions.

If you do this, it should be ok

rmarkdown::pandoc_convert(
  input = here("word-doc.docx"),
  to = "markdown-simple_tables-pipe_tables-multiline_tables",
  output = here("Example Word")
)

The format string markdown-simple_tables-pipe_tables-multiline_tables tells Pandoc to write markdown with the extensions for the unwanted table formats disabled. All four table extensions are enabled by default, and I think Pandoc picks one depending on the content of each table. If you keep only one enabled, Pandoc has no choice to make and will use the one that remains: grid_tables
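
If the list of extensions to switch off changes between projects, the format string can be built rather than typed by hand. The snippet below is only a sketch: disable_extensions() is a hypothetical helper (not part of rmarkdown), the output name example-word.md is an assumption, and the conversion is skipped unless the input file, rmarkdown and Pandoc are all available.

```r
# Hypothetical helper (not part of rmarkdown): append "-extension" for each
# extension that should be disabled in the Pandoc format string.
disable_extensions <- function(format, extensions) {
  paste0(format, paste0("-", extensions, collapse = ""))
}

unwanted <- c("simple_tables", "pipe_tables", "multiline_tables")
to_format <- disable_extensions("markdown", unwanted)

# Only attempt the conversion when the input file and tooling are present.
input_file <- "word-doc.docx"
if (file.exists(input_file) &&
    requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::pandoc_convert(
    input = normalizePath(input_file),
    to = to_format,
    output = "example-word.md"  # assumed output name
  )
}
```

Here to_format ends up as the same string used in the call above, so the two approaches should be interchangeable.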

Docs about these extensions are here: Pandoc - Pandoc User’s Guide
About the extension syntax: Pandoc - Pandoc User’s Guide

Does it output what you want?
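
One rough way to check is to look for leftover simple-table rule lines in the converted file. The sketch below runs on a few lines copied from the output in the question; the two regular expressions are my own heuristics, not Pandoc's grammar, so treat a match as a hint rather than proof. On a real file you would replace md_lines with the result of readLines().

```r
# Sample lines copied from the converted output shown in the question.
md_lines <- c(
  "  State        Times visited",
  "  ------------ ---------------",
  "  California   5",
  "",
  "+------------+--------------------------+",
  "| States     | Things done in the state |",
  "+============+==========================+"
)

# Heuristic patterns (assumptions):
# - grid-table borders start and end with "+"
# - simple-table rules are two or more dash runs separated by spaces
is_grid_border <- grepl("^\\+[-=+:]+\\+$", md_lines, perl = TRUE)
is_simple_rule <- grepl("^\\s*-{2,}(\\s+-{2,})+\\s*$", md_lines, perl = TRUE)

which(is_grid_border)  # line numbers that look like grid-table borders
which(is_simple_rule)  # line numbers that look like simple-table rules

# TRUE when no simple-table rule line was found.
only_grid <- !any(is_simple_rule)
```

With the sample above, only_grid is FALSE because the first table is still a simple table; after converting with the restricted format string it should come back TRUE.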


Thanks - this perfectly solves my needs :grinning:

