issue making reproducible (identical) Word docx output

I am having issues making a reproducible R markdown Word (docx) document. By reproducible, I mean that if I run the exact same Rmd file twice, I would like to run a diff on the output (at the linux command prompt) and have the output be identical.

Consider the simple Rmd file included at the end of this query, which I saved as "test.Rmd". I can knit this file to generate test.docx, move test.docx to test1.docx, and then knit the Rmd file again. Now I have 2 files which should be the same: test.docx and test1.docx. However, running "diff test.docx test1.docx" indicates that the output is different - I get the message "Binary files test.docx and test1.docx differ".

How can I change the knit settings to make the docx output identical for this case?

Here is the version of test.Rmd I am using:

---
title: "Test Reproducibility"
author: "My Name"
date: "12/17/2020"
output: word_document
---

## R Markdown

This is an R Markdown document.

If you open the docx with a decompression program such as 7zip (yes, a .docx file is just a zipped folder with xml files in it), you can pinpoint the exact source of the difference.

In your example document, you will find that all the files are identical except for docProps/core.xml. If you open that xml file, it might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<cp:coreProperties xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dcmitype="http://purl.org/dc/dcmitype/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties">
  <dc:title>Test Reproducibility</dc:title>
  <dc:creator>My Name</dc:creator>
  <cp:keywords/>
  <dcterms:created xsi:type="dcterms:W3CDTF">2020-12-18T02:03:14Z</dcterms:created>
  <dcterms:modified xsi:type="dcterms:W3CDTF">2020-12-18T02:03:14Z</dcterms:modified>
</cp:coreProperties>

If you compare the files test and test1, you will see that the difference is in the creation and modification dates. These are not the same as the dates in the filesystem: if you use touch (on a UNIX system), you change the modification date at the filesystem level, but not the one saved inside the file. The only way around it would be a Pandoc option to override that date, but I find that doubtful, and a quick search didn't show me anything.
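One thing worth checking on that point: the Pandoc manual has a "Reproducible builds" section, which says that formats such as docx embed build timestamps and that setting the SOURCE_DATE_EPOCH environment variable makes Pandoc take the timestamp from it instead of the current time. Here is a minimal sketch from R, assuming the Pandoc that rmarkdown finds is recent enough to honour that variable (I have not checked which version introduced it). The date is an arbitrary choice that matches the one in the YAML header.

```r
# Pin the timestamp that Pandoc writes into the docx.
# Assumption: the installed Pandoc honours SOURCE_DATE_EPOCH, as its manual describes.
fixed_date <- as.POSIXct("2020-12-17 00:00:00", tz = "UTC")

# The variable must hold seconds since 1970-01-01 UTC as a plain integer string.
epoch <- format(as.numeric(fixed_date), scientific = FALSE)
Sys.setenv(SOURCE_DATE_EPOCH = epoch)

# Pandoc is started as a child process of R, so it inherits the variable.
# The render is guarded so the snippet also runs where test.Rmd is absent.
if (requireNamespace("rmarkdown", quietly = TRUE) && file.exists("test.Rmd")) {
  rmarkdown::render("test.Rmd", output_file = "test.docx")
  file.copy("test.docx", "test1.docx", overwrite = TRUE)
  rmarkdown::render("test.Rmd", output_file = "test.docx")
}
```

Whether this makes the two files byte-identical also depends on nothing else in the archive varying between runs, so it is worth re-running diff on the results to confirm.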

So, if you only want to compare a few files, I would suggest opening them in Word and using its document comparison (Review > Compare), which should report no differences. If you need to automate it, you could use some bash magic to unzip the files and compute the md5 of everything except that core.xml.
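The unzip-and-hash idea can also be sketched in base R rather than bash. This is only an illustration: the function names (dir_hashes, changed_entries, docx_hashes, docx_changed) are made up for this example, and it assumes docProps/core.xml is the only entry you want to ignore.

```r
# Hash every file under `dir`, keyed by its path relative to `dir`.
# all.files = TRUE matters: a docx contains the dot-file "_rels/.rels".
dir_hashes <- function(dir) {
  rel <- sort(list.files(dir, recursive = TRUE, all.files = TRUE))
  setNames(unname(tools::md5sum(file.path(dir, rel))), rel)
}

# Entries whose hash differs (or that exist on one side only), skipping `ignore`.
changed_entries <- function(h1, h2, ignore = "docProps/core.xml") {
  entries <- setdiff(union(names(h1), names(h2)), ignore)
  differs <- vapply(
    entries,
    function(e) !identical(unname(h1[e]), unname(h2[e])),
    logical(1)
  )
  entries[differs]
}

# Unzip a docx into a throwaway directory and hash its contents.
docx_hashes <- function(path) {
  dir <- tempfile("docx_")
  on.exit(unlink(dir, recursive = TRUE))
  utils::unzip(path, exdir = dir)
  dir_hashes(dir)
}

# Compare two docx files; character(0) means "same content apart from `ignore`".
docx_changed <- function(a, b, ignore = "docProps/core.xml") {
  changed_entries(docx_hashes(a), docx_hashes(b), ignore)
}

if (file.exists("test.docx") && file.exists("test1.docx")) {
  print(docx_changed("test.docx", "test1.docx"))
}
```

Passing ignore = character(0) lists every differing entry, which is a quick way to confirm that core.xml really is the only one.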

There should also be a way to do that comparison programmatically with COM, or .NET stuff. Or you can write an R function that does that:


file1 <- officer::read_docx("path/to/test.docx")
file2 <- officer::read_docx("path/to/test1.docx")

waldo::compare(file1, file2)
#> old$package_dir vs new$package_dir
#> - "PATH\\AppData\\Local\\Temp\\Rtmp6ZSKdx\\file5c74177821d4"
#> + "PATH\\AppData\\Local\\Temp\\Rtmp6ZSKdx\\file5c74714776c4"
#> 
#>      old$doc_properties$data | new$doc_properties$data     
#> [16] "Test Reproducibility"  | "Test Reproducibility"  [16]
#> [17] "My Name"               | "My Name"               [17]
#> [18] ""                      | ""                      [18]
#> [19] "2020-12-18T02:03:14Z"  - "2020-12-18T02:02:54Z"  [19]
#> [20] "2020-12-18T02:03:14Z"  - "2020-12-18T02:02:54Z"  [20]

file1$doc_properties$data <- file1$doc_properties$data[-c(4,5),]
file2$doc_properties$data <- file2$doc_properties$data[-c(4,5),]

file1$package_dir <- NULL
file2$package_dir <- NULL

waldo::compare(file1, file2)
#> v No differences

Created on 2020-12-17 by the reprex package (v0.3.0)

Thanks for the help! I plan to store the docx file in a version control repository (subversion or git), and would like to modify the docx file to update the offending information you pointed out. Otherwise, the version control will think that the binary file has changed, when in reality it hasn't.

If I follow your R script and try to save the updated Word document (e.g. 'print(file1,target="test.docx")'), Word complains that the file is corrupted and won't open it. It appears that the unique package directory (i.e., file1$package_dir) is needed to save the Word document correctly. Any ideas on how to address this?

Short version: in svn/git that shouldn't be a problem, they don't use the same diff. See the result of:

git diff test1.docx test2.docx

Yes, the R script part is not a good idea: I simply deleted the metadata that was different, no wonder the resulting file is seen as corrupted.

Version control is mostly meant to work with text files, not so much with binary ones (and since the docx is zipped, it's binary). My main suggestion is to save the report in a text format (e.g. markdown or latex), commit that to version control, and use it to compare versions. If you also want the Word document, you could put a unique datetime in the filename and save the Word and text versions alongside each other under the same name. You can then use the text file to see whether the content differs, and the Word file for download. In that case you may also need to tell svn/git that the Word file is binary (see the Google results on that problem).

That being said, I think both git and svn have some support of docx by now, and for example:

git diff test1.docx test2.docx

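does give me no difference. Note, though, that inside a repository git reads two file names as pathspecs unless you pass --no-index, so an empty result may only mean that neither file has uncommitted changes. Of course you can only run that on a computer with git installed. I think svn can also do that.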

Unfortunately svn diff finds that the updated Word document is different, likely because of the differences you identified. I have not checked git.

This issue is quite frustrating from a reproducibility standpoint - it is not unreasonable to store the final compiled report in a version control repository, and one should be able to identically reproduce that report.

Regardless, your insights have been very helpful in elucidating exactly what the problem is - thank you!
