My team and I are working on individual reports for all the students in the state. Because our R Markdown uses CSS, HTML, and LaTeX, each report takes a while to run. When we ran the reports for schools earlier this year, the process took 6-8 hours (250 schools). Is there any way we could run the process faster? The server has 16 processors and 16 GB of RAM... The outputs are Word documents, which are then converted to PDF using a separate Python script.
You might want to copy the R code into an R script as an experiment. You can then use the profiler to find out what part of the code is running slowly. The odds are that it is the R code rather than the markdown that is taking the time.
You can likely figure out what's going on without showing us the code. Use the profiler built into RStudio. You will likely find that just a few lines of code take most of the time. Then you might post just those lines of code for suggestions, or make part of the code parallel as @andresrcs suggests.
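As a minimal sketch of the profiling step, assuming the report's R code has been copied into a plain function: `build_report_data()` below is a hypothetical stand-in for your real code, made deliberately slow. Base R's `Rprof()` and `summaryRprof()` show where the time goes; the Profile menu in RStudio gives the same kind of information interactively.

```r
# Hypothetical stand-in for the slow report code; replace with your own.
build_report_data <- function(n = 2e5) {
  x <- numeric(0)
  for (i in seq_len(3e4)) x <- c(x, i)   # deliberately slow: grows a vector
  sorted <- sort(runif(n))
  list(x = x, sorted = sorted)
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start sampling the call stack
result <- build_report_data()
Rprof(NULL)                         # stop profiling

prof <- summaryRprof(prof_file)
# Functions ranked by time spent in them, including the functions they call
print(head(prof$by.total, 10))
```

The rows at the top of `by.total` are the first candidates to optimize or to run in parallel.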
We built a few in-house functions. We loop through a CSV file to get the information for all the students, and we loop through another file to get the information on all the schools. Using these functions, we build an HTML document that is converted at a later date. I think your idea of using all the cores might work best.
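A minimal sketch of spreading the per-school work over several cores with the base `parallel` package. `render_school()` and the school IDs are hypothetical placeholders for your in-house function and data. In a real run the function body would call something like `rmarkdown::render()` with a unique `output_file` and `intermediates_dir` per school, so that workers do not overwrite each other's temporary files.

```r
library(parallel)

# Hypothetical placeholder for the in-house function that builds one report.
render_school <- function(school_id) {
  Sys.sleep(0.1)                          # stands in for the real work
  paste0("report_", school_id, ".html")   # name of the file it would write
}

school_ids <- sprintf("S%03d", 1:8)       # made-up IDs for illustration

# Use a few workers and leave one core free; detectCores() can return NA.
cores <- detectCores()
n_workers <- if (is.na(cores)) 1L else max(1L, min(4L, cores - 1L))

cl <- makeCluster(n_workers)
outputs <- parLapply(cl, school_ids, render_school)
stopCluster(cl)

print(unlist(outputs))
```

With 16 GB of RAM shared by all workers, memory rather than core count may end up limiting how many reports can run at once, so it is worth increasing the number of workers gradually.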
It'll be hard to make concrete suggestions without some idea which parts of the report are taking long to generate. Since you mention "looping through a CSV," it's possible you are repeating calculations for each report which would be faster to perform once, save, and then selectively pull out with a quick filter for each report.
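To illustrate the "compute once, then filter" idea, here is a small sketch with made-up data. The data frame and the helper `get_school_mean()` are assumptions for illustration, not your actual code.

```r
# Made-up student data for illustration; in practice this is your CSV.
students <- data.frame(
  school = rep(c("A", "B", "C"), each = 4),
  score  = c(70, 80, 90, 60, 55, 65, 75, 85, 88, 92, 79, 81)
)

# Compute the per-school summary ONCE for the whole state...
school_means <- aggregate(score ~ school, data = students, FUN = mean)

# ...then each report only does a cheap lookup instead of recomputing.
get_school_mean <- function(school_id) {
  school_means$score[school_means$school == school_id]
}

print(get_school_mean("B"))
```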
For optimization like this, and specifically with rmarkdown, it is important to know what takes the time.
Is it the computation of results and graphics?
Is it the conversion of the R Markdown file to the output format?
With R Markdown, the most common workflow is to have everything calculated within the chunks. This works fine for one report, or for multiple reports if all the data used is precomputed or unique to the parametrized value.
Another workflow would be a two-step process:
First, an R script prepares all the data. This would output database-like results as CSV files or in another optimized format. This step could be sped up using parallelisation and other techniques. Tables and plots could be precomputed in this step if that helps.
Then, using a parametrized report, rendering the Rmd file would essentially consist of querying the previous results and inserting plots.
This kind of workflow caches results in the first step, so that rendering the publication does not recompute everything when it is not needed. In your case, what does each rendering of the Rmd really need to recompute, and what can be shared across all documents? The latter should be done only once for all reports. This will save time.
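The two-step process above could look roughly like the sketch below. The data, file names, and parameter names (`school`, `data_file`) are all made up for illustration; the rendering loop only runs if rmarkdown and Pandoc are available.

```r
# Step 1 (run once): prepare everything the reports share and save it.
students <- data.frame(
  school = rep(c("A", "B"), each = 3),
  score  = c(70, 80, 90, 55, 65, 75)
)
data_file <- file.path(tempdir(), "prepared.rds")
saveRDS(split(students, students$school), data_file)

# Step 2 (per report): a parametrized Rmd that only reads prepared results.
fence <- strrep("`", 3)   # chunk delimiter, built here to keep this example tidy
rmd_file <- file.path(tempdir(), "school_report.Rmd")
writeLines(c(
  "---",
  "title: School report",
  "output: html_document",
  "params:",
  "  school: A",
  "  data_file: prepared.rds",
  "---",
  "",
  paste0(fence, "{r, echo=FALSE}"),
  "school_data <- readRDS(params$data_file)[[params$school]]",
  "knitr::kable(school_data)",
  fence
), rmd_file)

# Render one report per school, passing only the parameters that differ.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  for (s in c("A", "B")) {
    rmarkdown::render(
      rmd_file,
      params = list(school = s, data_file = data_file),
      output_file = paste0("report_", s, ".html"),
      envir = new.env(),
      quiet = TRUE
    )
  }
}
```

The point of the split is that the expensive part lives in step 1 and runs once, while each `render()` call only reads a small slice of the saved results.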
There is tooling to help with such a workflow:
The first, simple one is the caching mechanism supported by rmarkdown and provided by knitr.
A second tool would help you set up a whole workflow from data entry to report rendering. It lets you define steps with inputs and outputs, and it supports parallelisation and caching: it is clever enough to know what can be parallelized and what has changed, and so when to use cached values.
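For the knitr caching mentioned above, a minimal sketch is to turn on the `cache` chunk option, either globally in the setup chunk of the Rmd or per chunk. This assumes knitr is installed. Whether it helps depends on how much of each document is identical between runs.

```r
# In the setup chunk of the .Rmd (assumes the knitr package is installed):
if (requireNamespace("knitr", quietly = TRUE)) {
  # Cache chunk results on disk; a cached chunk is re-run only when its code
  # or chunk options change.
  knitr::opts_chunk$set(cache = TRUE)
}
```

Note that the cache does not notice changes in external files such as your CSV on its own, so chunks that read data may need to be invalidated by hand.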
In my experience, speeding up a use case like this, where several R Markdown documents are rendered, is a matter of workflow. Obviously, a prerequisite is to know what the bottleneck is and where the best candidates for optimization lie (small improvements that save a lot of time).
So after inspecting the entire process, I have realized that converting the HTML to PDF is what makes the process long and drawn out. Is there a library or website you can point me to that converts HTML to PDF, in which case I could use one of the methods described above?
Currently we are using Python with a package called wkhtmltopdf. But the formatting changed when we ran this process on the local machine, hence we had to move the process to Docker. This has kept the formatting, but we would like to put everything in R if that can be done. Could Pandoc be a solution?
There is an option to print the HTML as PDF using Chrome. pagedown::chrome_print() can do that from R.
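A minimal sketch of what that could look like, assuming the pagedown package and a Chrome or Chromium browser are installed. The helper names and the file name in the usage line are hypothetical.

```r
# Derive the PDF file name from the HTML file name.
pdf_name <- function(html_file) {
  sub("\\.html?$", ".pdf", html_file)
}

# Print one HTML file to PDF with headless Chrome (requires pagedown + Chrome).
html_to_pdf <- function(html_file) {
  pdf_file <- pdf_name(html_file)
  pagedown::chrome_print(input = html_file, output = pdf_file)
  pdf_file
}

# Hypothetical usage:
# html_to_pdf("report_A.html")
```

Since each call is independent, this conversion step can be run over many files in parallel in the same way as the rendering step.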
However, the layout may not be perfect, as you may need special CSS print media rules for formatting. But I don't know how wkhtmltopdf works, and I guess that is already the case there. pagedown is an R package offering support in R Markdown for Paged.js.
It allows you to create paginated HTML documents ready to be printed. It is certainly an investment, but if you do this a lot, having a template could be valuable.
Some tools like pagedreport offer templates and show what can be done.
Printing HTML to PDF is a good option. Maybe wkhtmltopdf works well. I think Pandoc supports it via --pdf-engine, but I don't know whether it would be more efficient than the Python module. So converting HTML to PDF using Pandoc could be an option too.
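As a rough sketch of the Pandoc route, this calls the `pandoc` command line from R with wkhtmltopdf as the PDF engine. It assumes both tools are installed and on the PATH, and that your Pandoc version accepts `wkhtmltopdf` as a `--pdf-engine` value. The file names are hypothetical.

```r
# Build the Pandoc arguments for HTML -> PDF via wkhtmltopdf.
pandoc_pdf_args <- function(html_file, pdf_file) {
  c(html_file, "--to=html", "--pdf-engine=wkhtmltopdf", "-o", pdf_file)
}

args <- pandoc_pdf_args("report_A.html", "report_A.pdf")  # hypothetical files

# Run only if both tools are on the PATH and the input file exists.
if (nzchar(Sys.which("pandoc")) && nzchar(Sys.which("wkhtmltopdf")) &&
    file.exists(args[1])) {
  system2("pandoc", args)
}
```

Because this still runs wkhtmltopdf underneath, I would not expect it to be much faster than the Python wrapper; timing both on a handful of reports would tell.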