Looking for advice or repositories for RMarkdown data projects



I'm a complete newbie to R Markdown. I've done a ton of reading on the different ways to set up R Markdown files, and have decided that I want to keep my data external and use .R scripts to source it. I then plan to build the tables for my analysis with code in the R Markdown file.

2 questions:

  1. How does everyone list the source for a table (meaning what script the data comes from) in their R Markdown file?
  2. Does anyone have any resources or repos on GitHub that they can point me to, so I can see a real-life data analysis project that uses raw data, R scripts, and R Markdown?

Thanks in advance.


On 2: There is a whole book on R Markdown: R Markdown: The Definitive Guide. That is an excellent place to start.

On 1: Not quite sure what you are asking. You use ``` to enclose the chunks for R code inside R Markdown. Are you trying to incorporate a separate file?


You need to start by actually reading the excellent, online R Markdown book, and esp. review the examples.

Before attempting with your own code, make sure you actually understand what the explained examples are doing!


Hi @realhiphop! Welcome!

On question 1, are you asking about how you would cite the source for your tables in your report text? Or how you cause the external scripts to run so that the data becomes available for your report code to beautify and present?

And can you explain a bit more about your workflow? Are your external data in a database? Some CSVs? Coming from a third-party API?

The question of how to incorporate data into an R Markdown workflow comes up a lot and there's no one right answer. Here's a previous discussion that might be of interest, now or later:

Your question #2 is a good one, and I think a bit different from looking at synthetic examples (those are also important though!). But I fear people with relevant projects to share might not find this post because the title is fairly uninformative. Maybe consider editing it to focus on your specific questions? "Newbie" is often in the eye of the beholder anyway! :grin:


Thanks so much.
I’ll give a little more detail on my analysis. My project is using sports data. The data comes in a few flavors:

  1. API Pulls
  2. .csv files
  3. Data frames that I’ve created and exported after cleaning up in dplyr

I also have R scripts that I’ve created to do some of the analysis, in addition to joining tables from different data sources.

My plan is to use some of the scripts I’ve created that either get data or synthesize data, and load the resulting data frames into R Markdown. I’m then planning to beautify the results.

For question 1: the data being used will come from specific R scripts (in the case where I import .rds files containing data frames exported from an R script).


The workflow that works for me (as usual - your mileage may vary) is the following:

  1. start the rmd document with an init chunk, which runs very silently

{r init, echo = F, eval = T, message = F, warning = F}

This chunk loads all the data - be it from csv, database or by sourcing other R files.

It does so quietly - so the output does not find its way into the final document - and therefore it has messages and warnings turned off.
In addition I am often forced to wrap its content in capture.output( { ... }, file = '/dev/null') to stop any output bleeding into my final document.

As this chunk has warnings and messages suppressed, I found it good practice to limit it to loading data & making sure to close all database connections. Full stop.

  2. continue with other chunks, that do the real work using the data.frames loaded earlier.

It is normal that something breaks in the "other" chunks from time to time. That is life. But I found it advantageous to make sure the data is loaded in isolation, and that I am not left with any open database connections when stuff breaks.
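As a rough illustration, the body of such an init chunk could look like the sketch below. The file names, object names, and stand-in data are all hypothetical and exist only so the example runs on its own. It also uses `nullfile()` (available from R 3.6.0) as a portable substitute for `'/dev/null'`; this is one possible arrangement, not the only way to do it.

```r
# Body of a hypothetical init chunk. The chunk header would stay as above:
# {r init, echo = F, eval = T, message = F, warning = F}

# Stand-in files so the sketch is self-contained; a real project would
# point at its own CSV / RDS files instead (paths here are assumptions).
csv_path <- file.path(tempdir(), "games.csv")
rds_path <- file.path(tempdir(), "players.rds")
write.csv(data.frame(game_id = 1:3, points = c(101, 97, 88)),
          csv_path, row.names = FALSE)
saveRDS(data.frame(player = c("A", "B"), team = c("X", "Y")), rds_path)

# Load everything quietly. Anything printed inside the braces is sent to
# the null device instead of the rendered document.
invisible(capture.output({
  games   <- read.csv(csv_path)
  players <- readRDS(rds_path)
  print("noisy loader output")  # would otherwise bleed into the report
  # A database connection opened here should also be closed here.
}, file = nullfile()))
```

Objects assigned inside the braces (`games`, `players`) are created in the calling environment, so later chunks can use them directly.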


Was hoping to get some hits after changing the title!

With regards to sources, I was looking for advice. I have scripts for data acquisition, and scripts that do analysis with that data. My plan is to call in the relevant data frames via .rds.

I've been thinking about how to document where each .rds file came from in the R Markdown file, so that if I pick up the file next year, I know which .R script produced the .rds file and can run it to refresh the data.

Is adding a comment with the name of the .R script above the line that reads the .rds file the best approach?

I'm trying to get better about my documentation practices as I build scripts.
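To make the question concrete, here is a minimal sketch of the idea, assuming a hypothetical script `R/01_get_games.R` and a hypothetical attribute name `source_script`. It combines a plain comment next to the `readRDS()` call with a provenance attribute stored on the data frame itself. This is only an illustration of one possible convention, not an established standard.

```r
# --- in the (hypothetical) script R/01_get_games.R ---
games <- data.frame(team = c("A", "B"), points = c(101, 97))
attr(games, "source_script") <- "R/01_get_games.R"  # record where it came from
rds_path <- file.path(tempdir(), "games.rds")       # stand-in output location
saveRDS(games, rds_path)

# --- in an R Markdown chunk ---
# Source: R/01_get_games.R (re-run that script to refresh games.rds)
games <- readRDS(rds_path)
attr(games, "source_script")  # the attribute survives the saveRDS/readRDS round trip
```

One caveat: custom attributes can be dropped by later transformations of the data frame, so the comment in the chunk remains the more robust record.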