Step-wise reporting/flowcharts of data exclusions in RMarkdown

Hi RStudio Community,

When applying multiple exclusion criteria to a dataset, I often want to report the number of observations remaining after each exclusion, either in text or in a flowchart in the RMarkdown document.

However, in my typical data cleaning workflow, I apply all of my exclusion criteria/filters in one pipeline before saving to a new object, which does not allow me to report intermediate numbers (see code chunk combined-filter).

My current workaround is to either:

  1. Create a new dataframe for each filter (see code chunk stepwise-filter-multiple-df), or
  2. Resave into the same dataframe for each filter after saving out the number (see the last filtering chunk, which overwrites filtered_mtcars at each step)

However, neither looks particularly tidy, and the former could add up in memory if the dataframe is large and there are numerous exclusion steps.

How do you tackle reporting step-wise on data exclusions? Any best practices or suggestions are appreciated!

I checked out Emily Riederer's RMarkdown Driven Development (RmdDD) and the documentation for some flowchart packages, e.g. PRISMAstatement, but have yet to find any suggestions.

Sample Rmd

*since I'm not sure how to reprex an Rmd

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(dplyr)
library(DiagrammeR)

filtered_mtcars <-
  mtcars %>% 
  filter(hp < 150) %>% 
  filter(wt < 3) %>% 
  filter(cyl > 4)

The original mtcars dataset has `r nrow(mtcars)` observations. We removed `r nrow(mtcars) - nrow(filtered_mtcars)` observations with a horsepower of 150 or more, a weight of 3,000 lbs or more, or fewer than 5 cylinders. The filtered dataset has `r nrow(filtered_mtcars)` observations.



v1 <-
  mtcars %>% 
  filter(hp < 150) 

v2 <-
  v1 %>% 
  filter(wt < 3) 

filtered_mtcars <-
  v2 %>% 
  filter(cyl > 4)

The original mtcars dataset has `r nrow(mtcars)` observations. We removed `r nrow(mtcars) - nrow(v1)` observations with a horsepower of 150 or more, `r nrow(v1) - nrow(v2)` observations with a weight of 3,000 lbs or more, and `r nrow(v2) - nrow(filtered_mtcars)` observations with fewer than 5 cylinders. The filtered dataset has `r nrow(filtered_mtcars)` observations.



filtered_mtcars <-
  mtcars %>% 
  filter(hp < 150)

hp_1 <- nrow(filtered_mtcars)

filtered_mtcars <-
  filtered_mtcars %>% 
  filter(wt < 3) 

wt_2 <- nrow(filtered_mtcars)

filtered_mtcars <-
  filtered_mtcars %>%
  filter(cyl > 4)

The original mtcars dataset has `r nrow(mtcars)` observations. We removed `r nrow(mtcars) - hp_1` observations with a horsepower of 150 or more, `r hp_1 - wt_2` observations with a weight of 3,000 lbs or more, and `r wt_2 - nrow(filtered_mtcars)` observations with fewer than 5 cylinders. The filtered dataset has `r nrow(filtered_mtcars)` observations.
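One possible way to tidy up the second workaround is to wrap the "filter, then record the count" pattern in a small helper, so that only one dataframe is kept and the counts live in a single named vector. The sketch below is only an illustration under assumptions: filter_with_counts is a hypothetical function name (it does not come from dplyr or any other package), and it relies on rlang, which is installed alongside dplyr.

```r
library(dplyr)

# Hypothetical helper (not part of any package): applies each named filter
# condition in turn and records the number of rows left after each step.
filter_with_counts <- function(data, ...) {
  conditions <- rlang::enquos(...)          # capture the conditions unevaluated
  counts <- c(original = nrow(data))        # row count before any exclusion
  for (step in names(conditions)) {
    data <- filter(data, !!conditions[[step]])
    counts[step] <- nrow(data)              # rows remaining after this step
  }
  list(data = data, counts = counts)
}

result <- filter_with_counts(
  mtcars,
  hp  = hp < 150,
  wt  = wt < 3,
  cyl = cyl > 4
)

filtered_mtcars <- result$data
n <- result$counts     # named vector: original, hp, wt, cyl
removed <- -diff(n)    # rows dropped at each step
```

The counts can then be used inline, e.g. `r n[["original"]]` for the starting size and `r removed[["hp"]]` for the number dropped by the horsepower filter. Only the final dataframe is stored, so memory use should not grow with the number of exclusion steps.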

DiagrammeR::grViz("digraph {
  graph [layout = dot, rankdir = TB]

  node [shape = rectangle]
  rec1 [label = 'Original mtcars (n = @@1)']
  rec2 [label = 'Horsepower < 150 (n = @@2)']
  rec3 [label = 'Weight < 3 (n = @@3)']
  rec4 [label = 'Cyl > 4 (n = @@4)']

  # edge definitions with the node IDs
  rec1 -> rec2 -> rec3 -> rec4
  }
  
  [1]: nrow(mtcars)
  [2]: hp_1
  [3]: wt_2
  [4]: nrow(filtered_mtcars)
  ")
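If the number of exclusion steps changes often, the flowchart text could also be generated from a named vector of counts instead of being typed out node by node. The following is a rough sketch using only base R string functions; the counts vector and the variable names (counts, ids, nodes, edges, dot) are made up for illustration, and the counts are recomputed here directly from mtcars so the block runs on its own.

```r
# Rows remaining after each cumulative exclusion step (names become node labels)
counts <- c(
  "Original mtcars"  = nrow(mtcars),
  "Horsepower < 150" = sum(mtcars$hp < 150),
  "Weight < 3"       = sum(mtcars$hp < 150 & mtcars$wt < 3),
  "Cyl > 4"          = sum(mtcars$hp < 150 & mtcars$wt < 3 & mtcars$cyl > 4)
)

# One node per step (rec1, rec2, ...) and a single chain of edges between them
ids   <- paste0("rec", seq_along(counts))
nodes <- sprintf("  %s [label = '%s (n = %d)']", ids, names(counts), counts)
edges <- paste0("  ", paste(ids, collapse = " -> "))

# Assemble the Graphviz DOT description as one string
dot <- paste(
  c("digraph {",
    "  graph [layout = dot, rankdir = TB]",
    "  node [shape = rectangle]",
    nodes,
    edges,
    "}"),
  collapse = "\n"
)

cat(dot)
```

The resulting string should be usable as the first argument to DiagrammeR::grViz(), in place of the hand-written graph with @@ substitutions.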
