Best Practices: how to prepare your own data for use in a `reprex` if you can’t, or don’t know how to, reproduce a problem with a built-in dataset?

@EconomiCurtis split this out of FAQ: What's a reproducible example (`reprex`) and how do I do one?.

Curious if you have anything additional to add specifically on "how to prepare your own data for use in a reprex if you can't, or don't know how to reproduce a problem with a built-in dataset."


I think @jessemaegan's post is about 80% there. The piece it is missing, if your average Stack Overflow post is any indication, is an explanation of how to prepare your own data for use in a reprex if you can't, or don't know how to, reproduce a problem with a built-in dataset.

Some handy things to know for this situation:

  1. deparse()
    The ugly as sin, gold standard:
head(my_data, 2) %>%
  deparse()

returning something like:

structure(list(date = list(structure(-61289950328, class = c("POSIXct", 
"POSIXt"), tzone = ""), structure(-61258327928, class = c("POSIXct", 
"POSIXt"), tzone = "")), id = c("0001234", "0001235"), ammount = c("$18.50", 
"-$18.50")), class = "data.frame", .Names = c("date", "id", "ammount"
), row.names = c(NA, -2L))

Which is not beginner friendly... what's a structure? But it is really the only method that will not mess with the data types. It also works with both listy structures and data.frame-ish ones.
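To make the "will not mess with the data types" point concrete, here is a minimal base R sketch of the round trip. The data frame is a hypothetical stand-in for `my_data` (its columns are made up for illustration), and it assumes the reader simply evaluates the deparsed text.

```r
# Hypothetical stand-in for `my_data`: a Date, a factor and a numeric column.
my_data <- data.frame(
  day   = as.Date(c("2016-10-27", "2016-10-28")),
  grp   = factor(c("a", "b")),
  value = c(18.5, -18.5)
)

# deparse() returns the code as a character vector (possibly several lines).
code <- deparse(my_data)
cat(code, sep = "\n")

# Whoever reads the post can rebuild the object by evaluating that code.
rebuilt <- eval(parse(text = code))

# Column classes survive the trip (Date stays Date, factor stays factor).
sapply(rebuilt, function(col) class(col)[1])
```

In a real post you would paste the printed `structure(...)` text after `my_data <- ` rather than calling `eval()` yourself.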

  2. tibble::tribble()
    Handy if you have the patience to hand-type some data for your audience in a pretty format. There is a severe limitation in that not all data types can be represented in a tribble(). The previous would be something close to:
tibble::tribble(
               ~date,       ~id,  ~ammount,
  "27/10/2016 21:00", "0001234",  "$18.50",
  "28/10/2016 21:05", "0001235", "-$18.50"
  ) %>%
  dplyr::mutate(date = lubridate::parse_date_time(date, orders = c("d!/m!/Y! H!:M!")))

With the trailing mutate to fix the date that could not be represented. It would be remiss of me not to plug datapasta::tribble_paste() which can save you some typing here.

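  3. readr::read_csv()
    It's possible to represent your data, complete with type specification, as a read_csv() call. The previous would be:
readr::read_csv('date, id, ammount
"27/10/2016 21:00", 0001234,  $18.50
"28/10/2016 21:05", 0001235, -$18.50',
  col_types = readr::cols(
    # the column holds a time as well, so use col_datetime() rather than col_date()
    readr::col_datetime(format = "%d/%m/%Y %H:%M"),
    # keep the leading zeros and the currency sign by reading as character
    readr::col_character(), readr::col_character()
  )
)
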
  4. krlmlr/deparse
    Not yet on CRAN. A nicer version of 1. that can also get you directly to 2. in some cases. https://github.com/krlmlr/deparse

Edit: you can always use data.frame(), tibble(), list() etc!
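As a rough sketch of that last suggestion, the same two rows can be typed by hand with base R's data.frame(). The parsing format and the UTC time zone are assumptions made for the illustration, since the original data does not state a time zone.

```r
# Hand-written version of the two example rows, base R only.
my_data <- data.frame(
  # Parse the day/month/year strings explicitly; tz = "UTC" is an assumption.
  date = as.POSIXct(c("27/10/2016 21:00", "28/10/2016 21:05"),
                    format = "%d/%m/%Y %H:%M", tz = "UTC"),
  # Keep ids and amounts as character so leading zeros and "$" survive.
  id = c("0001234", "0001235"),
  ammount = c("$18.50", "-$18.50"),
  stringsAsFactors = FALSE
)

# str() is a quick way to check that each column has the intended type.
str(my_data)
```

This is more typing than the options above, but it needs no extra packages and the column types are visible in the code itself.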

12 Likes

dput(., control = NULL) can often be a bit clearer.
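A small sketch of the difference, using a made-up data frame with a factor column. Note the trade-off: the simpler output drops attributes such as the class, so it is easier to read but no longer rebuilds the same object.

```r
# Hypothetical example data; the factor column is where the outputs differ most.
x <- data.frame(grp = factor(c("a", "b", "a")), n = 1:3)

# Default dput(): full fidelity, wrapped in structure(...) with attributes.
full <- capture.output(dput(x))

# control = NULL: simplest deparsing, attributes are not shown.
simple <- capture.output(dput(x, control = NULL))

cat(full, sep = "\n")
cat(simple, sep = "\n")
```

So `control = NULL` seems best kept for plain vectors and lists, with the default retained when the data types are part of the problem being reported.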

3 Likes

The prevalent advice on SO is to use dput() for more complicated data, but posts should strive to make a workable, less complicated example if possible.

3 Likes

Yes, you're totally right. I forgot about this because I usually do something like:
my_data %>% deparse() %>% clipr::write_clip()

Which places it on the clipboard. dput will make a nicer output for manual copying :+1:

2 Likes

For sharing simple data.frames (those containing only basic types, no dates, no factors, and no row names) I suggest using wrapr::draw_frame() to build sharable examples.

For example, suppose our data was the following.

d <- head(ggplot2::diamonds) 

wrapr::draw_frame can share this data in a very legible form:

library("wrapr")
cat(draw_frame(d))

This outputs the following (older versions of wrapr do not add the "::" qualifier).

wrapr::build_frame(
   "carat", "cut"      , "color", "clarity", "depth", "table", "price", "x" , "y" , "z"  |
   0.23   , "Ideal"    , "E"    , "SI2"    , 61.5   , 55     , 326L   , 3.95, 3.98, 2.43 |
   0.21   , "Premium"  , "E"    , "SI1"    , 59.8   , 61     , 326L   , 3.89, 3.84, 2.31 |
   0.23   , "Good"     , "E"    , "VS1"    , 56.9   , 65     , 327L   , 4.05, 4.07, 2.31 |
   0.29   , "Premium"  , "I"    , "VS2"    , 62.4   , 58     , 334L   , 4.2 , 4.23, 2.63 |
   0.31   , "Good"     , "J"    , "SI2"    , 63.3   , 58     , 335L   , 4.34, 4.35, 2.75 |
   0.24   , "Very Good", "J"    , "VVS2"   , 62.8   , 57     , 336L   , 3.94, 3.96, 2.48 )

The point is, with the wrapr package loaded the above output is actually executable code that produces the same data.frame. One can then copy and paste the above code to start a fresh example from this data (and not need to include steps that took one to this point).

(Was asked to post this to this thread here.)

4 Likes

Nice feature!

datapasta :package: has something very useful and similar

In a script, you could run

d <- head(ggplot2::diamonds)
datapasta::tribble_paste(d)

and the command will output a tribble call, using the clipboard, right at your cursor position! Very useful for reproducibility when preparing a reprex.

datapasta::tribble_paste(d)
tibble::tribble(
  ~carat,        ~cut, ~color, ~clarity, ~depth, ~table, ~price,   ~x,   ~y,   ~z,
    0.23,     "Ideal",    "E",    "SI2",   61.5,     55,   326L, 3.95, 3.98, 2.43,
    0.21,   "Premium",    "E",    "SI1",   59.8,     61,   326L, 3.89, 3.84, 2.31,
    0.23,      "Good",    "E",    "VS1",   56.9,     65,   327L, 4.05, 4.07, 2.31,
    0.29,   "Premium",    "I",    "VS2",   62.4,     58,   334L,  4.2, 4.23, 2.63,
    0.31,      "Good",    "J",    "SI2",   63.3,     58,   335L, 4.34, 4.35, 2.75,
    0.24, "Very Good",    "J",   "VVS2",   62.8,     57,   336L, 3.94, 3.96, 2.48
  )

If run in a script, the output will be pasted into the script; if run in the console, it will be pasted into the console.

One advantage is that no package other than tibble is needed to recreate the data.frame/tibble. datapasta is only needed to generate the tribble call from the data.frame object. Nice features!

datapasta::tribble_construct outputs a string that can be printed with cat.

There are also df_paste and df_construct for creating data.frame calls, and other features to discover in datapasta.

7 Likes

datapasta looks neat. It definitely should get more attention.

I'd just say if one is debugging a data.table issue then not having to have tibble active is a similar advantage (one less possible source of interference).

Thanks @cderv!

I didn't mention datapasta in the original write up because it will silently convert complex objects it can't write in a tribble to character. I thought this might be confusing for people new to this type of thing. However it tries to make up for that with convenience.

Small note: When you call datapasta *_paste functions with arguments they just write directly to the active source pane or console without going via the clipboard.

2 Likes

Good to know! Thanks for the clarification. I was confused by the _paste suffix. :wink:

I agree!
Adding datatable_paste and friends to datapasta could help those users. Currently there are only data.frame and tibble versions.

4 Likes

Good idea. data.table would be a worthwhile addition.

5 Likes

I thoroughly agree. Here you go.

3 Likes