Best Practices: how to prepare your own data for use in a `reprex` if you can’t, or don’t know how to reproduce a problem with a built-in dataset?


#1

@EconomiCurtis split this out of FAQ: What's a reproducible example (`reprex`) and how do I do one?.

Curious if you have anything additional to add specifically on “how to prepare your own data for use in a reprex if you can’t, or don’t know how to reproduce a problem with a built-in dataset.”


I think @jessemaegan’s post is about 80% there. The piece it is missing, if your average Stack Overflow post is any indication, is an explanation of how to prepare your own data for use in a reprex if you can’t, or don’t know how to reproduce a problem with a built-in dataset.

Some handy things to know for this situation:

  1. deparse()
    The ugly as sin, gold standard:
head(my_data, 2) %>%
  deparse()

returning something like:

structure(list(date = list(structure(-61289950328, class = c("POSIXct", 
"POSIXt"), tzone = ""), structure(-61258327928, class = c("POSIXct", 
"POSIXt"), tzone = "")), id = c("0001234", "0001235"), ammount = c("$18.50", 
"-$18.50")), class = "data.frame", .Names = c("date", "id", "ammount"
), row.names = c(NA, -2L))

Which is not beginner friendly… what’s a structure? But it is really the only method that will not mess with the data types. It also works with both listy structures and data.frame-ish ones.

  2. tibble::tribble()
    Handy if you have the patience to hand type out some data for your audience in a pretty format. There is a severe limitation in that not all data types can be represented in a tribble(). The previous would be something close to:
tibble::tribble(
               ~date,       ~id,  ~ammount,
  "27/10/2016 21:00", "0001234",  "$18.50",
  "28/10/2016 21:05", "0001235", "-$18.50"
  ) %>%
  dplyr::mutate(date = lubridate::parse_date_time(date, orders = c("d!/m!/Y! H!:M!")))

With the trailing mutate to fix the date that could not be represented. It would be remiss of me not to plug datapasta::tribble_paste() which can save you some typing here.

  3. readr::read_csv()
    It’s possible to represent your data, complete with type specification, as a read_csv() call. The previous would be:
readr::read_csv('date, id, amount
"27/10/2016 21:00", 0001234,  $18.50
"28/10/2016 21:05", 0001235, -$18.50',
  col_types = readr::cols(readr::col_datetime(format = "%d/%m/%Y %H:%M"),
    readr::col_character(), readr::col_character())
)
  4. krlmlr/deparse
    Not yet on CRAN. A nicer version of 1 that can also get you directly to 2 in some cases. https://github.com/krlmlr/deparse
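
To illustrate why approach 1 is described as the one that does not mess with data types, here is a minimal base R sketch of the round trip. The `my_data` frame is made up for illustration, and `eval(parse(text = ...))` only stands in for a helper pasting the deparsed code into their own session.

```r
# Made-up stand-in for the poster's own data (hypothetical values).
my_data <- data.frame(
  when  = as.Date(c("2016-10-27", "2016-10-28")),
  group = factor(c("a", "b")),
  id    = c("0001234", "0001235"),
  stringsAsFactors = FALSE
)

# deparse() returns a character vector of code lines; collapse it into one string.
code <- paste(deparse(my_data), collapse = "\n")

# Evaluating that string mimics someone running the pasted code.
rebuilt <- eval(parse(text = code))

# Compare the first class of every column before and after the round trip.
first_class <- function(col) class(col)[1]
sapply(my_data, first_class)
sapply(rebuilt, first_class)
```

The two `sapply()` calls should report the same classes, which is the property the other approaches cannot always promise.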

Edit: you can always use data.frame(), tibble(), list() etc!
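
As a rough sketch of the constructor route mentioned in the edit, small data can simply be typed out in base R. The values below are invented to resemble the earlier example; setting the types explicitly keeps the leading zeros in `id` and makes `date` a real date-time.

```r
# Hand-typed example data; the values are invented for illustration.
my_data <- data.frame(
  date   = as.POSIXct(c("27/10/2016 21:00", "28/10/2016 21:05"),
                      format = "%d/%m/%Y %H:%M", tz = "UTC"),
  id     = c("0001234", "0001235"),   # character, so leading zeros are kept
  amount = c("$18.50", "-$18.50"),    # left as character, as in the raw data
  stringsAsFactors = FALSE            # avoids factors on R versions before 4.0
)

# Show the structure so readers can see the column types at a glance.
str(my_data)
```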


#2

dput(., control = NULL) can often be a bit clearer.
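
A small base R sketch of the trade-off, using made-up data: with `control = NULL` the output is shorter because attributes are not written out, so it is worth checking what the pasted code rebuilds before relying on it.

```r
# Made-up example data with one factor column.
d <- data.frame(id = c("0001234", "0001235"), grp = factor(c("a", "b")),
                stringsAsFactors = FALSE)

# Capture what dput() would print, with default options and with control = NULL.
full  <- paste(capture.output(dput(d)), collapse = "\n")
plain <- paste(capture.output(dput(d, control = NULL)), collapse = "\n")

cat(full, "\n")
cat(plain, "\n")

# The plain form carries no attributes, so the data.frame class is not restored.
rebuilt <- eval(parse(text = plain))
is.data.frame(rebuilt)
```

If the plain form is used, wrapping the pasted code in `as.data.frame()` and re-creating any factors or dates by hand is one way to get back to the original shape.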


#3

The prevalent advice on SO is to use dput() for more complicated data, but posts should strive for a workable, less complicated example if possible.
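
One way to get to a less complicated example is to cut the data down before calling dput(). A minimal sketch in base R, where `big` is a made-up stand-in for a larger real dataset:

```r
# Made-up stand-in for a larger real dataset (hypothetical columns).
big <- data.frame(
  site  = factor(c("north", "south", "east", "west", "north", "south")),
  value = c(1.2, 3.4, 5.6, 7.8, 9.0, 2.1),
  notes = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Keep only the rows and columns needed to show the problem, and drop
# unused factor levels so they do not clutter the dput() output.
small <- droplevels(big[1:3, c("site", "value")])

dput(small)
```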


#4

Yes, you’re totally right. I forgot about this because I usually do something like:
my_data %>% deparse() %>% clipr::write_clip()

Which places it on the clipboard. dput will make a nicer output for manual copying :+1:


#5

For sharing simple data.frames (those containing only basic types, no dates, no factors, and no row names) I suggest using wrapr::draw_frame() to build sharable examples.

For example, suppose our data was the following.

d <- head(ggplot2::diamonds) 

wrapr::draw_frame can share this data in a very legible form:

library("wrapr")
cat(draw_frame(d))

This outputs the following (older versions of wrapr do not add the "::" qualifier).

wrapr::build_frame(
   "carat", "cut"      , "color", "clarity", "depth", "table", "price", "x" , "y" , "z"  |
   0.23   , "Ideal"    , "E"    , "SI2"    , 61.5   , 55     , 326L   , 3.95, 3.98, 2.43 |
   0.21   , "Premium"  , "E"    , "SI1"    , 59.8   , 61     , 326L   , 3.89, 3.84, 2.31 |
   0.23   , "Good"     , "E"    , "VS1"    , 56.9   , 65     , 327L   , 4.05, 4.07, 2.31 |
   0.29   , "Premium"  , "I"    , "VS2"    , 62.4   , 58     , 334L   , 4.2 , 4.23, 2.63 |
   0.31   , "Good"     , "J"    , "SI2"    , 63.3   , 58     , 335L   , 4.34, 4.35, 2.75 |
   0.24   , "Very Good", "J"    , "VVS2"   , 62.8   , 57     , 336L   , 3.94, 3.96, 2.48 )

The point is, with the wrapr package loaded the above output is actually executable code that produces the same data.frame. One can then copy and paste the above code to start a fresh example from this data (and not need to include steps that took one to this point).
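
The "simple data.frame" restriction mentioned above (basic types only, no dates, no factors, no row names) can be checked before sharing. A rough base R sketch follows; `is_simple_frame()` is a hypothetical helper written for this post, not part of wrapr.

```r
# Hypothetical helper: TRUE when a data.frame holds only plain atomic columns
# and has default 1..n row names.
is_simple_frame <- function(d) {
  # Plain columns are atomic vectors without a class attribute,
  # which excludes factors, dates and list columns.
  basic_cols <- vapply(d, function(col) is.atomic(col) && !is.object(col), logical(1))
  # Row names count as default when they are just "1", "2", ..., "n".
  default_rows <- identical(rownames(d), as.character(seq_len(nrow(d))))
  all(basic_cols) && default_rows
}

# Made-up frames to try it on.
plain_d  <- data.frame(carat = c(0.23, 0.21), price = c(326L, 326L),
                       color = c("E", "E"), stringsAsFactors = FALSE)
factor_d <- data.frame(carat = c(0.23, 0.21), cut = factor(c("Ideal", "Premium")))

is_simple_frame(plain_d)
is_simple_frame(factor_d)
```

Note that `ggplot2::diamonds` stores `cut`, `color` and `clarity` as ordered factors, so in the output above they appear as plain strings.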

(Was asked to post this to this thread here.)


#6

Nice feature!

datapasta :package: has something very useful and similar

In a script you could do:

d <- head(ggplot2::diamonds)
datapasta::tribble_paste(d)

and the command will output a tribble call using the clipboard right at your cursor position! Very useful for reproducibility when preparing a reprex.

datapasta::tribble_paste(d)
tibble::tribble(
  ~carat,        ~cut, ~color, ~clarity, ~depth, ~table, ~price,   ~x,   ~y,   ~z,
    0.23,     "Ideal",    "E",    "SI2",   61.5,     55,   326L, 3.95, 3.98, 2.43,
    0.21,   "Premium",    "E",    "SI1",   59.8,     61,   326L, 3.89, 3.84, 2.31,
    0.23,      "Good",    "E",    "VS1",   56.9,     65,   327L, 4.05, 4.07, 2.31,
    0.29,   "Premium",    "I",    "VS2",   62.4,     58,   334L,  4.2, 4.23, 2.63,
    0.31,      "Good",    "J",    "SI2",   63.3,     58,   335L, 4.34, 4.35, 2.75,
    0.24, "Very Good",    "J",   "VVS2",   62.8,     57,   336L, 3.94, 3.96, 2.48
  )

If run in a script, the output will be pasted in the script; if run in the console, it will be pasted in the console.

One advantage is that no package other than tibble is needed to recreate the data.frame/tibble. datapasta is only needed to generate the tribble call from the data.frame object. Nice features!

datapasta::tribble_construct outputs a string that can be printed with cat.

There are also df_paste and df_construct for creating data.frame calls, and other features to discover in datapasta.


#7

datapasta looks neat. It definitely should get more attention.

I'd just say if one is debugging a data.table issue then not having to have tibble active is a similar advantage (one less possible source of interference).


#8

Thanks @cderv!

I didn't mention datapasta in the original write up because it will silently convert complex objects it can't write in a tribble to character. I thought this might be confusing for people new to this type of thing. However it tries to make up for that with convenience.

Small note: When you call datapasta *_paste functions with arguments they just write directly to the active source pane or console without going via the clipboard.
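
Because a silent conversion to character is easy to miss, it can help to compare column classes between the original data and whatever the pasted code rebuilds. A minimal base R sketch; `class_changes()` is a hypothetical helper, and `rebuilt` is typed by hand here to stand in for a pasted copy rather than being real datapasta output.

```r
# Hypothetical helper: list the columns whose first class differs
# between the original data and a rebuilt copy.
class_changes <- function(original, rebuilt) {
  first_class <- function(col) class(col)[1]
  before <- vapply(original, first_class, character(1))
  after  <- vapply(rebuilt[names(original)], first_class, character(1))
  changed <- before != after
  data.frame(
    column = names(original)[changed],
    before = unname(before[changed]),
    after  = unname(after[changed]),
    stringsAsFactors = FALSE
  )
}

original <- data.frame(
  when = as.Date(c("2016-10-27", "2016-10-28")),
  id   = c("0001234", "0001235"),
  stringsAsFactors = FALSE
)

# Stand-in for a pasted copy in which the date column came back as text.
rebuilt <- data.frame(
  when = c("2016-10-27", "2016-10-28"),
  id   = c("0001234", "0001235"),
  stringsAsFactors = FALSE
)

class_changes(original, rebuilt)
```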


#10

Good to know! Thanks for the clarification. I was confused by the _paste suffix. :wink:


#11

I agree!
Adding a datatable_paste and friends to datapasta could help those users. Currently there are only data.frame and tibble versions.


#12

Good idea. data.table would be a worthwhile addition.