Best Practices: how to prepare your own data for use in a `reprex` if you can’t, or don’t know how to reproduce a problem with a built-in dataset?

milesmcbain · February 16, 2018, 7:24am

@EconomiCurtis split this out of FAQ: What's a reproducible example (`reprex`) and how do I do one?.

Curious if you have anything additional to add specifically on "how to prepare your own data for use in a reprex if you can't, or don't know how to reproduce a problem with a built-in dataset."

I think @jessemaegan's post is about 80% there. The piece it is missing, if your average Stack Overflow post is any indication, is an explanation of how to prepare your own data for use in a reprex if you can't, or don't know how to, reproduce a problem with a built-in dataset.

Some handy things to know for this situation:

deparse()
The ugly as sin, gold standard:

head(my_data, 2) %>%
  deparse()

returning something like:

structure(list(date = list(structure(-61289950328, class = c("POSIXct", 
"POSIXt"), tzone = ""), structure(-61258327928, class = c("POSIXct", 
"POSIXt"), tzone = "")), id = c("0001234", "0001235"), ammount = c("$18.50", 
"-$18.50")), class = "data.frame", .Names = c("date", "id", "ammount"
), row.names = c(NA, -2L))

Which is not beginner friendly... what's a structure? But it is really the only method that will not mess with the data types. It also works with both listy structures and data.frame-ish ones.
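To illustrate the round trip, here is a minimal base R sketch. The data frame and its column names (`id`, `when`) are made up for illustration and are not the data from the output above; the point is only that the deparsed text, once evaluated, rebuilds an object with the same column types.

```r
# Hypothetical example data: a character column and a date-time column
my_data <- data.frame(
  id = c("0001234", "0001235"),
  when = as.POSIXct(c("2016-10-27 21:00:00", "2016-10-28 21:05:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# deparse() returns a character vector of code lines; join them into one string
code <- paste(deparse(my_data), collapse = "\n")
cat(code)

# This is what a reader of your reprex does: evaluate the pasted code
rebuilt <- eval(parse(text = code))

# Compare the rebuilt object with the original, including column classes
all.equal(my_data, rebuilt)
class(rebuilt$when)
```

In a real reprex you would paste the printed `structure(...)` call after `my_data <-` rather than calling `eval(parse())` yourself.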

tibble::tribble()
Handy if you have the patience to hand-type some data for your audience in a pretty format. There is a severe limitation in that not all data types can be represented in a tribble(). The previous would be something close to:

tibble::tribble(
               ~date,       ~id,  ~ammount,
  "27/10/2016 21:00", "0001234",  "$18.50",
  "28/10/2016 21:05", "0001235", "-$18.50"
  ) %>%
  dplyr::mutate(date = lubridate::parse_date_time(date, orders = c("d!/m!/Y! H!:M!")))

With the trailing mutate to fix the date that could not be represented. It would be remiss of me not to plug datapasta::tribble_paste() which can save you some typing here.

readr::read_csv()
It's possible to represent your data, complete with type specification, as a read_csv() call. The previous would be:

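readr::read_csv('date, id, amount
"27/10/2016 21:00", 0001234,  $18.50
"28/10/2016 21:05", 0001235, -$18.50',
  # Column types are matched by position: date-time, then two character columns
  col_types = readr::cols(
    readr::col_datetime(format = "%d/%m/%Y %H:%M"),
    readr::col_character(), readr::col_character()
  )
)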

krlmlr/deparse
Not yet on CRAN. A nicer version of deparse() (the first option above) that can also get you directly to a tribble() (the second option) in some cases. https://github.com/krlmlr/deparse

Edit: you can always use data.frame(), tibble(), list() etc!
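As a rough sketch of that last option, the two example rows could be written out by hand with base R's data.frame(). The dates are taken from the tribble() example above; the UTC time zone is an assumption, since the original data does not state one.

```r
# Hand-written version of the example data, using only base R
my_data <- data.frame(
  # Assumed time zone: UTC (not given in the original data)
  date = as.POSIXct(c("2016-10-27 21:00", "2016-10-28 21:05"), tz = "UTC"),
  id = c("0001234", "0001235"),
  ammount = c("$18.50", "-$18.50"),
  stringsAsFactors = FALSE  # keep id and ammount as character, not factor
)

# str() shows the column types, which is worth including in a reprex
str(my_data)
```

This needs no extra packages on the reader's side, at the cost of typing the values yourself.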

hughparsonage · February 18, 2018, 12:35pm

dput(., control = NULL) can often be a bit clearer.
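A small sketch of the difference, using a made-up two-column data frame. With `control = NULL` the attributes are left out, so the output is shorter, but it is a plain list rather than a data frame, and classed columns such as dates or factors would lose their class.

```r
# Hypothetical example data with only basic column types
d <- data.frame(id = c("a", "b"), n = c(1.5, 2), stringsAsFactors = FALSE)

dput(d)                  # default: a structure(...) call including attributes
dput(d, control = NULL)  # attributes dropped: a plain list(...) call

# deparse() accepts the same control argument; capture the plain form as text
plain <- paste(deparse(d, control = NULL), collapse = "")

# The reader has to wrap the list in as.data.frame() to get a data frame back
rebuilt <- as.data.frame(eval(parse(text = plain)), stringsAsFactors = FALSE)
all.equal(d, rebuilt)
```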

jakekaupp · February 18, 2018, 12:38pm

The prevalent advice on SO is to use dput() for more complicated data, but posts should strive to make a workable, less complicated example if possible.

milesmcbain · February 19, 2018, 2:20am

Yes, you're totally right. I forgot about this because I usually do something like:
my_data %>% deparse() %>% clipr::write_clip()

Which places it on the clipboard. dput() will make nicer output for manual copying.

JohnMount · June 12, 2018, 7:34pm

For sharing simple data.frames (those containing only basic types, no dates, no factors, and no row names) I suggest using wrapr::draw_frame() to build sharable examples.

For example, suppose our data was the following.

d <- head(ggplot2::diamonds)

wrapr::draw_frame can share this data in a very legible form:

library("wrapr")
cat(draw_frame(d))

This outputs the following (older versions of wrapr do not add the "::" qualifier).

wrapr::build_frame(
   "carat", "cut"      , "color", "clarity", "depth", "table", "price", "x" , "y" , "z"  |
   0.23   , "Ideal"    , "E"    , "SI2"    , 61.5   , 55     , 326L   , 3.95, 3.98, 2.43 |
   0.21   , "Premium"  , "E"    , "SI1"    , 59.8   , 61     , 326L   , 3.89, 3.84, 2.31 |
   0.23   , "Good"     , "E"    , "VS1"    , 56.9   , 65     , 327L   , 4.05, 4.07, 2.31 |
   0.29   , "Premium"  , "I"    , "VS2"    , 62.4   , 58     , 334L   , 4.2 , 4.23, 2.63 |
   0.31   , "Good"     , "J"    , "SI2"    , 63.3   , 58     , 335L   , 4.34, 4.35, 2.75 |
   0.24   , "Very Good", "J"    , "VVS2"   , 62.8   , 57     , 336L   , 3.94, 3.96, 2.48 )

The point is, with the wrapr package loaded the above output is actually executable code that produces the same data.frame. One can then copy and paste the above code to start a fresh example from this data (and not need to include steps that took one to this point).

(Was asked to post this to this thread here.)

cderv · June 12, 2018, 9:38pm

Nice feature!

datapasta has something very useful and similar.

In a script, you could run:

d <- head(ggplot2::diamonds)
datapasta::tribble_paste(d)

and the command will output a tribble call, using the clipboard, right at your cursor position! Very useful for reproducibility when preparing a reprex.

datapasta::tribble_paste(d)
tibble::tribble(
  ~carat,        ~cut, ~color, ~clarity, ~depth, ~table, ~price,   ~x,   ~y,   ~z,
    0.23,     "Ideal",    "E",    "SI2",   61.5,     55,   326L, 3.95, 3.98, 2.43,
    0.21,   "Premium",    "E",    "SI1",   59.8,     61,   326L, 3.89, 3.84, 2.31,
    0.23,      "Good",    "E",    "VS1",   56.9,     65,   327L, 4.05, 4.07, 2.31,
    0.29,   "Premium",    "I",    "VS2",   62.4,     58,   334L,  4.2, 4.23, 2.63,
    0.31,      "Good",    "J",    "SI2",   63.3,     58,   335L, 4.34, 4.35, 2.75,
    0.24, "Very Good",    "J",   "VVS2",   62.8,     57,   336L, 3.94, 3.96, 2.48
  )

If run in a script, the output will be pasted into the script; if run in the console, it will be pasted into the console.

One advantage is that no package other than tibble is needed to recreate the data.frame/tibble. datapasta is only needed to generate the tribble call from the data.frame object. Nice features!

datapasta::tribble_construct outputs a string that can be printed with cat.

There are also df_paste and df_construct for creating a data.frame call, along with other features you can discover in datapasta.

JohnMount · June 12, 2018, 10:34pm

datapasta looks neat. It definitely should get more attention.

I'd just say if one is debugging a data.table issue then not having to have tibble active is a similar advantage (one less possible source of interference).

milesmcbain · June 12, 2018, 10:37pm

Thanks @cderv!

I didn't mention datapasta in the original write up because it will silently convert complex objects it can't write in a tribble to character. I thought this might be confusing for people new to this type of thing. However it tries to make up for that with convenience.

Small note: When you call datapasta *_paste functions with arguments they just write directly to the active source pane or console without going via the clipboard.

cderv · June 13, 2018, 6:02am

Good to know! Thanks for the clarification. I was confused by the _paste suffix.

cderv · June 13, 2018, 6:07am

I agree!
Adding a datatable_paste and friends to datapasta could help those users. Currently there are only data.frame and tibble versions.

milesmcbain · June 13, 2018, 6:22am

Good idea. data.table would be a worthwhile addition.

jonocarroll · July 20, 2018, 1:52pm

I thoroughly agree. Here you go.