Coming up with Example/Dummy Datasets for a REPREX

EconomiCurtis · August 23, 2018, 10:59am

Coming up with example and dummy data is an important part of a reprex, and it is not covered well in the "FAQ: What's a reproducible example (reprex) and how do I do one?" and "FAQ: Tips for writing R-related questions" guides.

The goal of this topic is to recap a private discussion sustainers had on this, and to draft a new section on creating dummy datasets for reprexes.

Borrowed heavily from

Stack Overflow's excellent "How to make a great R reproducible example"
Best Practices: how to prepare your own data for use in a `reprex` if you can’t, or don’t know how to reproduce a problem with a built-in dataset?

Producing a minimal dataset

Built in datasets

You can use one of the built-in datasets that ship with base R and with many packages.
A comprehensive list of the base datasets can be seen with library(help = "datasets"). Each dataset has a short description there, and more information is available from its help page, for example ?mtcars, where 'mtcars' is one of the datasets in the list. Other packages contain additional datasets, for example ggplot2's diamonds dataset.
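
As a small illustration of the paragraph above, the sketch below lists the datasets in the base datasets package and takes a few rows and columns of mtcars for a reprex. The object names (available, small_cars) are only placeholders chosen for this example.

```r
# List the names of the datasets shipped with the base 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Keep only the rows and columns needed to show the problem
small_cars <- head(mtcars[, c("mpg", "cyl", "wt")], 3)
small_cars
```

Trimming a built-in dataset this way keeps the reprex short while still letting anyone run it without extra files.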

Creating your own vector and data frame

Making a vector is easy. Sometimes it is necessary to add some randomness to it, and there are a number of functions for that. sample() can shuffle a vector, or draw a random vector from only a few values. letters is a useful built-in vector containing the lowercase alphabet, which is handy for making factors.

A few examples :

x = c(1,2,3)
random values : x <- rnorm(10) for normal distribution, x <- runif(10) for uniform distribution. (Here's a list of all distributions in the R stats package; see also ?Distributions)
a permutation of some values : x <- sample(1:10) for vector 1:10 in random order.
a random factor : x <- sample(letters[1:4], 20, replace = TRUE)
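
Because these vectors are random, a reprex that uses them should also fix the random seed so that readers get the same values. A minimal sketch, with an arbitrary seed value of 42:

```r
# Fix the random seed so everyone who runs the reprex gets the same values
set.seed(42)
x1 <- rnorm(5)
g1 <- sample(letters[1:4], 10, replace = TRUE)

# Re-setting the same seed reproduces exactly the same draws
set.seed(42)
x2 <- rnorm(5)
g2 <- sample(letters[1:4], 10, replace = TRUE)

identical(x1, x2)
identical(g1, g2)

# Turn the random letters into a factor with a known set of levels
f <- factor(g1, levels = letters[1:4])
levels(f)
```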

Data frames can be made with data.frame(). Take care to name the columns, and keep the data frame as simple as the problem allows.

An example :

set.seed(1)
data <- data.frame(
  X = sample(1:5),
  Y = sample(c("yes", "no"), 5, replace = TRUE)
)
data
#>   X   Y
#> 1 2  no
#> 2 5  no
#> 3 4  no
#> 4 3  no
#> 5 1 yes

Created by the reprex package (v0.2.0.9000).

Some questions need data in a specific format. For these, use one of the as.someType functions: as.factor, as.integer, as.numeric, as.character, as.Date, or as.xts (from the xts package).
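
A short sketch of such conversions, using a made-up three-column data frame (the column names when, group and count are just for illustration):

```r
# Start from plain character columns, as you might after typing data by hand
df <- data.frame(
  when  = c("2016-10-27", "2016-10-28"),
  group = c("a", "b"),
  count = c("3", "7"),
  stringsAsFactors = FALSE
)

# Convert each column to the type the question is actually about
df$when  <- as.Date(df$when)      # character -> Date (ISO yyyy-mm-dd assumed)
df$group <- as.factor(df$group)   # character -> factor
df$count <- as.integer(df$count)  # character -> integer

str(df)
```

Ending with str() shows readers the column types at a glance.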

tibble::tribble() - Handy if you have the patience to type out some data by hand in a layout that is easy for your audience to read. There is a severe limitation: not all data types can be represented in a tribble().

library(tibble); library(dplyr)
df <- tibble::tribble(
  ~date,       ~id,  ~amount,
  "27/10/2016 21:00", "0001234",  "$18.50",
  "28/10/2016 21:05", "0001235", "-$18.50"
) %>%
  mutate(date = lubridate::parse_date_time(date, orders = c("d!/m!/Y! H!:M!")))
df
#> # A tibble: 2 x 3
#>   date                id      amount 
#>   <dttm>              <chr>   <chr>  
#> 1 2016-10-27 21:00:00 0001234 $18.50 
#> 2 2016-10-28 21:05:00 0001235 -$18.50

Created by the reprex package (v0.2.0.9000).

readr::read_csv() - It's possible to represent your data, complete with type specification, as a read_csv() call. This can be helpful when you want to copy and paste from a CSV file.
The previous example would be:

library(readr)
df <- readr::read_csv(
'date, id, amount
27/10/2016 21:00, 0001234,  $18.50
28/10/2016 21:05, 0001235, -$18.50',
  col_types = list(col_datetime(format = "%d/%m/%Y %H:%M"),  
                   col_character(), col_character() )
)
df
#> # A tibble: 2 x 3
#>   date                id      amount 
#>   <dttm>              <chr>   <chr>  
#> 1 2016-10-27 21:00:00 0001234 $18.50 
#> 2 2016-10-28 21:05:00 0001235 -$18.50

Created by the reprex package (v0.2.0.9000).

read.table - Worst case scenario, you can give a text representation that can be read in using the text parameter of read.table :

df_txt <- 'date, id, amount
27/10/2016 21:00, 0001234,  $18.50
28/10/2016 21:05, 0001235, -$18.50'

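# sep = "," splits on commas, strip.white trims the padding, and reading
# every column as character keeps the leading zeros in id
df <- read.table(text = df_txt, header = TRUE, sep = ",",
                 strip.white = TRUE, colClasses = "character")
df
#>               date      id  amount
#> 1 27/10/2016 21:00 0001234  $18.50
#> 2 28/10/2016 21:05 0001235 -$18.50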

Created by the reprex package (v0.2.0.9000).

Copy your data

If you have some data that would be too difficult to construct using the tips above, you can always take a subset of your original data, using e.g. head(), subset() or indexing. Then use dput() to give us something that can be pasted into R immediately.

For example with the built-in dataset iris:

dput(head(iris,4))

will produce the output:

structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5, 
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2, 
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa", 
"versicolor", "virginica"), class = "factor")), row.names = c(NA, 
4L), class = "data.frame")

If your data frame has a factor with many levels, the dput output can be unwieldy, because it lists all the possible factor levels even if they aren't present in the subset of your data.
To solve this issue, you can use the droplevels() function. Notice below how Species is a factor with only one level:

> dput(droplevels(head(iris, 4)))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5, 
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2, 
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = "setosa",
class = "factor")), .Names = c("Sepal.Length", "Sepal.Width", 
"Petal.Length", "Petal.Width", "Species"), row.names = c(NA, 
4L), class = "data.frame")
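
To see why dput output is useful to readers, the sketch below simulates the round trip: it captures the text dput() prints, then evaluates that text to rebuild the object, which is what happens when someone pastes it into their own session. The names original, pasted and rebuilt are placeholders for this example.

```r
original <- droplevels(head(iris, 4))

# Capture the text that dput() prints, as a reader would copy it from a post
pasted <- paste(capture.output(dput(original)), collapse = "\n")

# Evaluating that text rebuilds the object in a fresh variable
rebuilt <- eval(parse(text = pasted))

# Check that the round trip preserved the data and the factor levels
all.equal(original, rebuilt)
levels(rebuilt$Species)
```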

EconomiCurtis · August 23, 2018, 11:02am

Some notes from a private discussion on this topic:

@rensa:

Is it worth expanding the reprex FAQ with a small section on coming up with dummy data? It could be useful for (a) people who are working with confidential or private data, and (b) people who're super deep in the middle of an analysis and aren't sure how to boil it down to a reprex.

@jcblum:

Honestly, data is such a stumbling block for reprexes, I feel like there’s practically a short book’s worth of material to cover...

The SO version of how to make a good R reprex has some more specific recipes for data generation: how-to-make-a-great-r-reproducible-example

And there’s this, too: https://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l

@jcblum:

I’ve been trying to sort of informally keep track of various fake-data generating packages. There’s rOpenSci's charlatan, but every time I try to use it I get frustrated. It also doesn’t have enough types of data for my taste.

I much prefer Tyler Rinker’s wakefield, but when giving advice to newbies I’d rather have a package on CRAN.

There are entire fields where people commonly use data blinding of various sorts. I’m not in one of those fields, but maybe there’s prior work there?

@jonspring

I think more people would follow the reprex advice if it felt easier to get started. (And I'm torn here, b/c it's often that friction of working through it that helps people fix their problems themselves...)

Maybe a version of @jcblum's 2nd link above, or a simple function to generate a generic data frame with certain headers and column types...?

Clippy: It looks like this question could use a reprex. Would you like help?
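
One possible reading of the "simple function to generate a generic data frame" idea, as a rough sketch only: make_dummy_df() is a hypothetical helper written for this post, not a function from any package, and the supported types and value ranges are arbitrary choices.

```r
# Hypothetical helper (not from any package): build a dummy data frame
# from a named character vector mapping column names to column types.
make_dummy_df <- function(n, types, seed = 1) {
  set.seed(seed)  # fixed seed so the dummy data is reproducible
  generators <- list(
    integer   = function(n) sample.int(100, n, replace = TRUE),
    numeric   = function(n) round(rnorm(n), 2),
    character = function(n) sample(letters, n, replace = TRUE),
    factor    = function(n) factor(sample(c("a", "b", "c"), n, replace = TRUE)),
    Date      = function(n) as.Date("2018-01-01") + sample.int(365, n, replace = TRUE)
  )
  unknown <- setdiff(types, names(generators))
  if (length(unknown) > 0) {
    stop("Unsupported column type(s): ", paste(unknown, collapse = ", "))
  }
  # lapply() keeps the names of 'types', which become the column names
  cols <- lapply(types, function(type) generators[[type]](n))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

dummy <- make_dummy_df(
  5,
  c(id = "integer", score = "numeric", group = "factor", day = "Date")
)
str(dummy)
```

Something like this could stand in for confidential data when only the column names and types matter for the question.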