Something that is not clear from this blog post is that running reprex::reprex() on some code is not enough to create a reprex. The output is a reprex only if there is no error message in the output (meaning the code is self-sufficient). In particular, if external data is loaded in the code, it will not be a reprex, even if the code is nicely formatted and includes the libraries.
I have seen obvious signs of confusion on several threads from new R users. I feel that it is easy to get confused between reprex the package (and function) and what a reprex really is. And the blog post makes it even more confusing because it sounds like reprex() creates a reprex.
The way I would try to make this clearer is by explaining what a reprex is and how to make one (no package involved here). Then bring up the reprex package and explain that it is a convenience which:
tests whether our reprex really is a reprex (if it is not, we get a nice error message)
formats the code by putting it in an rmarkdown chunk
has some very convenient copy-paste functionality
But the package does not create the reprex. The reprex was already there before we ran the reprex() function on it. And if it wasn't there, running the function will not magically turn it into a reprex.
I understand what you're saying, in that you can run the function over code and render something which is not a minimum viable reproducible example. However, especially in GitHub issues (even more so when they're in ggplot2, or any visualization context) we really are asking users to use the package because of the image generation and uploading functionality. Obviously this is getting things down to semantics, but, given the level of confusion around these things, I think it's important to try to disambiguate without decoupling a term that's being intentionally used (i.e. there's a reason we introduced the vocabulary).
I think @jennybryan describes what you've outlined above really well in the community call (starts ~10:40).
And in the accompanying slide deck.
The scenarios in which you can successfully run reprex and still not end up with a reproducible example (reprexes can contain errors; in fact, we often request them so we can see the exact error in context) are when:
Someone is loading data through a local path.
Someone is running reprex over a function definition, and excluding the problematic code.
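To make the local-path case concrete, here is a small sketch. The file path and the data values are made up for illustration; the point is only that the first version depends on one person's hard drive, while the second carries its data with it.

```r
# Not self-contained: the file lives only on the poster's machine
# (hypothetical path), so nobody else can run this line.
# dat <- read.csv("C:/Users/me/Desktop/my_data.csv")

# Self-contained alternative: build a small stand-in data frame inline
# (made-up values, just enough to show the behaviour in question).
dat <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1.2, 3.4, 2.1, 5.6)
)

# The code that demonstrates the problem now runs anywhere.
res <- aggregate(value ~ group, data = dat, FUN = mean)
print(res)
```

If the real data matters, calling dput() on a small slice of it prints R code that recreates that slice, which can then be pasted in place of the read.csv() line.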
The RStudio and Jenny's resources are phenomenal. I was criticizing neither the package (which I use all the time now!) nor those resources. I was only pointing out what seemed to me like a weakness in the blog post which is given as the first link in EconomiCurtis's message about reprex (so the link most likely to be followed at this point).
My perspective came about when seeing examples on this forum of users writing things like: "Here is a reprex" and what follows is not a reprex (because the user loaded local data so it is impossible to run their code to try to help them and that defeats the purpose).
Thanks for the background info on the choice of the name. I was myself a little confused when first looking at the package actually.
I understand why you want people to use the package, and its formatting and other functionalities are amazing. I love it. But since reprex() can create totally non-reproducible examples, I still find its name confusing.
Maybe I would have called the package forex for "formatted example". And then you'd want people to post "foreprex" (formatted, reproducible examples) by making sure that they'd run forex() on a reprex.
Or maybe a really drastic solution would be to have reprex::reprex() only output the error message if run on a non-reprex. That way, the only way to post a reprex somewhere while using the package is to make sure that the code really is self-contained.
And this might make sense: after all, if you try to run ggplot() on some code that is missing bits and pieces, you will not get a graph. But if you run reprex() on code missing important bits, you still get a nice-looking, nicely formatted chunk ready to be pasted. Sure, there is an error message at the bottom of it. But apparently, some people don't seem to notice it.
Do you think I could suggest this to the package maintainer in an issue? Or will everybody get annoyed and roll their eyes at this idea?
Many tidyverse functions are more strict than their base R counterparts and will not run, but throw an error message when there are problems or inconsistencies in the code. So this might actually fit the tidyverse philosophy. No?
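To make the stricter behaviour concrete, here is a rough base-R sketch of the kind of check I have in mind. The helper name runs_without_error() is hypothetical and is not part of the reprex package. It also only evaluates code in the current session, whereas (as far as I understand) reprex() renders in a fresh one, so treat it as an illustration of the idea rather than a proposal for how the package should implement it.

```r
# Hypothetical helper, not part of the reprex package: returns TRUE only
# if every expression in `code` parses and evaluates without an error.
# Caveat: this runs in the current session, so objects and packages that
# happen to be loaded here can mask missing pieces.
runs_without_error <- function(code) {
  env <- new.env()
  tryCatch(
    {
      exprs <- parse(text = code)
      for (expr in exprs) eval(expr, envir = env)
      TRUE
    },
    error = function(cond) FALSE
  )
}

# Uses a built-in dataset, so it should run anywhere.
ok_builtin <- runs_without_error("mean(mtcars$mpg)")

# Reads from a made-up local path, so it should fail.
ok_local <- runs_without_error('read.csv("no/such/folder/my_data.csv")')

print(c(builtin = ok_builtin, local_path = ok_local))
```

A strict mode along these lines would refuse to hand back formatted output whenever the check comes back FALSE, and show only the error instead.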
To offer another perspective, when somebody reprexes some non-self-contained code, I think:
They are signaling good faith and respect for their helpers, even though they didn't get it totally right
Now I can point to a specific error message when discussing what else they need to do to make the example reproducible (and maybe they'll even learn something about diagnosing self-containedness)
We are probably closer to a solution than if they had offered a loose verbal description of their problem or any of the thousand other things people post instead
So basically, I find the reprexing of non-reproducible examples to be an acceptable risk. I strongly support anything we can do with educational materials and software to help people learn how to turn their questions into reproducible examples! But I think that learning to do that requires a degree of abstraction and a more comprehensive idea of what the heck you're doing than lots of people have when they're struggling with something new.
Those of us with more experience frequently say that in the very act of composing the reproducible example, we often solve our own problem (true!), and I agree that this is an essential skill to learn, eventually. Speaking for myself, I don't want perfect success at this task to be a precondition for people getting help, because there's been plenty of that already in coding land (see: RTFM), and I am unsatisfied with the (lack of) inclusiveness that empirically seems to result.
When it comes to clarifying concepts, I agree that there are some muddied waters. Personally, I find myself moving towards using "reproducible example" or "self-contained example" to refer to the general concept, and reserving reprex (complete with formatting) for the tool. As cute as the word "reprex" is (it's really cute!), it's still yet another piece of jargon.
^^ Yes! And so often that's because we don't have your local hard drive.
I tend towards the same thing, as (personally) I can't stand reading unformatted code (seriously, if I am really wanting to read a blogpost where they don't have syntax highlighting, I'll wait until I'm on my computer and can copy and paste).
I think that, perhaps, a better solution would be to change the title of this FAQ, or separate it into two. A question about what constitutes a reproducible example, and one about how to use the reprex package to make a reprex, should you choose to do so.
When I ask for a reprex, I explain the origins of the portmanteau, but I'm specifically directing them to using the package. (And, as with so many neologisms, yes, it's possible for something to qualify, without serving the intended purpose — not all raincoats are actually waterproof!)
When I'm describing reproducible examples, I point to resources that aren't contingent on reprex.
I think the more serious issue is that ca. 90-95% (my guesstimate) of requests for reprexes (complete with links) are ignored.
Some of this may be due to the issues debated in this thread, but there is also a large element of wanting answers without putting the effort into the question. I doubt this can be solved easily on a site like this with its liberal posting policies (as opposed to SO where the posts would likely get deleted), but I hope there is a way to increase the response rate for reprex requests (or at least something that can be a starter).
Since there are a lot of posts already linking to this topic — including with oneboxes — maybe it keeps a largely similar title and has a prominent link to the new complementary post? Or it becomes a landing page with a brief explanation and links to two new topics (like a Wikipedia disambiguation page)?
I think this is why non-self-contained reprex output feels to me like half the battle has been won, though I take seriously the concern that those posters who are trying might be dispirited if they think they've done as asked and it's still "not right".
Oh, absolutely. But if the function only gives the user an error message (and those errors are informative if the user does pay attention to them and realizes that this isn't quite a reprex yet), it is doing that work for you. And I am clearly not doubting anybody's good faith at all. Just thinking of efficient solutions.
I think it can still be all in one thread as long as the distinction is made clear (again, re: my initial message on that). All the external resources given on this thread so far are about the package (so the formatting). Maybe adding an intro on what a reprex really is and adding, for instance, this great link, then moving on to the reprex package might make things clearer:
(with 1400 upvotes, the main answer is quite a solid one).
Obviously, the answer pre-dates the reprex package (and even the term "reprex"), so the technicalities about pasting, etc. are outdated. But an excerpt of the relevant parts could be really useful:
A minimal reproducible example consists of the following items:
a minimal dataset, necessary to reproduce the error
the minimal runnable code necessary to reproduce the error, which can be run on the given dataset.
the necessary information on the used packages, R version, and system it is run on.
in the case of random processes, a seed (set by set.seed()) for reproducibility
Looking at the examples in the help files of the used functions is often helpful. In general, all the code given there fulfills the requirements of a minimal reproducible example: data is provided, minimal code is provided, and everything is runnable.
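As a rough illustration of that checklist, a complete example might look something like the sketch below. The variable names and the toy regression are my own placeholders, not taken from the linked answer; only base R is used, so there are no library() calls to show.

```r
# 1. Packages: load everything the code needs up front
#    (nothing beyond base R in this sketch).

# 2. Seed, because the data below is random.
set.seed(42)

# 3. Minimal dataset, created by the code itself.
dat <- data.frame(x = 1:10, y = rnorm(10))

# 4. Minimal runnable code that shows the behaviour in question.
fit <- lm(y ~ x, data = dat)
print(coef(fit))

# 5. R version, platform, and attached packages.
print(sessionInfo())
```

Anyone can paste this into a fresh R session and, because of the seed, work from the same numbers as the person asking.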
The parts about "minimality" can be omitted to keep things easier. If the example is reproducible, that is amazing enough I guess, and it's ok if it is not minimal.
Edit: I should have read the whole post before starting to type (typical me...) and just realized that you are mentioning the same link. I feel some of this info could be included in EconomiCurtis' post though so that it is all in one place that we can easily post links to.
No big! It's all those posts (plus Jenny's trove of teaching experience/dealing with hundreds of issues per semester) that were the motivation to make it easier to shorten the gap between what's needed, and how hard it is to make one. I don't think they're mutually exclusive, I think one's a tool to help accomplish the goal described in both.
I think a major factor that's unfortunate, but true, is that there's a limited attention span for this info among users, so I think nuance gets dropped in the summaries in favour of the whole thing being read, which I think happens on the site, too. Like, I've started to toggle away part of my "classic" reprex response, because I think once it gets over a certain length, people just ignore it entirely.
^^ that point is orthogonal to the other one, I just made that mental leap!
The goal of this FAQ - Replying to a few people, the point of this page is to help people understand how to create a reproducible example. I argue that we should converge towards referring to a "reprex" as the idea of a reproducible example (as Romain "first" used it), and the reprex-package referring to the tool that helps people create reproducible examples.
If there's a better paradigm, I'm happy to stick with that for consistency and encourage others to as well.
Changes to this page - I actually think this page is getting a little too long and "unfriendly" to people new to the site. That is, unfriendly in the amount of text we suggest they read (that's why I moved this thread into a new discussion).
Flagging no reprex
There are a few of us starting to experiment with flagging users who don't provide a reprex.
Specifically, you'd flag a user, select custom, and mention lack of reprex.
That would then trigger me (manually at this point, but automated if it's a feature we desire) to send a polite PM.
We should be requesting reprexes publicly to set a good example, but in a Private Message I can spend a lot more time explaining why it's important, linking to meta-resources, replying to their specific concern, and being a bit more direct than is considered polite in a public thread.
A PM therefore is less distracting from the original thread.
Flags like this in low numbers have no effect on the flagged user. But if a user has a bunch of no-reprex flags we'll escalate our intervention. (We don't want help vampires here.) So, I think I'd prefer to link to other resources.
I really like this. The expression "reprex" would otherwise be hijacked from its author.
In the issue I posted on the reprex-package, I used reprex (no marking) when meaning reproducible example (having fully accepted it as short for that expression as it was initially coined), reprex when referring to the package, reprex() when referring to the function, and "reprex" when referring to the output of the function when it is not a reprex (and it is the existence of this last category that pushed me to open up the issue and to suggest a change that would (in my naive noob opinion) get rid of all the semantic nonsense).
At the risk of expressing an unpopular opinion, I'd like to say I'm not a fan of the "reprex" neologism. At present it appears to only be coming into vogue among (some) R users, and as someone who works with other technologies seeing this term only in R contexts makes it seem really forced and awkward. I don't mind the usage as a package name and I'm absolutely on board with the value of reproducible examples and teaching new users the importance of creating them, but somehow the term itself aggravates my curmudgeonly old eyes. I don't mean to be trolling here, I just thought others might share this feeling but were afraid to say so.