How to upload or share data files here

Andrea · May 24, 2018, 8:40am

Hi,

sometimes, to create a Minimum Reproducible Example, it would be much easier to upload a csv or a txt file here. However, the upload popup here only allows me to upload images. Why was it designed this way? Surely not for the file size - text files (at least the ones I'd like to use) are much smaller than most image files. I'm allowed to upload a pdf with the data, but it would be absurd (and very disrespectful) to expect someone to use pdftools to extract data from it (I would be the first to refuse to do that, so I would never expect other people to do that!).

I know the first reaction to this will be "why don't you recreate your issue in a small data frame that you can include in the reprex, rather than in an external text file?". But it's often the case, when dealing with I/O issues, that:

- I don't know exactly what is causing the issue
- the text file is too big to include in the reprex (say, 1000x30) but still so small that loading it in R on any modern PC is basically instantaneous, so I'm not wasting people's time.

Of course, I'll first try to create a Minimum Reproducible Example which doesn't need the full data, but if I try and fail to, it'd be nice to be able to share data.

Another solution would be to upload the file to Google Drive and share the URL here.

mara · May 24, 2018, 1:09pm

Just an FYI, this site uses Discourse, so I'm not sure anyone here will be able to answer your question re. the design choices.

Another option (in addition to Google Drive) is to use a gist:
https://gist.github.com/

EconomiCurtis · May 24, 2018, 1:18pm

I would personally encourage making a small example with dput or tribble or tibble. And failing that, linking to github-gist/google-drive/dropbox/box/ms-drive, etc., for larger files. I think Discourse hasn't gotten to this because there are a plethora of options already.
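As a rough illustration of the dput route, here is a minimal sketch using base R only. The choice of mtcars and of these three columns is just an assumption for the example; any small slice of your own data would do.

```r
# Take a small slice of the data: 6 rows, 3 columns (arbitrary choice)
small <- head(mtcars[, c("mpg", "cyl", "wt")], 6)

# dput() prints a structure(...) call that can be pasted straight into a post
dput(small)

# Anyone who evaluates that printed text gets the same data frame back
rebuilt <- eval(parse(text = capture.output(dput(small))))
stopifnot(isTRUE(all.equal(rebuilt, small)))
```

Readers can then copy the printed `structure(...)` call into their own session without downloading anything.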

In the long-run (one-to-two years out...?) I hope rstudio.cloud will be a good option for complicated shares, like shiny apps.

tbradley · May 24, 2018, 1:18pm

You should check out the datapasta package. It makes adding data as R code super easy. I highly recommend it.

Andrea · May 24, 2018, 3:30pm

I've never used datapasta so I can't say for sure, but I don't think it would help in my case. I can't include a 1000x30 table in a post here: it would be unreadable, and difficult for people to copy to their R environment. +1 for mentioning such a great package though.

pssguy · May 24, 2018, 3:33pm

I think everybody agrees that a small example should be your first step. 6x30, say, would be a lot more manageable via dput.

nwerth · May 24, 2018, 3:51pm

If your problem is I/O, then just add something like this to the top of your reprex:

datafile <- tempfile()
write.csv(mtcars, datafile) # or whatever data in whatever format
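If the suspect is the file's formatting rather than the data itself, the same idea can be sketched by writing the raw text by hand and reading it back. The file contents below are invented for illustration (a quoted value with a decimal comma), not taken from anyone's real data.

```r
# Write a tiny file that mimics the suspected formatting problem
datafile <- tempfile(fileext = ".csv")
writeLines(c(
  "id,value",
  "1,3.5",
  "2,NA",
  "3,\"4,2\""   # quoted value with a decimal comma: the hypothetical culprit
), datafile)

# Read it back exactly as the real script would
result <- read.csv(datafile)

# The one odd row is enough to stop 'value' from being read as numeric
str(result)
```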

If you think your problem requires the data to be large but isn't related to memory limits, then I doubt a good reprex requires more than 3 rows or values as test data:

- Something that should succeed
- Something that should fail
- Maybe an edge case.
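A minimal sketch of such three-row test data might look like this. The function `parse_amount` is a hypothetical stand-in for whatever code is misbehaving, and the inputs are made up.

```r
# Hypothetical function under test: strips thousands separators, then converts
parse_amount <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# One row per case: should succeed, should fail, edge case
cases <- data.frame(
  label = c("should succeed", "should fail", "edge case"),
  input = c("1,234.5", "abc", ""),
  stringsAsFactors = FALSE
)

# Conversion warnings are suppressed because the failure is intentional here
cases$output <- suppressWarnings(parse_amount(cases$input))
print(cases)
```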

Part of why minimal reprexes (reprices?) are encouraged is because they force people through a process:

- Go through the program step by step, comparing the data along the way with their expectations, until they find the pain point.
  - If they can't do this easily, they see the value in rewriting the program so they can.
- Make a test, even an informal check like print(x), that compares the result with their expectation.
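The process above can be sketched roughly as follows, checking each intermediate result against an expectation. The input values are invented; the third one contains a letter O instead of a zero to stand in for the "pain point".

```r
# Invented raw input; "3O" uses a capital letter O, not a zero
raw <- c(" 10", "20 ", "3O")

# Step 1: trim whitespace, then check that the step did what was expected
trimmed <- trimws(raw)
stopifnot(!any(grepl("^\\s|\\s$", trimmed)))

# Step 2: convert to numbers (warning suppressed; the failure is inspected below)
values <- suppressWarnings(as.numeric(trimmed))

# Informal check: which inputs did not convert?
bad <- which(is.na(values))
print(trimmed[bad])
```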

Effectively, the skills for making a reprex are the skills for debugging and writing unit tests. There's a decent chance that, once you've made a minimal reprex, you've solved your own problem.

tbradley · May 24, 2018, 4:03pm

You are right that it would be unreadable in the browser, but you can use it. I just did it as a test and it works with a 1000x30 data frame of random numbers. I will say that I had to change the maximum allowed number of rows from 200 to 1001 for it to work and the output definitely wasn't pretty (and it took several seconds to run). However, with that said, I agree with @nwerth that there probably aren't very many cases where you would have to use your full data set, and you should always try to use a subset of your data that represents what you are trying to do.

Andrea · May 24, 2018, 4:56pm

I don't think that's always true. I once had an issue where the data, once loaded, would occupy much more memory than I would expect based on its size on disk. The issue wouldn't reappear if I took just a small random sample of rows.
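For what it's worth, here is a minimal sketch of one way a gap between disk size and memory size can arise. It uses an invented column of 0/1 flags and is not a reconstruction of the original issue.

```r
# Invented data: 10,000 single-digit flags
datafile <- tempfile(fileext = ".csv")
flags <- data.frame(flag = rep(c(0L, 1L), times = 5000))
write.csv(flags, datafile, row.names = FALSE)

# Each value takes about two bytes on disk (a digit plus a newline),
# but four bytes in memory once read.csv() stores it as an integer
loaded <- read.csv(datafile)
disk_bytes <- file.size(datafile)
memory_bytes <- as.numeric(object.size(loaded))

cat("on disk:", disk_bytes, "bytes; in memory:", memory_bytes, "bytes\n")
```

Comparing `file.size()` with `object.size()` like this is also a cheap first check to include in a reprex about memory use.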

Having said that, I agree that linking a "bigger than 10 rows" data set should only be attempted as a last resort.

jcblum · May 24, 2018, 5:50pm

I think that it is technically possible for a Discourse instance to be set up to accept other types of files as message attachments (rather than displayed-inline uploads) — I believe there's an allowed file extension list somewhere in the site settings.

However, speaking for myself, I'd much rather be presented with a link to a GitHub gist with its nice CSV formatting, where I can inspect the file before I download it, than be asked to trust that a forum message file attachment is what it claims to be. I know you are all fine, upstanding people, but downloading random files from forums just seems like a great way to wind up with something nasty on your computer.

mara · May 24, 2018, 6:16pm

Precisely! That's another reason I suggested gists – at the very least, you have the chance to datapasta it yourself given how it displays!

Andrea · May 25, 2018, 1:33pm

@jcblum this is a great point and one that in my naivete I didn't consider. Probably this is why only image attachments are allowed.