Issues with CSV uploads and character encoding in Shiny

jseiden · June 14, 2018, 3:29pm

Hi everyone,

I'm new to Shiny and having some problematic behavior with my first app. Thank you in advance for your help.

I've created a simple app that allows me to call a complex scoring algorithm for a standardized assessment of child data. My colleague created a package that processes the raw data and generates various composite scores. https://github.com/marcus-waldman/credi

Knowing that many non-R users are going to want to use this, I wanted to create a simple Shiny app that allows users to upload their data and download the processed results. Conceptually it does the following:

1. The user uploads a CSV with raw scores.
2. The app subsets the CSV to the relevant variables so that it can be fed into the scoring package.
3. The scoring package generates the processed scores.
4. The app merges the processed scores with the uploaded dataset.
5. The app provides a download link for the processed files.
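
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual app: `score_stub`, `process_upload`, the `ID` key column, and the `Q1`/`Q2` item names are all hypothetical stand-ins, since the real scoring call comes from the credi package and is not shown in this post.

```r
# Hypothetical stand-in for the scoring package: sums the item columns per row.
score_stub <- function(items) {
  item_cols <- setdiff(names(items), "ID")
  data.frame(ID = items$ID, TOTAL = rowSums(items[, item_cols, drop = FALSE]))
}

# Steps 1-4: read the upload, keep the scoring variables, score, merge back.
# Assumes the upload has an "ID" column that identifies each row.
process_upload <- function(path, item_cols, score_fun = score_stub) {
  rawdat <- read.csv(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)
  items  <- rawdat[, c("ID", item_cols), drop = FALSE]
  scores <- score_fun(items)
  merge(rawdat, scores, by = "ID", all.x = TRUE)
}

# Step 5: wire the pipeline into Shiny (only defined if shiny is installed).
if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::fileInput("file", "Upload CSV", accept = ".csv"),
    shiny::downloadButton("download", "Download scored data")
  )

  server <- function(input, output, session) {
    output$download <- shiny::downloadHandler(
      filename = function() "scored.csv",
      content = function(file) {
        shiny::req(input$file)  # wait until a file has been uploaded
        scored <- process_upload(input$file$datapath, item_cols = c("Q1", "Q2"))
        write.csv(scored, file, row.names = FALSE)
      }
    )
  }
  # shiny::shinyApp(ui, server)  # uncomment to launch the app
}
```

Keeping the read/score/merge logic in a plain function makes it possible to test it outside Shiny with a local file.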

I've partially accomplished this here: https://credi.shinyapps.io/credi/ It works with many of my files, but I'm having trouble with some (but not all) of the CSVs that I upload. Oddly, everything works perfectly in my RStudio IDE; the issues only appear once the app is deployed.

I understand that the trouble is due to the character encoding of some of the CSVs that I am uploading: the app breaks if the CSV has accented characters (e.g. á, é, Ó). Originally I had simply:

rawdat <- read.csv(inFile$datapath, header = TRUE, sep = ",")

I have tried to fix this by adding encode = "UTF-8", but now I am getting an error:

Warning: Error in gsub: input string 2 is invalid UTF-8

I assume this is related to the special characters that are not being understood when I define the encoding. Is there any workaround to this?
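
One plausible explanation, sketched below under the assumption that the problem files are Latin-1 (or a similar single-byte encoding): in `read.csv`, the `encoding` argument only marks strings as UTF-8 and does not convert them, so bytes that were never UTF-8 stay invalid. The example builds a tiny Latin-1 file by hand, shows that `validUTF8()` flags it, and converts it with `iconv()`. The file contents are made up for illustration.

```r
# Build a small CSV whose bytes are Latin-1: 0xE9 is "é" in Latin-1,
# but a lone 0xE9 byte is not a valid UTF-8 sequence.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeBin(c(charToRaw("id,name\n1,Jos"), as.raw(0xE9), charToRaw("\n")), con)
close(con)

# Declaring UTF-8 only marks the strings; the bytes are left as they are.
lines <- readLines(path, encoding = "UTF-8")
valid_before <- validUTF8(lines)  # the second line fails the check

# Re-encode explicitly, naming the encoding the bytes are really in.
fixed <- iconv(lines, from = "latin1", to = "UTF-8")
valid_after <- validUTF8(fixed)

# The file-level equivalent is to let read.csv re-encode while reading:
# rawdat <- read.csv(path, fileEncoding = "latin1")
```

This only helps when the true encoding is known; passing the wrong `from` or `fileEncoding` value produces garbled text rather than an error.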

I can't post CSVs here, but if you want a reproducible example, I am happy to provide one.

Thanks again,

Jonathan

jseiden · June 18, 2018, 8:43pm

I wanted to post an update and an additional plea for help. I've read this article: http://shiny.rstudio.com/articles/unicode.html

It's pretty clear that this is being caused by CSVs with different encodings. If a CSV is saved in one encoding but my environment expects another (UTF-8, which I believe is the default on the Shiny Server), then I am going to have issues.

Are there any generalizable solutions that allow me to 1) detect the character encoding of a CSV and 2) set my CSV to read it in said encoding?
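
A rough heuristic for both steps can be written in base R, as sketched below. The helper names `guess_csv_encoding` and `read_csv_guessing` are hypothetical, and the final fallback to Latin-1 is an assumption: once a file is not valid UTF-8, the bytes alone cannot say which single-byte encoding was used. Packages offer statistical detectors as well (`readr::guess_encoding()`, `stringi::stri_enc_detect()`), but those also return guesses, not certainties.

```r
# Hypothetical helper: guess an encoding name from the file's bytes.
guess_csv_encoding <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))

  starts_with <- function(sig) {
    length(bytes) >= length(sig) && all(bytes[seq_along(sig)] == as.raw(sig))
  }

  # Byte order marks are the only unambiguous signal.
  if (starts_with(c(0xEF, 0xBB, 0xBF))) return("UTF-8-BOM")
  if (starts_with(c(0xFF, 0xFE)) || starts_with(c(0xFE, 0xFF))) return("UTF-16")

  # Embedded NUL bytes suggest UTF-16/32 without a BOM, or a binary file.
  if (any(bytes == as.raw(0))) return("unknown")

  # Plain ASCII is also valid UTF-8, so both cases land here.
  if (validUTF8(rawToChar(bytes))) return("UTF-8")

  # Assumption: anything else is treated as Latin-1.
  "latin1"
}

# Hypothetical wrapper: read the CSV using the guessed encoding.
read_csv_guessing <- function(path) {
  enc <- guess_csv_encoding(path)
  if (enc == "unknown") stop("Could not determine the file's encoding.")
  read.csv(path, fileEncoding = enc, stringsAsFactors = FALSE)
}
```

In the app, `read_csv_guessing(inFile$datapath)` would replace the original `read.csv` call. The whole file is read into memory for the check, which is fine for small uploads but may not suit very large ones.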

jcblum · June 18, 2018, 9:13pm

I am by no means an expert in dealing with text encoding issues (it's a thorny problem and I hope somebody more knowledgeable than me will chime in!), but as a starting point the answers to this Stack Overflow question identify all the tools I know about:

In the end, I don't think there's a completely bulletproof way to do it because the file may not contain enough features to be diagnostic between possible encodings. You may be forced to fall back to rejecting some files and asking the user to re-export their CSV with a specific encoding, which I know is distasteful because it's asking way too much for some user communities. FWIW, Excel may be the source of a lot of your troubles, in which case that link has advice on how to advise your users, at least.
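
A minimal sketch of the "reject and ask for a re-export" fallback, assuming the app decides to accept only UTF-8 files. The helper name `upload_problem` and the message wording are hypothetical; the "CSV UTF-8" option named in the message is the save format offered by recent Excel versions.

```r
# Hypothetical helper: returns NULL when the file looks usable,
# otherwise a message that can be shown to the user.
upload_problem <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  if (length(bytes) == 0L) {
    return("The uploaded file is empty.")
  }
  # NUL bytes cannot be turned into an R string, so check them first;
  # the || short-circuits before rawToChar() is reached.
  if (any(bytes == as.raw(0)) || !validUTF8(rawToChar(bytes))) {
    return("This file is not UTF-8. Please re-export it as 'CSV UTF-8' and upload it again.")
  }
  NULL
}

# Inside a Shiny reactive, the message could be surfaced like this:
#   msg <- upload_problem(input$file$datapath)
#   shiny::validate(shiny::need(is.null(msg), msg))
```

Rejecting early keeps the invalid strings from reaching later steps, where they would otherwise surface as errors such as the `gsub` one above.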

jseiden · June 19, 2018, 2:33pm

Thank you for the very helpful reply! I hadn't seen that Stack Overflow thread, so thank you for pointing it out. I think you're right that we may have to reject certain types of files. I'll post back here when I figure out what works for my particular use case.

jseiden · June 29, 2018, 8:13pm

I wanted to give a brief update to this post. As @jcblum indicated, character encoding is a surprisingly tricky issue and there doesn't seem to be a foolproof, generalizable solution. This, combined with the fact that the vast majority of my anticipated users will be importing data from Excel (which is often the culprit), led me to change my strategy for accepting files. Rather than creating a data frame from a CSV using read_csv, I opted to use read_excel from the readxl package. This seems to be working!
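
A minimal sketch of what the Excel-only approach might look like. The helper name `read_excel_upload` is hypothetical, and it assumes the uploaded file's path keeps its `.xls`/`.xlsx` extension (otherwise the original file name would need to be checked instead). This likely sidesteps the encoding problem because `.xlsx` workbooks store their text as XML with a declared encoding, so nothing has to be guessed.

```r
# Hypothetical helper: accept only Excel workbooks and read the first sheet.
read_excel_upload <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (!ext %in% c("xls", "xlsx")) {
    stop("Please upload an Excel file (.xls or .xlsx).")
  }
  if (!requireNamespace("readxl", quietly = TRUE)) {
    stop("The readxl package is required to read Excel files.")
  }
  # read_excel returns a tibble; convert it so downstream code sees a data frame.
  as.data.frame(readxl::read_excel(path, sheet = 1))
}

# In the app this would replace the CSV read, for example:
#   rawdat <- read_excel_upload(input$file$datapath)
```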

Thanks again.