Character Encoding within the tidyverse

As I use R more, I seem to bump into encoding issues more and more often, and I would like to help fix some of them within the tidyverse. However, I am not sure what the de facto standard is for handling these issues, or what the proper way to fix them would be, and they can be very troublesome. Is rlang meant to help alleviate these issues from within R, or should a PR rely on stronger tools like stringi instead?

Good question, and I'm not sure of the answer. @jennybryan, do you know if there's a simple rule for this?

@pgensler, all of RStudio is about to be in one place at the same time for a week and change, and the lower/higher-language-fix questions are definitely on my list of things to investigate.

As a Windows R user (I guess you are too?), I also run into encoding problems on a regular basis, and I have submitted the odd bug report regarding character encoding within the tidyverse, e.g. here and here. In my experience, most issues can be alleviated either by converting strings to UTF-8 with enc2utf8 or by explicitly declaring a string as UTF-8 with Encoding(string) <- "UTF-8" (this covers most of the first and the second example above). Others require working around or fixing issues in base R, e.g.

In general I'd say that the proper way to fix it depends on the issue. If you are unsure, open an issue report and discuss with the maintainer before making a pull request. Do you have any examples?
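To make the difference between the two approaches concrete, here is a minimal sketch in base R. The strings are built from raw bytes so the example does not depend on how this post itself is encoded; the variable names are only illustrative. As far as I understand it, enc2utf8 actually converts the bytes, while Encoding<- only changes the label and leaves the bytes alone, so it is only appropriate when the bytes are already valid UTF-8.

```r
# "café" stored as latin1: the "é" is the single byte 0xE9.
latin1_str <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
Encoding(latin1_str) <- "latin1"

# Approach 1: convert to UTF-8. The bytes change and the result is
# marked as UTF-8.
converted <- enc2utf8(latin1_str)
Encoding(converted)
charToRaw(converted)

# Approach 2: the bytes are already UTF-8 ("é" is 0xC3 0xA9) but the
# string is not marked as such. Declaring the encoding changes only the
# mark, not the bytes.
utf8_bytes <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))
declared <- utf8_bytes
Encoding(declared) <- "UTF-8"
Encoding(declared)
charToRaw(declared)
```

Both routes end up with the same bytes and the same mark here, but declaring UTF-8 on the latin1 bytes from the first string would produce an invalid string rather than a converted one.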

Yeah, issues like this are some of the pains I've dealt with:

and

https://forum.posit.co/t/split-uneven-length-vectors-to-columns-with-tidyr/
which are good examples. The second illustrates just how bad it gets, as you really need to share the byte sequence to see the issue, not just the string itself.
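For anyone wondering what sharing the byte sequence looks like in practice, here is a rough sketch using only base R. The two strings are made up for illustration: they print as the same character but are stored differently, which is exactly the kind of thing that gets lost when you paste the string itself into a post.

```r
# Two strings that both display as "é" but are stored differently.
as_latin1 <- rawToChar(as.raw(0xe9))
Encoding(as_latin1) <- "latin1"

as_utf8 <- rawToChar(as.raw(c(0xc3, 0xa9)))
Encoding(as_utf8) <- "UTF-8"

# The declared encodings and the underlying bytes reveal the difference.
Encoding(c(as_latin1, as_utf8))
charToRaw(as_latin1)
charToRaw(as_utf8)

# dput() on the raw vector gives something that can be pasted into a
# bug report and rebuilt exactly on the other side.
dput(charToRaw(as_latin1))
rebuilt <- rawToChar(as.raw(0xe9))
```

Including the output of Encoding() alongside the bytes is useful too, since the same bytes can behave differently depending on how they are marked.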

I'm just curious if we should be using something like TERR, or if RStudio plans to implement its own flavor of R to solve these issues.

With apologies for vagueness, I think the rule is "UTF-8 All The Things".

Now that doesn't tell a user or developer exactly what to do, but the overall current of development is to push everything towards UTF-8. I would love for us to develop and share guides at some point about how to implement this principle in, say, your own package. I need this guide myself! But that's just a goal at this point.

Re. the aforequoted:

That was in response to your listing:

#want to unnest list to chr vector
options are:
  -flatten()
  -unnest()
  -unlist()
  -squash()
  -anything in purrr ?

@mara is it possible to get some clarity around when we should be using the above functions for what, and when?

I then disambiguated those functions, after conferring with @hadley and @jennybryan

I understand that said disambiguation wasn't the totality of your problem there, but I'm unsure as to why you're quoting that one line from me in re. this issue…

@dpprdan Yes, for work, but I've run into issues where I end up having to use iconv before I can even import the file into R.
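In case it helps others, the same kind of conversion can sometimes be done from inside R with the base iconv() function instead of a separate step before import. This is only a sketch: the temporary file and the assumption that the source is latin1 are made up for the example, and in real cases you have to know (or guess) the original encoding.

```r
# Write a small latin1-encoded file: "café" followed by a newline.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), tmp)

# readLines() reads the bytes as they are; they are not valid UTF-8.
raw_lines <- readLines(tmp)
validUTF8(raw_lines)

# Convert from the (assumed) source encoding to UTF-8.
fixed_lines <- iconv(raw_lines, from = "latin1", to = "UTF-8")
validUTF8(fixed_lines)
charToRaw(fixed_lines)

unlink(tmp)
```

If the guess about the source encoding is wrong, iconv() will still return something, just not the text you wanted, so it is worth checking a few known strings after converting.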

@mara I wasn't trying to quote you, just link to the thread. Should we be using specific packages to help deal with these issues? This package seems tremendously helpful: https://github.com/patperry/r-utf8, but adding that burden onto every other tidyverse package is no small matter, so what is the preferred solution? Build a new version of R and host it on GitHub, or build it into rlang? I would imagine there are enough complaints from different users that it may warrant making RStudio's own flavor of R, but I could be wrong.

It generally seems to me that a fork of R would be antithetical to RStudio's vision/mission of being a part of / supporting the R community. Forking can contribute to division (i.e. now there are two "masters") and can isolate development (especially when the change is as core as encoding). I expect that there are other more collaborative ways that the goal will be realized. Then again, I mostly avoid Windows for this and many other reasons :slight_smile:

This comic makes the notion clear in a comedic way :slight_smile:

More reading on the general problems with forking (although I must admit I have not read the articles completely):

https://mako.cc/writing/to_fork_or_not_to_fork.html
