I ran into an encoding problem with readr, triggered by the release of R 3.5.0, and tried to fix it on the Rcpp side.
tidyverse:master ← yutannihilation:use-ce-native-for-path
opened 10:56AM - 01 May 18 UTC
fixes #834, fixes #837
As described in #834, the behavior of `normalizePath(… )` (more precisely, `path.expand()`, which is used inside `normalizePath()`) has changed: it now returns a string encoded in UTF-8. I guess this happens only on Windows in non-UTF-8 locales.
R 3.4.4 on Windows 10:
``` r
Encoding(normalizePath("~/鬼"))
#> [1] "unknown"
```
R 3.5.0 on Windows 10:
``` r
Encoding(normalizePath("~/鬼"))
#> [1] "UTF-8"
```
This usually isn't a problem: in R code, most functions take care of the encoding automatically. In C++ code, however, the encoding attribute is lost when the `String` is converted to `std::string`. So we need to convert it to the proper encoding before passing it to functions that take `std::string`, in this case `boost::interprocess::file_mapping()` here:
https://github.com/tidyverse/readr/blob/6f0bb65296afa55709fd60cdc5d59a4c89623e36/src/SourceFile.h#L19-L20
It seems `boost::interprocess::file_mapping()` takes the path string encoded in the **native locale** (`CE_NATIVE`). So we need to ensure the string is encoded properly, instead of blindly passing along the one received from the R session.
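To make the byte-level issue concrete, here is a minimal, self-contained C++ sketch that does not use R or Rcpp. It assumes, purely for illustration, a Latin-1 native encoding (the report above involved a Japanese Windows locale), and `utf8_to_latin1()` is a hypothetical helper standing in for a real re-encoding step. The point is only that `std::string` is a byte container with no encoding tag, so UTF-8 bytes must be re-encoded before they reach an API that expects native bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of R, Rcpp, or readr): re-encode UTF-8 bytes
// as Latin-1. Code points above U+00FF cannot be represented and become '?'.
std::string utf8_to_latin1(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if (b < 0x80) {
            // ASCII bytes are identical in both encodings.
            out.push_back(static_cast<char>(b));
        } else if ((b & 0xE0) == 0xC0 && i + 1 < utf8.size()) {
            // Two-byte sequence: decode the code point from both bytes.
            const unsigned char next = static_cast<unsigned char>(utf8[i + 1]);
            const unsigned cp = ((b & 0x1Fu) << 6) | (next & 0x3Fu);
            out.push_back(cp < 0x100 ? static_cast<char>(cp) : '?');
            ++i;
        } else {
            // Longer (or malformed) sequence: not representable in Latin-1.
            out.push_back('?');
            while (i + 1 < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80) {
                ++i;  // skip the continuation bytes of this sequence
            }
        }
    }
    return out;
}
```

For example, `"caf\xC3\xA9"` (UTF-8) and `"caf\xE9"` (Latin-1) both spell "café", yet they compare unequal as `std::string` values until the first is passed through `utf8_to_latin1()`.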
IIUC, Rcpp doesn't have a function to do this, so `Rf_translateChar()` is the choice in this case. (I'm not an Rcpp expert, so please correct me if I'm wrong on this point.)
But I couldn't find a neat way to convert the `Rcpp::String` to `std::string` with the specified encoding. Base R has `Rf_translateChar()`. Does Rcpp have a corresponding function for this?
A more general question I want to ask: are there any good resources on how to work with character encodings in Rcpp/C++/C? I always run into these readings:
Any suggestions are welcome!
No, there is no corresponding Rcpp function; you just use the C API in this case on the underlying `CHARSXP`.
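A minimal sketch of that suggestion, assuming it is compiled through Rcpp (for example with `Rcpp::sourceCpp()`). The function name `native_path()` is hypothetical and this is not necessarily how the readr PR does it; the sketch only combines `Rcpp::String::get_sexp()` with R's `Rf_translateChar()`.

```cpp
#include <Rcpp.h>
#include <string>

// Hypothetical helper: return the bytes of `path` in the native encoding,
// ready to hand to an API that takes a plain std::string.
// [[Rcpp::export]]
std::string native_path(Rcpp::String path) {
  // get_sexp() exposes the underlying CHARSXP, which still carries the
  // encoding mark (for example UTF-8) that a bare std::string would drop.
  SEXP charsxp = path.get_sexp();

  // Rf_translateChar() re-encodes the CHARSXP into the native encoding.
  // R manages the returned buffer, so copy it into a std::string right away.
  return std::string(Rf_translateChar(charsxp));
}
```

The design choice here is to translate as late as possible, right at the boundary where the string leaves R's world, so that the encoding mark is still available when the conversion happens.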
Thanks, I'll try to get used to the C API.