github actions failing a test on windows that passes locally - encoding issue

I have a test that is failing on Windows on GitHub Actions, but passing on Windows locally and on all other platforms.

I have created a minimal reproducible example in this repo, and you can see the failing action here.

The test does the following:

  1. imports a data frame from an Excel file; the data contains non-ASCII characters and looks like this:

     V1
     Test
     Tešt
     čšž

  2. uses the function codes() to recode the variable:
codes <- function(x){
  switch(x,
         "Test" = 1,
         "Te\u0161t" = 2,
         "\u010d\u0161\u017e" = 3,
         stop(paste0("nope, isn't working for", x)))
}

with the error "Caused by error in codes(): nope, isn't working forc��" (sorry, there's a missing space there). So it is failing on the third row, but I don't know how to debug this any further.

NB: I have to use the unicode codes because R CMD check won't allow non-ascii characters in the package.

How do you create these strings? My guess is that they are in the native encoding, which is latin1 on R 4.1.x Windows that you are using on GHA. You are probably using R 4.2.x locally, which is UTF-8 by default.

I think a good fix is to create UTF-8 strings, even on non-UTF-8 platforms. If that's possible.
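For example (a sketch; enc2utf8() converts a character vector to UTF-8 and declares it as such, whatever the native encoding is):

```r
x <- c("Test", "Te\u0161t", "\u010d\u0161\u017e")  # hypothetical stand-in for the values read from Excel
x <- enc2utf8(x)   # re-encode to UTF-8 regardless of the native encoding
Encoding(x)        # "UTF-8" for the non-ASCII strings ("unknown" for pure ASCII)
```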

The strings were manually entered into Excel. And I am using the same 4.1.2 version of R both locally and on the runners on GitHub. But this would make sense, then — that it's an issue with the Excel encoding; I can try checking that, thanks!

No, the Excel file is encoded with UTF-8, so that isn't the issue, I'm afraid.

Well, that does not necessarily mean that the strings that you read from Excel are UTF-8 in R. So please check if they are, both locally and on GitHub.

Wow, this encoding stuff is really just wow.. :sweat_smile:

I'm not sure this is what you meant, but I added two print statements to the test suite and they print out the following - reminder, the character vector is "Test", "Tešt", "čšž" - (and you can see the action on gh here):

declared encodings with Encoding():
[1] "unknown" "UTF-8"   "UTF-8"  

detected encodings with stringi:
[[1]]
    Encoding Language Confidence
1 ISO-8859-1       en       0.60
2 ISO-8859-2       ro       0.60
3      UTF-8                0.15
4   UTF-16BE                0.10
5   UTF-16LE                0.10

[[2]]
  Encoding Language Confidence
1    UTF-8                 0.8
2 UTF-16BE                 0.1
3 UTF-16LE                 0.1
4  GB18030       zh        0.1
5   EUC-JP       ja        0.1
6   EUC-KR       ko        0.1
7     Big5       zh        0.1

[[3]]
      Encoding Language Confidence
1 windows-1252       no       0.85
2        UTF-8                0.80
3     UTF-16BE                0.10
4     UTF-16LE                0.10
5    Shift_JIS       ja       0.10
6      GB18030       zh       0.10
7         Big5       zh       0.10

Was that what you meant @Gabor or is there some other way I can check?

Yeah, that's what I meant. So the declared encoding is UTF-8, but the strings might not be UTF-8. You can use stringi::stri_enc_isutf8() to check if they really are or not. Encoding detection is ambiguous; the same byte sequence can be valid in multiple encodings.
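A quick way to run that check (a sketch; `x` stands in for the column read from Excel):

```r
library(stringi)

# hypothetical stand-in for the column read from Excel
x <- c("Test", "Te\u0161t", "\u010d\u0161\u017e")

Encoding(x)          # declared encodings; ASCII-only strings show as "unknown"
stri_enc_isutf8(x)   # TRUE where the underlying bytes are valid UTF-8
```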

If they are not UTF-8, then the question is why aren't they. If they are marked UTF-8, but they are not in fact UTF-8, that certainly seems like a bug.

Btw. I don't really understand how this is different locally and on GH. I see it locally as well.

OK, I've added the check whether it's UTF-8 and I get [1] TRUE TRUE TRUE
both locally and on gh-actions.

I'm getting more and more confused, one thing is the declared encoding, but what do you mean by the actual encoding? I understand if there is no declared encoding there is ambiguity, but where are these declarations happening (or getting lost)?
And what do you mean when you say you see it locally as well, you mean the test is also failing for you locally?

Just to reiterate, all the declared and detected encoding outputs are the same for me locally as well as on gh-actions, but the test is passing locally and failing on gh-actions.

Aha, I may be on to something: I also tried changing the R version on the runner to 4.2.2, since 4.2 is meant to support native UTF-8 encoding (Upcoming Changes in R 4.2 on Windows - The R Blog) and now the check is failing for a different reason!!

❯ checking R files for syntax errors ... WARNING
  Warnings in file 'R/hello.R':
  unable to translate 'Te<U+0161>t' to native encoding
  unable to translate '<U+010D><U+0161><U+017E>' to native encoding

Yes, I see this locally, on Windows and R 4.1.3 with encoding:

> l10n_info()
$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] TRUE

$codepage
[1] 1252

$system.codepage
[1] 1252

My guess is that you don't see it, because your native encoding is different.

More importantly, my other guess is that the issue is that switch() (or parse() when your file is parsed?) converts the strings into the native encoding, and that is different for you than the one on GitHub. (And I have the same native encoding as GitHub.)

E.g. this gives me "not ok":


x2 <- "\u010d\u0161\u017e"

res <- switch(x2,
  "\u0161" = "ok",
  "\u010d\u0161\u017e" = "ok2",
  "not ok"
)

print(res)

Of course this means (assuming I am right) that you cannot use non-ASCII strings in switch(). You'll need to use the old-school if (...) ... else if (...) ... else ... construct, or match(), or something else.
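A match()-based replacement could look like this (a sketch, reusing the mapping from the codes() example above):

```r
# lookup via a named vector instead of switch()
codes <- function(x) {
  map <- c("Test" = 1, "Te\u0161t" = 2, "\u010d\u0161\u017e" = 3)
  idx <- match(x, names(map))
  if (is.na(idx)) {
    stop(paste0("nope, isn't working for ", x))
  }
  unname(map[idx])
}
```

Unlike switch(), which matches x against the call's argument names (and those get translated to the native encoding), match() compares ordinary character values, so it should behave the same across locales.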

Yes, my local native encoding is different:

> l10n_info()
$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] FALSE

$codepage
[1] 1250

$system.codepage
[1] 1250

You are right about switch() being the culprit (or at least one of them). If I use a character that is not in the 1250 code page, the switch doesn't work, but an ifelse() does:

x2 <- "\u010d\u0161\u00c5"

res <- switch(x2,
              "\u010d\u0161\u00c5" = "ok2",
              "not ok"
)

> print(res)
[1] "not ok"
> ifelse(x2 == "\u010d\u0161\u00c5", "ok2", "not ok2")
[1] "ok2"

But this is still so confusing. I'm going to check what the encoding is meant to be on the GitHub runner, but how do I change it? I mean switching to R 4.2 changed something, but I don't even understand what :sweat_smile:.

And if 4.2 supports native UTF-8, why do I get unable to translate 'Te<U+0161>t' to native encoding? Why couldn't it translate <U+0161> to UTF-8?

And how am I supposed to solve this? I need non-ASCII characters in my code, and I want to do the R CMD check on github actions, surely these two things cannot be incompatible?

The actual switch that I have has 16 cases, I'm not keen on rewriting it as an ifelse.. :sob:

OK, so on the GitHub runners, the Windows one (which is still failing) has the following native encoding (output of l10n_info()):

  $MBCS
  [1] TRUE
  
  $`UTF-8`
  [1] TRUE
  
  $`Latin-1`
  [1] FALSE
  
  $codepage
  [1] 65001
  
  $system.codepage
  [1] 65001

This is the output on the macOS and Ubuntu runners, where the switch code works fine:

  $MBCS
  [1] TRUE
  
  $`UTF-8`
  [1] TRUE
  
  $`Latin-1`
  [1] FALSE
  
  $codeset
  [1] "UTF-8"

The 65001 code page is apparently just UTF-8 on Windows, so I still don't get why I'm getting the unable to translate warning?

I am sorry to say, but you can't use switch() with non-ASCII strings, even if you find a workaround on GHA, because your users might have different native encodings. You can maybe use match().

OK, I have rewritten the function to use match() instead of switch(), and it is now passing, thanks for that, Gabor.

Still, this leaves me unsatisfied: all the runners on GHA had UTF-8 as their local encoding, but it was only the Windows one where the code didn't work. And I still don't understand why.

Well, your test does pass on R 4.2.x on Windows; there is only a warning coming from R CMD check, and that probably happens because R CMD check runs that particular check in the C locale, so only ASCII function argument names are allowed. I haven't really checked this in depth, but I am fairly sure that this is what happens. E.g. my native encoding is UTF-8 by default, so this works:

❯ R -q -e 'switch("\u010d", "\u010d" = "match", "not match")'
> switch("\u010d", "\u010d" = "match", "not match")
[1] "match"

But if I switch to the C locale, then it does not:

❯ LANG=C R -q -e 'switch("\u010d", "\u010d" = "match", "not match")'
> switch("\u010d", "\u010d" = "match", "not match")
[1] "not match"
Warning message:
unable to translate '<U+010D>' to native encoding

Whether this is a bug in R CMD check, I am not sure. Once we only need to support UTF-8 systems (if that ever happens), I guess we can "fix" R CMD check to run in a UTF-8 locale.
