strsplit giving NA's

I uploaded a basic text file to RStudio Cloud. Whenever I use strsplit() on the text, I get NAs for some words, which never happened in the desktop version. What could be the reason? Because of this, I cannot get any output for my text mining.

Cloud output - "CHAPTER" "I" NA NA "All" "her"

Desktop output - "CHAPTER" "I" "“Well," "Prince," "so" "Genoa"

library(magrittr)  # provides the %>% pipe used below

sherlock <- readLines('war.txt')

text_sherlock <- sherlock %>% 
  strsplit(" ") %>% 
  unlist()

I am getting these warnings in the console:

Warning messages:
1: In strsplit(., " ") : input string 2 is invalid in this locale
2: In strsplit(., " ") : input string 4 is invalid in this locale
3: In strsplit(., " ") : input string 8 is invalid in this locale
4: In strsplit(., " ") : input string 10 is invalid in this locale
5: In strsplit(., " ") : input string 12 is invalid in this locale

It's most likely an encoding issue. Can you determine the encoding of war.txt?
Saving it explicitly as UTF-8 might fix the issue.
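
One way to check this from R itself is to ask which lines are not valid UTF-8. The sketch below is only an illustration: it builds a tiny stand-in file from raw bytes instead of using the real war.txt, and it assumes the curly quote was saved as the single Windows-1252 byte 0x93.

```r
# Stand-in for war.txt: two lines written as raw bytes.
# 0x93 is the Windows-1252 byte for the left curly quote (assumed encoding).
path <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("CHAPTER I\n"), as.raw(0x93), charToRaw("Well, Prince, so Genoa\n")),
  path
)

lines <- readLines(path, warn = FALSE)

# validUTF8() inspects the bytes of each string, independent of the locale.
ok <- validUTF8(lines)
bad_lines <- which(!ok)
print(bad_lines)
```

If this reports any line numbers for the real file, those lines contain bytes that a UTF-8 session cannot interpret, which would match the "invalid in this locale" warnings.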

Can you share the text inside war.txt?

Also, copy the output of running locale() in your desktop console. You probably have a different encoding than the one used in RStudio Cloud, which would explain the difference.
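
Note that locale() here appears to be the readr function, which reports readr's parsing defaults and not necessarily the encoding of the R session. As a hedged alternative, the base R calls below show what the session itself uses; running them on both the desktop and RStudio Cloud should make any difference visible.

```r
# Base R only: report the character encoding settings of the current session.
ctype <- Sys.getlocale("LC_CTYPE")  # locale name; it may mention UTF-8 or a code page
info  <- l10n_info()                # list with logical flags such as `UTF-8` and `Latin-1`

cat("LC_CTYPE:", ctype, "\n")
cat("Session is UTF-8:", info[["UTF-8"]], "\n")
```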

I am getting the same error after saving the file as UTF-8. The encoding is ANSI.
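
If the file really is "ANSI", one possible workaround is to convert the lines to UTF-8 inside R before splitting. This is a minimal sketch, assuming that "ANSI" means Windows-1252 (CP1252) here, which is common on Western European Windows but not guaranteed. The file is a small stand-in built from raw bytes, not the real war.txt.

```r
# Stand-in for war.txt, written as Windows-1252 bytes (assumed encoding).
# 0x93 is the CP1252 byte for the left curly quote.
path <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("CHAPTER I\n"), as.raw(0x93), charToRaw("Well, Prince, so Genoa\n")),
  path
)

raw_lines <- readLines(path, warn = FALSE)

# Re-encode from the assumed source encoding to UTF-8 before any string work.
utf8_lines <- iconv(raw_lines, from = "CP1252", to = "UTF-8")

# Same split as in the question, now on valid UTF-8 input.
words <- unlist(strsplit(utf8_lines, " "))
print(words)
```

With the real file, replace path with 'war.txt'. If iconv() returns NA for some lines, the source encoding is probably not CP1252 and a different value for from is needed.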

Text from "War and Peace":

CHAPTER I
“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”

It was in July, 1805, and the speaker was the well-known Anna Pávlovna Schérer, maid of honor and favorite of the Empress Márya Fëdorovna. With these words she greeted Prince Vasíli Kurágin, a man of high rank and importance, who was the first to arrive at her reception. Anna Pávlovna had had a cough for some days. She was, as she said, suffering from la grippe; grippe being then a new word in St. Petersburg, used only by the elite...

locale() from my desktop:

Numbers:  123,456.78
Formats:  %AD / %AT
Timezone: UTC
Encoding: UTF-8
Days:     Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday (Thu), Friday (Fri), Saturday (Sat)
Months:   January (Jan), February (Feb), March (Mar), April (Apr), May (May), June (Jun), July (Jul), August (Aug), September (Sep), October (Oct), November (Nov), December (Dec)
AM/PM:    AM/PM

I tried to reproduce your error by creating a .txt file directly in RStudio Cloud and pasting your text into it. With that file your code works fine for me, so the problem is almost certainly the encoding of your file.

If you are editing your file on Windows, the error may have something to do with this:

https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/index.html

Can you please share how to create a .txt file within RStudio Cloud (without uploading it)?

Sure! Click the first icon under the File menu (an empty page with a green plus sign). In the dropdown menu you will find "Text File", which creates a new file with the .txt extension.
