Text mining: R limit on length of character variables?

Hi All,

For a text mining task I need to extract topics from some emails. The corpus of documents comes from HTML code, and the data are stored in a Cloudera big data environment. The problem arises when I import the HTML field into R: R truncates the string column, so I can only read part of each document's text.

Is there a length threshold for character variables in R? Is there a length threshold in RStudio? Is there a way to change this threshold?

Alternatively, I could parse the HTML with big data components such as Hive or Spark and import only the term-document matrix into R for analysis, but parsing text without R is tricky for me and would take a long time.

Can anyone help me?
Thanks in advance.
Have a nice day.

MC

I can make VERY long single strings in code without any problem:

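library(stringr)
# Build a vector of 100,000 copies of "ABC"
aa <- rep("ABC", times = 100000)
# Collapse the vector into one single string
aa <- str_c(aa, collapse = "")
aa
length(aa)  # 1: a single character element
nchar(aa)   # 300000 characters in that element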

This suggests to me the problem is with the import of the HTML.
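
For reference, R's documented limit for a single string is 2^31 - 1 bytes (see ?"Memory-limits"), far more than any email. One way to check whether something upstream is capping the text is to look at the lengths of the imported strings: if several rows share exactly the same maximum length, a cap in the driver or the query layer is a likely suspect. Here is a rough sketch using made-up data; the 255-character cap and the variable names are assumptions for illustration only, not something taken from your setup.

```r
# Hypothetical documents of different lengths, standing in for the HTML field
full_docs <- strrep("x", c(120, 300, 900, 5000))

# Simulate an import layer that cuts strings at 255 characters (assumed cap)
imported <- substr(full_docs, 1, 255)

# Inspect the lengths of what actually arrived in R
lens <- nchar(imported)
max_len <- max(lens)
n_at_max <- sum(lens == max_len)

# Several rows with exactly the same maximum length hints at an upstream cap
suspect_cap <- n_at_max > 1
```

On your real data you would replace `imported` with the column you pulled from the cluster. If the lengths pile up at one value, the place to look is the connection or driver settings rather than R itself.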

HTH

Hi

We don't really have enough info to help you out. Could you ask this with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.

If you've never heard of a reprex before, you might want to start by reading this FAQ:
