For a text mining task I need to extract topics from some emails. The documents in my corpus come from HTML code, and the data are stored in a Cloudera Big Data environment. The problem arises when I import the field containing the HTML code into R: R truncates the string column, so I can only read part of each document's text.
Is there a length limit for character variables in R? Is there a length limit in RStudio? Is there a way to change this limit?
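To make the question concrete, here is a minimal sketch of how I would check whether a value is really shortened in memory or only shortened when displayed. The variable `html_body` and its content are made up for illustration; they are not my real data.

```r
# Hypothetical stand-in for one imported HTML field (not real data).
html_body <- paste(rep("<p>some email text</p>", 5000), collapse = "")

# Number of characters R actually holds in memory for this value.
stored_length <- nchar(html_body)

# A shortened preview of the same value, similar to what a viewer might show.
preview <- substr(html_body, 1, 60)

print(stored_length)
print(nchar(preview))
```

If `nchar()` on the imported column reports the full length, the truncation would only be in the display; if it reports a short length, the text was already cut during the import.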
Alternatively, I could parse the HTML with a Big Data environment component such as Hive or Spark and import only the term-document matrix into R for analysis, but parsing text without R is tricky for me and would take a long time.
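For reference, this is roughly what I would like to be able to do in R once the full text is available. It is only a sketch in base R on two made-up documents: the names `docs` and `strip_html` are hypothetical, and the regular expression is a crude way to drop tags, not a proper HTML parser.

```r
# Hypothetical example documents (not from the real corpus).
docs <- c(doc1 = "<html><body><p>Meeting moved to Monday</p></body></html>",
          doc2 = "<div>Monday <b>budget</b> meeting</div>")

# Crude tag removal with a regular expression, then whitespace cleanup
# and lowercasing. A real HTML parser would be more robust.
strip_html <- function(x) {
  x <- gsub("<[^>]+>", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(tolower(x))
}

clean <- strip_html(docs)

# Split each document into terms on single spaces.
tokens <- strsplit(clean, " ", fixed = TRUE)
names(tokens) <- names(docs)

# Shared vocabulary across all documents.
vocab <- sort(unique(unlist(tokens)))

# Term-document matrix: rows are terms, columns are documents.
tdm <- sapply(tokens, function(tok) table(factor(tok, levels = vocab)))
print(tdm)
```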
Can anyone help me?
Thanks in advance.
Have a nice day.