Error in tokenize(reference, what = c("word")) : unused argument (what = c("word"))

I'm really new to R, I'm sorry!

I'm trying to tokenize a txt file by word and I'm getting errors with every command I try, but this one is the most infuriating and nonsensical. I'm trying a simple-looking

> tokenwords <- tokenize(reference, what = c("word"))

which looks like a basic R command I got here. I understand that if I leave out the "what" part it should default to tokenizing by word, but that's not the case: it tokenizes by line, which is how the data was set up beforehand.

Why am I getting the error below, then? What does "unused argument" even MEAN?

Error in tokenize(reference, what = c("word")) : 
  unused argument (what = c("word"))

I've also tried unnest_tokens and the tokenizers package, and they give me different, unexplained errors as well; I just thought I'd pick the simplest-looking one.
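For what it's worth, "unused argument" is R's way of saying that the function you actually called has no parameter with that name. A throwaway illustration with a made-up function f:

f <- function(x) x + 1
f(1, what = "word")
Error in f(1, what = "word") : unused argument (what = "word")

So the message usually means the function being run is not the one you expect.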

What is the result of

getAnywhere (tokenize) 

?
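Two related base R checks, if they're easier to read (the output will depend on which packages you happen to have attached):

find("tokenize")           # e.g. "package:readr", plus "package:quanteda" if that is also attached
conflicts(detail = TRUE)   # lists every object masked by another attached package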

> getAnywhere (tokenize) 
A single object matching ‘tokenize’ was found
It was found in the following places
  package:readr
  namespace:readr
with value

function (file, tokenizer = tokenizer_csv(), skip = 0, n_max = -1L) 
{
    ds <- datasource(file, skip = skip, skip_empty_rows = FALSE)
    tokenize_(ds, tokenizer, n_max)
}
<bytecode: 0x000002c17fe1b3e8>
<environment: namespace:readr>

(Looks like it's indeed from a specific library and not from R, sorry!)

That link suggests that the quanteda::tokenize function is deprecated, which may be a problem. Also, depending on which packages you load, and in what order, the readr package has a tokenize function of its own which may have masked the quanteda one.
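That would also explain the "unused argument" message: readr::tokenize() only has file, tokenizer, skip and n_max parameters, so what = simply isn't one of its arguments. If quanteda is the package you meant to use, one way around the masking is to call it with an explicit namespace prefix. A minimal sketch, assuming quanteda is installed and reference is a character vector (tokens() is the current replacement for the deprecated tokenize()):

library(quanteda)
tokenwords <- quanteda::tokens(reference, what = "word")   # word-level tokens
tokenwords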

I think we need more of your code and some sample data.

I'm trying to follow this sentiment analysis tutorial with those same files (here; I've been trying it with the first one, by Adams), except pretty early on I hit

# tokenize
tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)

and that's deprecated; the warning says to use tibble() instead (there's a sketch of the tibble() version after the list below). So, upon getting there, I've tried

  1. tokenize (apparently from readr), which, if I leave out the "what" part, simply splits by sentences
  2. tidytext's unnest_tokens, which gives me the error below
Error in check_input(x) : 
  Input must be a character vector of any length or a list of character
  vectors, each of which has a length of 1.
  3. the tokenizers package's tokenize_words, which gives me the warning below and oddly seems to tokenize spaces as their own words, which I could probably clean up but I'm not entirely sure how. [ETA: it also puts all the tokens on one single line, even though the original data is spread over several lines]
Warning message:
In stri_replace_all_regex(x, c("[<U+FE00>-<U+FE0F>]", "\\s[`-<U+036F>]"),  :
  argument is not an atomic vector; coercing
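For reference, since both of the tidy-style messages point at the shape of the input rather than at the tokenizing itself, here is a minimal sketch of that step, assuming fileText comes straight from readLines() (the file name is only a placeholder for whichever speech was downloaded). tibble() stands in for the deprecated data_frame(), and the is.character() check is aimed at the check_input() error, which typically means the text column is a list rather than a plain character vector:

library(tibble)
library(dplyr)
library(tidytext)

fileText <- readLines("adams.txt")                 # placeholder path
fileText <- unlist(fileText, use.names = FALSE)    # flatten it in case it came in as a list
is.character(fileText)                             # should be TRUE before unnest_tokens()

tokens <- tibble(text = fileText) %>%              # tibble() replaces data_frame()
  unnest_tokens(word, text)
head(tokens)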

I had a quick look at that tutorial last night and again today and it looks rather overly complicated for someone just getting into R.

I know essentially zero about text analysis but a bit about R. What is your level of experience with text analysis?

This is just a guess, but I'd suggest sticking with one package, probably tidytext, as it is the one most likely to be used by forum participants who may be able to help.

Read up a bit on tidytext and work through a couple of examples to get a feeling for what it can do. Text Mining with R looks like a good place to start.

If you have not already seen it, the CRAN Task View: Natural Language Processing might be of general interest.

In the meantime, did I understand that you successfully converted that data.frame to a tibble? If so, could you post the tibble here? Probably the easiest and most convenient way would be to paste it in dput() format. See ?dput for more information.
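For example (my_tibble here is just a placeholder name for whatever object you ended up with):

dput(head(my_tibble, 20))

and then paste the printed output into your reply.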

If anyone here wants to provide some advice, it is extremely useful to be working from the same data.

Best wishes.
