Error in tokenize(reference, what = c("word")) : unused argument (what = c("word"))

I'm really new to R, I'm sorry!

I'm trying to tokenize a txt file by word and I'm getting errors with every command I try, but this one is the most infuriating and nonsensical. I'm trying a simple-looking

> tokenwords <- tokenize(reference, what = c("word"))

which looks like a basic R command I got here. I understand that if I leave out the "what" part it should default to tokenizing by word, but that's not the case: it tokenizes by line, which is how the data was set up beforehand.

Why am I getting the error below, then? What does "unused argument" even MEAN?

Error in tokenize(reference, what = c("word")) : 
  unused argument (what = c("word"))

I've also tried unnest_tokens and the tokenizers package, and they give me different, unexplained errors as well; I just thought I'd pick the simplest-looking one.
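For what it's worth, "unused argument" is R's way of saying that the function you actually called has no parameter with that name. A throwaway illustration with a made-up function f:

f <- function(x) x + 1
f(1, what = "word")
Error in f(1, what = "word") : unused argument (what = "word")

So the message usually means the function being run is not the one you expect.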

What is the result of

getAnywhere (tokenize) 

?
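Two related base R checks, if they're easier to read (the output will depend on which packages you happen to have attached):

find("tokenize")           # e.g. "package:readr", plus "package:quanteda" if that is also attached
conflicts(detail = TRUE)   # lists every object masked by another attached package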

> getAnywhere (tokenize) 
A single object matching ‘tokenize’ was found
It was found in the following places
  package:readr
  namespace:readr
with value

function (file, tokenizer = tokenizer_csv(), skip = 0, n_max = -1L) 
{
    ds <- datasource(file, skip = skip, skip_empty_rows = FALSE)
    tokenize_(ds, tokenizer, n_max)
}
<bytecode: 0x000002c17fe1b3e8>
<environment: namespace:readr>

(Looks like it's indeed from a specific library and not from R, sorry!)

That link suggests that the quanteda::tokenize function is deprecated, which may be a problem. Also, depending on which packages you load, and in what order, the readr package has a tokenize function of its own which may have masked the quanteda one.
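That would also explain the "unused argument" message: readr::tokenize() only has file, tokenizer, skip and n_max parameters, so what = simply isn't one of its arguments. If quanteda is the package you meant to use, one way around the masking is to call it with an explicit namespace prefix. A minimal sketch, assuming quanteda is installed and reference is a character vector (tokens() is the current replacement for the deprecated tokenize()):

library(quanteda)
tokenwords <- quanteda::tokens(reference, what = "word")   # word-level tokens
tokenwords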

I think we need more of your code and some sample data.

I'm trying to follow this sentiment analysis tutorial with those same files (here; I've been trying it with the first one, by Adams), except pretty early on I hit

# tokenize
tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)

and that's deprecated; the warning says to use tibble() instead (there's a sketch of the tibble() version after the list below). So, upon getting there, I've tried

  1. tokenize (apparently from readr), which, if I leave out the "what" part, simply splits by sentences
  2. tidytext's unnest_tokens, which gives me the error below
Error in check_input(x) : 
  Input must be a character vector of any length or a list of character
  vectors, each of which has a length of 1.
  3. the tokenizers package's tokenize_words, which gives me the warning below and oddly seems to tokenize spaces as their own words, which I could probably clean up but I'm not entirely sure how. [ETA: it also puts all the tokens on one single line, even though the original data is spread over several lines]
Warning message:
In stri_replace_all_regex(x, c("[<U+FE00>-<U+FE0F>]", "\\s[`-<U+036F>]"),  :
  argument is not an atomic vector; coercing
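For reference, since both of the tidy-style messages point at the shape of the input rather than at the tokenizing itself, here is a minimal sketch of that step, assuming fileText comes straight from readLines() (the file name is only a placeholder for whichever speech was downloaded). tibble() stands in for the deprecated data_frame(), and the is.character() check is aimed at the check_input() error, which typically means the text column is a list rather than a plain character vector:

library(tibble)
library(dplyr)
library(tidytext)

fileText <- readLines("adams.txt")                 # placeholder path
fileText <- unlist(fileText, use.names = FALSE)    # flatten it in case it came in as a list
is.character(fileText)                             # should be TRUE before unnest_tokens()

tokens <- tibble(text = fileText) %>%              # tibble() replaces data_frame()
  unnest_tokens(word, text)
head(tokens)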

I had a quick look at that tutorial last night and again today and it looks rather overly complicated for someone just getting into R.

I know essentially zero about text analysis but a bit about R. What is your level of experience with text analysis?

This is just a guess, but I'd suggest sticking with one package, probably tidytext, as it is the one most likely to be used by forum participants who may be able to help.

Read up a bit on tidytext and work through a couple of examples to get a feeling for what it can do. Text Mining with R looks like a good place to start.

If you have not already seen it, the CRAN Task View: Natural Language Processing might be of general interest.

In the meantime, did I understand that you successfully converted that data.frame to a tibble? If so, could you post the tibble here? Probably the easiest and most convenient way would be to paste it in dput() format. See ?dput for more information.
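For example (my_tibble here is just a placeholder name for whatever object you ended up with):

dput(head(my_tibble, 20))

and then paste the printed output into your reply.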

If anyone here wants to provide some advice, it is extremely useful to be working from the same data.

Best wishes.
