unnest_tokens problem with keyword of "R&D"

textmining

#1

For example, I would like to split “A 40-year-old R&D guy” into “A 40-year-old”, “40-year-old R&D”, “R&D guy” ONLY by space character.

But when I use unnest_tokens(ngram, txt, token = "ngrams", n = 2), the function automatically replaces & (ampersand) and - (hyphen) with a space, and the result is as below.

A 40

40 year

Year old

Old r

R D

D guy.

Please help me get word pairs that are split only on the space character.


#2

Behind the scenes, unnest_tokens with token = "ngrams" uses the tokenizers package and its tokenize_ngrams function. Its help page states:

These functions will strip all punctuation and normalize all whitespace to a single space character.

There is currently no option to configure that, so you have to use another approach.

The ngram package is one option, as it allows more customization:

txt <- "A 40-year-old R&D guy"

library(ngram)
ng <- ngram(txt, n = 2)
ng
#> An ngram object with 3 2-grams
print(ng, output = "full")
#> R&D guy | 1 
#> NULL {1} | 
#> 
#> 40-year-old R&D | 1 
#> guy {1} | 
#> 
#> A 40-year-old | 1 
#> R&D {1} |

# I use rev() to get the n-grams in the order you want
ng_string <- rev(get.ngrams(ng))
ng_string
#> [1] "A 40-year-old"   "40-year-old R&D" "R&D guy"

Created on 2019-01-09 by the reprex package (v0.2.1)

How do you use it with unnest_tokens?

You can provide a function as the token argument. This function must take a character vector and return a list with one element per input string. See ?unnest_tokens.
Here is an example:

txt <- "A 40-year-old R&D guy"

d <- tibble::tibble(txt = txt)
# a function that takes a string and returns a list holding its n-grams
ngram_string <- function(txt, n) list(unname(rev(ngram::get.ngrams(ngram::ngram(txt, n = n)))))
ngram_string(txt, 2)
#> [[1]]
#> [1] "A 40-year-old"   "40-year-old R&D" "R&D guy"
# you need to vectorize it
ngram_string_vec <- Vectorize(ngram_string, vectorize.args = "txt", USE.NAMES = FALSE)
# it works on a vector now
ngram_string_vec(c(d$txt, d$txt), n = 2)
#> [[1]]
#> [1] "A 40-year-old"   "40-year-old R&D" "R&D guy"        
#> 
#> [[2]]
#> [1] "A 40-year-old"   "40-year-old R&D" "R&D guy"

# it is ready to be applied as tokenizing function
tidytext::unnest_tokens(d, ngram, txt, token = ngram_string_vec, n = 2, to_lower = FALSE)
#> # A tibble: 3 x 1
#>   ngram          
#>   <chr>          
#> 1 A 40-year-old  
#> 2 40-year-old R&D
#> 3 R&D guy

Created on 2019-01-09 by the reprex package (v0.2.1)


#3

Thank you for your nice reply.
While applying your code, I realized that my example input was not representative.

My real input is a data frame, and I overlooked the fact that the number of rows changes after tokenization.

If the input is a data frame as below,
name <- c("John", "Edgar", "James")
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")
hr_info <- data.frame(name, desc)

can you extend the code to get the output below?
John A 40-year-old
John 40-year-old R&D
John R&D guy
Edgar no valid
Edgar valid information


#4

I think you now have all the tools, and you should try by yourself to deal with lists of different sizes.

By tweaking the ngram results in a few steps, you should be able to get it working. Otherwise you could try a list column and unnest it afterwards.
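As a rough illustration of the "list of different sizes" idea, here is a minimal base R sketch. It does not use the ngram package: space_bigrams is a hypothetical helper written for this example, it only handles bigrams, and it assumes words are separated by single spaces. The point is that lengths() tells you how many times to repeat each name.

```r
name <- c("John", "Edgar", "James")
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")
hr_info <- data.frame(name, desc, stringsAsFactors = FALSE)

# Hypothetical helper (not from any package): bigrams of one string,
# split on the space character only, so "R&D" and "40-year-old" survive.
space_bigrams <- function(s) {
  w <- strsplit(s, " ", fixed = TRUE)[[1]]
  # Pair each word with its successor; a single word yields character(0)
  paste(head(w, -1), tail(w, -1))
}

# One list element per row; the elements have different lengths
grams <- lapply(hr_info$desc, space_bigrams)

# Repeat each name once per bigram, then flatten the list
result <- data.frame(
  name = rep(hr_info$name, lengths(grams)),
  ngram = unlist(grams),
  stringsAsFactors = FALSE
)
result
```

Rows whose description has fewer than two words (here "James") produce no bigrams and therefore drop out of the result.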


#5

Hi, cderv.

I executed the code below but ended up with the error "unable to find an inherited method for function ‘ngram’ for signature ‘"factor"’", which I can't debug. That's why I asked for more help.

name <- c("John", "Edgar", "James" )
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")
hr_info <- data.frame(name, desc)
hr_info %>% tidytext::unnest_tokens(ngram2, name, token = ngram_string_vec, n = 2, to_lower = FALSE)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘ngram’ for signature ‘"factor"’


#6

Oh I see.
This is because, in R versions before 4.0.0, data.frame() converts strings to factors by default.
Use data.frame() with stringsAsFactors = FALSE, or use tibble() from the tibble package (also available through dplyr in the tidyverse).
You'll get your columns as character vectors instead of factors.
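A small sketch of both fixes, using the data from the question. hr_fct is just an illustrative name for a data frame that already contains factors.

```r
name <- c("John", "Edgar", "James")
desc <- c("A 40-year-old R&D guy", "no valid information", "nothing")

# Explicitly keep strings as character, whatever the R version's default is
hr_info <- data.frame(name, desc, stringsAsFactors = FALSE)
class(hr_info$desc)
#> [1] "character"

# If a data frame already holds factors, convert the column before tokenizing
hr_fct <- data.frame(name, desc, stringsAsFactors = TRUE)
hr_fct$desc <- as.character(hr_fct$desc)
class(hr_fct$desc)
#> [1] "character"
```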


#7

Thank you for your persistent support.

Your function works in the ideal case, but I see some problems.
I tried to validate your functions before asking the next question. I hope this will be the last one.

#1. Error when the text is a single word
When txt is a single word like "nothing", the script throws the error below:
Error in ngram::ngram(txt, n = n) : input 'str' has nwords=1 and n=2; must have nwords >= n

I could filter out single words before applying your functions, but could you make it skip such input (do nothing) instead?

#2. Results are not shown in order
When txt = "1 2 3 4 5 6 7 8 9 10", the result comes back in a seemingly random sequence, as below.

# A tibble: 9 x 1
  ngram0
1 2 3
2 6 7
3 5 6
4 3 4
5 4 5
6 9 10
7 8 9
8 1 2
9 7 8

To check that the result is what I expect, it should be in the original order, FIFO (First In, First Out).
Could you take a look at making it display in that order?


#8

I think it is simple enough for you to use an if clause, or something similar, in a custom function to handle this.
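For what it's worth, here is one way such a custom function could look. This is only a sketch in base R, not the ngram package's API: space_ngrams is a hypothetical name, and it assumes words are separated by one or more spaces. The if clause returns an empty result when there are fewer words than n, and because the n-grams are built by walking the words from left to right, they come out in their original order.

```r
# Hypothetical helper: n-grams split on spaces only, in reading order.
# Takes a character vector and returns a list with one element per string.
space_ngrams <- function(txt, n = 2) {
  lapply(txt, function(s) {
    words <- strsplit(trimws(s), " +")[[1]]
    # The "if clause": too few words means no n-grams, not an error
    if (length(words) < n) {
      return(character(0))
    }
    # Start positions of each n-gram, left to right
    starts <- seq_len(length(words) - n + 1)
    vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
  })
}

res <- space_ngrams(c("nothing", "1 2 3 4 5 6 7 8 9 10"), n = 2)
res[[1]]
#> character(0)
res[[2]]
#> [1] "1 2"  "2 3"  "3 4"  "4 5"  "5 6"  "6 7"  "7 8"  "8 9"  "9 10"
```

Since it takes a character vector and returns a list of the same length, it should in principle be usable as the token argument of unnest_tokens (with to_lower = FALSE), though I have not checked that against every tidytext version.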

The ngram package may not deal very well with this... It is possible that the ngram function does not process the words in order, so the results are not sorted FIFO. You should dig into the ngram package, and maybe open a feature request there. You could also switch back to tidytext for text that has no punctuation characters.
You could also open an issue in tidytext to see if they could support optional punctuation removal.

I don't have all the answers for you, and I can't do it on your behalf. I think you now have all the tools.


#9

Thank you for your persistent answers.

Due to words like "40-year-old" in my real cases, I can't go back to default tidytext. But I don't have enough programming skills to write my own n-gram function, so I followed your last suggestion and created an issue on 'ngram'.

Even though I don't get ordered results, I can get more reliable results with your suggestion. Thank you!


#10

For reference, I post your issue here: