Differences between 2 ways of counting words

Hi, I'm dabbling in text mining and right now I'm trying to find an efficient way to count the number of words in a text.
I have found 2 ways that seem suitable, but they give me 2 different outputs (the difference, however, is not huge).
Does anyone know why this difference occurs?
The text I'm using is very long, so I'm going to post a small test text instead (it's in Spanish).


texto <- paste("También entiendo que como es temporada de elecciones, las expectativas para lo que lograremos este año son bajas. Aún así, señor Presidente de la Cámara de Representantes, aprecio el enfoque constructivo que usted y los otros líderes adoptaron a finales del año pasado para aprobar un presupuesto, y hacer permanentes los recortes de impuestos para las familias trabajadoras. Así que espero que este año podamos trabajar juntos en prioridades bipartidistas como la reforma de la justicia penal y ayudar a la gente que está luchando contra la adicción a fármacos de prescripción. Tal vez podamos sorprender de nuevo a los cínicos.")

FIRST WAY

library(tidytext)
library(dplyr)

apunte <- tibble(Text = texto) # tibble aka neater data frame (data_frame() is deprecated)

apunte <- apunte %>% 
  unnest_tokens(output = word, input = Text) 

apunte

apunte_test <- apunte %>%
  count(word, sort = TRUE)


SECOND WAY

library(udpipe)
library(dplyr)

ud_model <- udpipe_download_model(language = "spanish-gsd")

ud_model <- udpipe_load_model(ud_model$file_model)

x <- udpipe_annotate(ud_model, x = texto)
x <- as.data.frame(x)
x

x_test <- x %>%
  filter(upos != "PUNCT") %>%
  count(token, sort = TRUE)

DIFFERENCES

x_test %>%
  full_join(apunte_test, by = c("token" = "word")) %>%
  filter(n.x != n.y)

  token n.x n.y
1   así   1   2
2    de   8   7
3    el   2   1

I can't tell if the two objects being compared are the same. Have you compared them to see whether they are?

Which 2 objects are you referring to? Both 'apunte_test' and 'x_test' are the result of processing the same string ('texto' that I uploaded right before the first way).

One possible cause of the difference is that the function that produces x_test counts del as one occurrence of de and one of el. Can you test if that is true? That does not account for the different counts of así.
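The "del" hypothesis can be sanity-checked with a bit of arithmetic before digging into the udpipe output. The sketch below is base R only. The counts are copied from the DIFFERENCES table, the fragment is the part of the test text that contains its only "del", and the variable names are made up for illustration.

```r
# Counts for "de" and "el" as reported in the DIFFERENCES table
udpipe_n   <- c(de = 8, el = 2)   # n.x, from x_test
tidytext_n <- c(de = 7, el = 1)   # n.y, from apunte_test

# Fragment of the test text that contains its only "del"
fragmento <- "los otros líderes adoptaron a finales del año pasado para aprobar un presupuesto"
palabras  <- strsplit(fragmento, " ", fixed = TRUE)[[1]]
n_del     <- sum(palabras == "del")

# If every "del" is counted as one "de" plus one "el",
# both gaps should be equal to the number of "del"
udpipe_n - tidytext_n == n_del
```

If both comparisons come back TRUE, the gap in "de" and "el" is fully accounted for by the single "del", which is consistent with the hypothesis but does not prove it.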

Yes, I just checked and I see that udpipe splits the contraction "del" into its 2 components, "de" and "el", thus altering the statistics. That explains the last 2 differences. It is a big problem because it makes it impossible to trust the result for large texts, not knowing if the words are correctly identified.
And regarding the difference with the word "así", it is because tidytext converts everything to lowercase by default, while udpipe keeps the original capitalization, so "Así" and "así" are counted separately.
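For anyone who wants to see the case effect in isolation, here is a minimal sketch on a made-up sample sentence (not the full test text). It relies on the `to_lower` argument of `unnest_tokens()`, which defaults to `TRUE`. The object names are hypothetical.

```r
library(dplyr)
library(tidytext)

# Made-up sample with a capitalised "Así" and a lowercase "así"
muestra <- tibble(Text = "Así que espero. Aún así, a finales del año.")

# Default behaviour: tokens are lowercased, so both forms collapse into "así"
minusculas <- muestra %>%
  unnest_tokens(output = word, input = Text) %>%
  count(word, sort = TRUE)

# to_lower = FALSE keeps the original case, so "Así" and "así" stay separate
original <- muestra %>%
  unnest_tokens(output = word, input = Text, to_lower = FALSE) %>%
  count(word, sort = TRUE)

minusculas
original
```

To make the two approaches comparable on case, one option is to pick a single convention on both sides, for example `to_lower = FALSE` here, or `tolower()` applied to the udpipe `token` column before counting.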
