Speeding up R code

I had a go at implementing this. I generated 3000 words of lorem ipsum example text and ran my basic algorithm on it, which took about 24 seconds on my laptop. I then tried running it on 4 cores of said laptop and got it down to about 8 seconds (roughly 3x faster). Have a look!

Note: I started with quite a tidyverse-heavy approach (map_dfr building a tibble, then grouping and summarising to get the max count per group), but I found there was significant overhead in that, and it was better to use the base R aggregate() function for the slowest part of the code; that's why it appears below.
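Roughly speaking, the inner step I swapped is equivalent to the following (with dplyr and tibble loaded, as in the script below); the dplyr version is only a sketch of the kind of code I abandoned, and wordvec / grp are illustrative names for the logical word-position vector and the chunk grouping:

# base R: one call, comparatively little overhead
aggregate(wordvec, by = list(grp), max)

# tidyverse sketch of the same step (roughly what I started with)
tibble(x = wordvec, grp = grp) %>%
  group_by(grp) %>%
  summarise(x = max(x))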

library(purrr) #used for iteration
library(tibble)
library(dplyr)
library(tictoc) # only used for benchmarking the time taken for the code to run


#devtools::install_github("gadenbuie/lorem")
# use lorem ipsum words for example text
set.seed(42)
ordered_words_of_text <- lorem::ipsum_words(3000, collapse = FALSE)


num_words <- length(ordered_words_of_text)
unique_words <- length(unique(ordered_words_of_text))

split_into <- 100

split_length <- num_words / split_into



word_positions_list <- map(
  unique(ordered_words_of_text),
  ~ which(ordered_words_of_text == .)
)

new_vec <- function(vlength, vtrue_pos_list) {
  nv <- vector(length = vlength)
  nv[vtrue_pos_list] <- TRUE
  nv
}

realised_word_pos_list <- map(
  word_positions_list,
  ~ new_vec(num_words, .)
)



tictoc::tic(msg = "start sequential algorithm")
avg_red_freq <- function(wordvec, num_words, split_length, split_into) {
  # wordvec is a logical vector marking the positions of one word in the text.
  # For each of the split_length possible offsets, assign every (shifted) token
  # position to one of split_into chunks, take max() per chunk (1 if the word
  # occurs anywhere in that chunk), and sum the maxima to get one reduced
  # frequency. Averaging over all offsets gives the ARF.
  map_dbl(
    0:(split_length - 1),
    ~ aggregate(wordvec, by = list((ceiling((1:num_words + .) / split_length) %% split_into) + 1), max) %>%
      colSums() %>%
      tail(n = 1)
  ) %>% sum() / split_length
}

arf_vec <- map_dbl(
  realised_word_pos_list,
  ~ avg_red_freq(
    wordvec = .,
    num_words = num_words,
    split_length = split_length,
    split_into = split_into
  )
)

res1 <- tibble(
  word = unique(ordered_words_of_text),
  arf = arf_vec
) %>% arrange(desc(arf))

tictoc::toc()

## do it in parallel
library(doParallel)
library(foreach)

tictoc::tic(msg = "start 4-core parallel algorithm")
cl <- makeCluster(4)
registerDoParallel(cl)
par_res <- foreach(i = 1:length(realised_word_pos_list), .packages = "purrr") %dopar% {
  avg_red_freq(
    wordvec = realised_word_pos_list[[i]],
    num_words = num_words,
    split_length = split_length,
    split_into = split_into
  )
}
stopCluster(cl)

res2 <- tibble(
  word = unique(ordered_words_of_text),
  arf = par_res %>% unlist()
) %>% arrange(desc(arf))

tictoc::toc()

all.equal(res1,res2) # prove the parallel version gave same result, just faster

That's quite impressive!
The only thing is: split_into is supposed to be a different value for every PoS for which the avg_red_freq is calculated.
Do I need to include a variable for it or would this mess up the following code (i.e. for the code to work as it does I need to keep a fixed split_into value)?

Can you say more about this? It seems counterintuitive, assuming you are analysing a single text and want comparative metrics for the words in it. Presumably, with different splits, the measures will lose comparative meaning?
How would you calculate the split_into to use per word?


I am actually analysing multiple texts put together in a single corpus, where raw frequencies and simple adjusted frequencies can be misleading. The ARF should instead represent the frequency which a lemma+PoS would have if it were distributed homogeneously in the dataset. I know it's operationally convoluted, but this is the only way to calculate it. An example:

Our data frame has 327623 tokens.
The lemma 'house' (as a noun, hence the need for lemma-PoS, since it could be a verb, for example) has a raw frequency in the whole data frame of 946.
To calculate its ARF we first need to calculate all its reduced frequencies. To do that we split the data frame into 946 chunks, so that we get chunks of length 327623 / 946 (we will need to round this up, or down, obviously). Then I count how many chunks contain 'house'. It does not matter if there is more than 1 in a chunk, it will still count as 1 occurrence (this is the very principle of reduced frequency: if the word is equally distributed in the data frame, then if I split the data frame into 946 chunks I should theoretically find exactly 1 occurrence per chunk, which is clearly almost never the case).
The count of chunks containing 'house' is only one reduced frequency. We need to calculate as many reduced frequencies for 'house' as there are rows in a chunk: each time we split the data frame into chunks, we start the split from the (i + 1)-th token position, and each time the reduced frequency will likely differ. We then take the average of all the reduced frequencies for 'house' and move on to the next lemma+PoS, repeating the process until all lemmas+PoS are done and we have the ARF for all of them.
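In code, one reduced frequency for one offset would look something like this (a toy sketch with made-up names, not my actual data; tokens is the corpus as a character vector, target the lemma+PoS, i the offset):

tokens    <- c("a", "house", "b", "a", "house", "c", "a", "b", "house")  # toy corpus
target    <- "house"
n_chunks  <- sum(tokens == target)          # split into as many chunks as the raw frequency
chunk_len <- length(tokens) / n_chunks      # chunk length (fractional for real data)
i         <- 0                              # offset: shift the chunk boundaries by i tokens
chunk_id  <- (ceiling((seq_along(tokens) + i) / chunk_len) - 1) %% n_chunks + 1
red_freq  <- length(unique(chunk_id[tokens == target]))  # chunks containing the word
red_freq
# the ARF is then the mean of red_freq over all offsets i = 0, ..., chunk_len - 1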

I currently have:

  • A dataset with 327024 rows, each corresponding to a token reduced to its lemma-PoS (e.g. playing > play-VERB, played > play-VERB)
  • A frequency list of lemma-PoS (col1 = the lemma-PoS, e.g. 'house-NOUN', col2 = its raw frequency in the dataset, e.g. 946).

My first code above was trying to:

  • Take the value (n) in row1, col2 of the frequency list -> split the dataset into n chunks (so each chunk has 347 tokens [rounded up from 346.32452431]).
  • Count the chunks containing the lemma-PoS in row1, col1, and store the value (the reduced frequency) in a vector for later.
  • Split the data frame again into n chunks, this time starting from the second token in the data frame (i.e. in the first round, chunk1 included tokens 1:347; now chunk1 includes tokens 2:348 [and token 1 is moved to the bottom of the data frame, included in chunk946]). Calculate the reduced frequency. Repeat 347 times.
  • ARF = mean of all 347 reduced frequencies.

Let me know if there is anything more I could explain.

The code I gave you at the beginning works, but it takes circa 1 minute to calculate 1 ARF, so to calculate the ARF of all lemma-PoS I would need roughly 12 days.

Maybe you could run the code you began with against my example data, to compare the relative speed of calculation for yourself, and to see whether the metrics agree.
The issue I have is that the decision to divide the total text into n chunks, with n based on the raw frequency, seems arbitrary and not justified by any argument or theory that you have presented... Can you fill in this gap for me?
My feeling is that, for a given text (your corpus), you can only get useful information about any two words' respective ARFs if they are calculated in a like-for-like way, and the differing chunk choice would seem to invalidate that. How confident are you that chunking by the raw frequency is merited? The initial document specifying the ARF calculation that you linked to does not hint at this, from my reading of it.
Sorry for any misunderstanding.

Of course, I understand your scepticism! I can point at the relevant literature arguing for this approach:

  1. http://uivty.cs.cas.cz/~savicky/papers/commonness.pdf
  2. http://lrec.elra.info/proceedings/lrec2006/pdf/11_pdf.pdf

See in particular 1), p. 4, where the authors explain why ARF is used (they surely do a much better job of explaining the principle than I could!).

I hope this helps!

Thanks for that. I suppose it's a sort of normalisation...
The code doesn't need to change much to support it, but it can mean a longer runtime, because compared to my original parameter choices there are (on average, depending on the data) more divisions to cut the corpus up into. The time to execute my example goes up to over 2 minutes, or approximately 30 seconds when run on 4 cores.

library(purrr) #used for iteration
library(tibble)
library(dplyr)
library(tictoc) # only used for benchmarking the time taken for the code to run


#devtools::install_github("gadenbuie/lorem")
# use lorem ipsum words for example text
set.seed(42)
ordered_words_of_text <- lorem::ipsum_words(3000, collapse = FALSE)


num_words <- length(ordered_words_of_text)
unique_words <- length(unique(ordered_words_of_text))

freq_of_unique_words <- map_int(unique(ordered_words_of_text),
                                ~length(ordered_words_of_text[ordered_words_of_text==.]))

names(freq_of_unique_words) <- unique(ordered_words_of_text)


word_positions_list <- map(
  unique(ordered_words_of_text),
  ~ which(ordered_words_of_text == .)
)

new_vec <- function(vlength, vtrue_pos_list) {
  nv <- vector(length = vlength)
  nv[vtrue_pos_list] <- TRUE
  nv
}

realised_word_pos_list <- map(
  word_positions_list,
  ~ new_vec(num_words, .)
)

avg_red_freq <- function(wordvec, num_words, split_length, split_into) {
  map_dbl(
    0:(split_length - 1),
    ~ aggregate(wordvec, by = list((ceiling((1:num_words + .) / split_length) %% split_into) + 1), max) %>%
      colSums() %>%
      tail(n = 1)
  ) %>% sum() / split_length
}


tictoc::tic(msg = "start sequential algorithm")

arf_vec <- map2_dbl(
  realised_word_pos_list,
  freq_of_unique_words,
  ~ avg_red_freq(
    wordvec = .x,
    num_words = num_words,
    split_length = floor(num_words / .y),
    split_into = .y
  )
)

res1 <- tibble(
  word = unique(ordered_words_of_text),
  arf = arf_vec
) %>% arrange(desc(arf))

tictoc::toc()

## do it in parallel
library(doParallel)
library(foreach)

tictoc::tic(msg = "start 4-core parallel algorithm")
cl <- makeCluster(4)
registerDoParallel(cl)
par_res <- foreach(i = 1:length(realised_word_pos_list),
                   .packages = "purrr") %dopar% {
  avg_red_freq(
    wordvec = realised_word_pos_list[[i]],
    num_words = num_words,
    split_length =  floor(num_words / freq_of_unique_words[[i]]),
    split_into = freq_of_unique_words[[i]]
  )
}
stopCluster(cl)

res2 <- tibble(
  word = unique(ordered_words_of_text),
  arf = par_res %>% unlist()
) %>% arrange(desc(arf))

tictoc::toc()

all.equal(res1,res2)

I haven't set up qdap yet, to see if there's a function for that. I was surprised not to find one among all the other NLP tools. Just speculating: might a moving average approach work?

Looks like a viable solution; I'd like to try it on the real dataset. However, I now get the error Error: vector memory exhausted (limit reached?) straight after this bit of code:

realised_word_pos_list <- map(
  word_positions_list,
  ~ new_vec(num_words, .)
)

Did you ever get this error?

Are you on macOS, perhaps? I found this: https://stackoverflow.com/questions/51295402/r-on-macos-error-vector-memory-exhausted-limit-reached
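If that is the issue, the workaround discussed there is raising R_MAX_VSIZE in your ~/.Renviron and restarting R; something like this (the 16Gb value is just an illustration, pick whatever suits your machine's RAM):

# usethis::edit_r_environ() opens ~/.Renviron for editing; add a line like:
# R_MAX_VSIZE=16Gb
# then restart R and check what the session sees:
Sys.getenv("R_MAX_VSIZE")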

I used the peakRAM package to estimate the peak RAM consumed by my algorithm on the sample data I provided.

  Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1  first_algo()           121.36                0.2              51.6

I then ran it again with half the corpus size (1500) to see whether the memory use scales roughly linearly with corpus size.

  Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1  first_algo()            76.93                  0              37.1

So perhaps there's a risk that, if you run it on 100x the corpus size, it might need around 5 GB of memory? There are a lot of unknowns...
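For reference, the tables above came from wrapping my sequential pipeline in a function (first_algo(), as shown in the output) and measuring it roughly like this:

library(peakRAM)
peakRAM(first_algo())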


Could be a C stack limit. See https://stackoverflow.com/questions/43181989/how-to-set-the-size-of-the-c-stack-in-r
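A quick way to inspect the current C stack settings from within R, just to see whether that is the constraint:

Cstack_info()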

Yes, I run it on macOS... I tried the solution proposed in that link and it did not give me an error immediately like before.

However, the script has been running for 10+ hours now and I haven't even run the parallelisation part yet. Do you know of a way to check the progress of a script after it has been started?

It might be that your analysis is too ambitious to be executed in a 'reasonable' time given the compute you have access to (pay to run the analysis on a cloud service?).
Is the ARF of every token in your corpus of equal value to you, or would you mainly be interested in the lowest or highest ones? Perhaps you could get further with a hierarchical approach where you bootstrap-sample smaller subsets of the corpus before narrowing your focus on what to analyse.

Anyway, here is a parallel approach with logging and batching. It was interesting for me to work on this.

library(purrr) # used for iteration
library(tibble)
library(dplyr)
library(tictoc) # only used for benchmarking the time taken for the code to run


# devtools::install_github("gadenbuie/lorem")
# use lorem ipsum words for example text
set.seed(42)
ordered_words_of_text <- lorem::ipsum_words(3000, collapse = FALSE)


num_words <- length(ordered_words_of_text)
unique_words <- length(unique(ordered_words_of_text))

freq_of_unique_words <- map_int(
  unique(ordered_words_of_text),
  ~ length(ordered_words_of_text[ordered_words_of_text == .])
)

names(freq_of_unique_words) <- unique(ordered_words_of_text)


word_positions_list <- map(
  unique(ordered_words_of_text),
  ~ which(ordered_words_of_text == .)
)

new_vec <- function(vlength, vtrue_pos_list) {
  nv <- vector(length = vlength)
  nv[vtrue_pos_list] <- TRUE
  nv
}

realised_word_pos_list <- map(
  word_positions_list,
  ~ new_vec(num_words, .)
)

avg_red_freq <- function(wordvec, num_words, split_length, split_into) {
  map_dbl(
    0:(split_length - 1),
    ~ aggregate(wordvec, by = list((ceiling((1:num_words + .) / split_length) %% split_into) + 1), max) %>%
      colSums() %>%
      tail(n = 1)
  ) %>% sum() / split_length
}

# make a tibble to contain results
# every batch run will accumulate into this
final_store <- tibble(
  word = NA_character_,
  arf = NA_real_
) %>% filter(!is.na(word))

library(doParallel)
library(foreach)


writeLines(c(""), "log.txt") # clear out any old log


run_as_parallel_with_batch <- function(final_store,
                                       batch_label,
                                       batch_start,
                                       batch_end,
                                       realised_word_pos_list,
                                       ordered_words_of_text,
                                       freq_of_unique_words,
                                       num_words) {
  batch_of_word_pos_list <- realised_word_pos_list[batch_start:batch_end]
  batch_of_unique_words <- unique(ordered_words_of_text)[batch_start:batch_end]
  batch_freq_of_unique_words <- freq_of_unique_words[batch_start:batch_end]
  ## do it in parallel

  tictoc::tic(msg = "start 4-core parallel algorithm")
  sink("log.txt", append = TRUE)
  cat("\n START batch ", batch_label, " at ", format(Sys.time(), "%X %Y-%m-%d"))
  sink()
  cl <- makeCluster(4)
  registerDoParallel(cl)

  avg_red_freq <- function(wordvec, num_words, split_length, split_into) {
    map_dbl(
      0:(split_length - 1),
      ~ aggregate(wordvec, by = list((ceiling((1:num_words + .) / split_length) %% split_into) + 1), max) %>%
        colSums() %>%
        tail(n = 1)
    ) %>% sum() / split_length
  }

  par_res <- foreach(
    i = 1:length(batch_of_word_pos_list),
    .packages = "purrr"
  ) %dopar% {
    outstring <- paste0(
      "\nprocessing ", i, " of ", length(batch_of_word_pos_list),
      " from batch:", batch_label, " ", batch_start, "-", batch_end, " ", batch_of_unique_words[[i]]
    )
    sink("log.txt", append = TRUE)
    cat(outstring)
    sink()
    avg_red_freq(
      wordvec = batch_of_word_pos_list[[i]],
      num_words = num_words,
      split_length = floor(num_words / batch_freq_of_unique_words[[i]]),
      split_into = batch_freq_of_unique_words[[i]]
    )
  }
  stopCluster(cl)
  tictoc::toc()
  sink("log.txt", append = TRUE)
  cat("\n END batch ", batch_label, " at ", format(Sys.time(), "%X %Y-%m-%d"))
  cat("\n")
  sink()
  txt_of_log <- readLines(con = "log.txt")
  walk(txt_of_log, print)

  res2 <- tibble(
    word = batch_of_unique_words,
    arf = par_res %>% unlist()
  ) %>% arrange(desc(arf))

  final_store <- union(final_store, res2)
}

## EXAMPLE OF RUNNING AN ARBITRARY BATCH, I.E. COULD REPEATEDLY RUN THIS MANUALLY with different batch start and end parameters
# final_store <- run_as_parallel_with_batch(
#   final_store = final_store,
#   batch_label = 2,
#   batch_start = 1,
#   batch_end = 20,
#   realised_word_pos_list = realised_word_pos_list,
#   ordered_words_of_text = ordered_words_of_text,
#   freq_of_unique_words = freq_of_unique_words,
#   num_words = num_words
# )
# final_store

# EXAMPLE OF SETTING UP purrr pwalk to cover some of these
# my example data of 176 unique words divides into 8 batches length 22
num_tot_batches <- 8
batch_length <- 22
batch_labels <- 1:num_tot_batches
batch_starts <- (batch_labels - 1) * batch_length + 1
batch_ends   <- (batch_labels - 1) * batch_length + batch_length

## param pass demo
pwalk(
  .l = list(
    batch_labels,
    batch_starts,
    batch_ends
  ),
  .f = ~ cat("\n", ..1, " ", ..2, " ", ..3)
)

pwalk(
  .l = list(
    batch_labels,
    batch_starts,
    batch_ends
  ),
  .f = ~ {
    final_store <<- run_as_parallel_with_batch(
      final_store = final_store,
      batch_label = ..1,
      batch_start = ..2,
      batch_end = ..3,
      realised_word_pos_list = realised_word_pos_list,
      ordered_words_of_text = ordered_words_of_text,
      freq_of_unique_words = freq_of_unique_words,
      num_words = num_words
    )
  }
)

final_store %>% arrange(desc(arf))
> final_store %>% arrange(desc(arf))
# A tibble: 176 x 2
   word        arf
   <chr>     <dbl>
 1 auctor     18.5
 2 fusce      18.3
 3 ultrices   18  
 4 ut         16.7
 5 accumsan   16.6
 6 hendrerit  16.0
 7 tincidunt  16.0
 8 magna      15.9
 9 dignissim  15.7
10 vehicula   15.5
# ... with 166 more rows

Hey guys, sorry for the late reply and thanks for the answers.
I've tried running all your solutions, but there's no way my macOS alone manages to deal with them without crashing. I've thought about subsetting the lemmas I'm interested in, but the real issue with time/memory consumption seems rather to be the corpus itself, which can't be cut down.

Right now I'm proceeding slowly, calculating ARFs for a subset of lemmas at a time, to give my machine some rest.

I've noticed an improvement by reducing all data frame columns to simple vectors, and looking at the new profile it seems the only bit slowing the code down is appending. It would be really useful to have a way of counting how many members of a list contain a certain value, rather than asking the code to check each individual list member, return a 1 or 0, and then sum.
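This is the pattern I mean (just a toy sketch with made-up names, not my actual objects):

my_list <- list(c("a", "house"), c("b", "c"), c("house", "d"))   # toy chunks
sum(sapply(my_list, function(chunk) any(chunk == "house")))      # check each member, 1/0, then sum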

I guess your attempted approach was to get 100% exact ARF values from the corpus. Would it be of value to cut the corpus into 10 parts and process each independently, giving you 10 ARFs per term? You could then work with the statistical distribution of a term's ARF to estimate its 'true' value, in a partially statistical approach.
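A rough sketch of the splitting step (compute_arf() is just a placeholder for whichever ARF implementation you end up running on each part):

parts <- split(
  ordered_words_of_text,
  cut(seq_along(ordered_words_of_text), breaks = 10, labels = FALSE)
)
# arf_per_part <- purrr::map(parts, compute_arf)  # 10 ARF tables, one per part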


Or bootstrap sampling?

Hi all,

I could not resist delving into this after seeing the title "Speeding up R code", as that's one of the challenges I like the most in R :slight_smile:

I started out, like you guys, implementing the algorithm as explained on the referenced site. I had a fun time trying to come up with my own solution using a matrix implementation, but some of the intermediate calculations were still slow and not very practical.

Researching further, I stumbled across this site, which gives a more theoretical explanation. It took me a few reads to realize that there is simply a formula for calculating the ARF based on the distances between occurrences of a word in the text. The visual explanation in the first reference was just there to show the concept, but the formula is far more elegant and faster than that approach. So we were all a bit sidetracked, I think.
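For what it's worth, here is the formula as I've transcribed it (please double-check against that page): with N tokens in the text, a word occurring f times, v = N / f the average chunk length, and d_1, ..., d_f the gaps between successive occurrences (in the code below the last gap runs from the final occurrence to the end of the text),

ARF = (1/v) * sum_{i = 1..f} min(d_i, v)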

Based on this new information, I came up with this code:

library(dplyr)
library(purrr)

#Supply a vector of words in the order they appear in the text
ARF = function(myText){
  
  #Convert text to numerical factor vector
  myText = as.factor(myText)
  numericText = as.integer(myText)
  
  nTokens = length(myText)
  uniqueWords = nlevels(myText)
  
  #For every unique word, calculate its positions in the text (0-indexed so -1 for R)
  occurences = map(1:uniqueWords, function(x) which(numericText == x) - 1)
  
  #Calculate the ARF with the formula (https://wiki.korpus.cz/doku.php/en:pojmy:arf)
  result = map(1:uniqueWords, function(x){
    pos = occurences[[x]]
    
    occurence = length(pos)
    lengthParts = nTokens / occurence #The length of each part can be fractional
    
    #Calculate the distance between all occurrences (including last and first)
    dist = c(pos[-1], nTokens) - pos
    
    #Plug into formula
    sum(sapply(dist, function(x) min(x, lengthParts))) / lengthParts
  })
  
  data.frame(word = levels(myText), ARF = sapply(result, "[[", 1))
  
}

Let's test it on the example given in the first link

myText = rep("-", 60)
myText[c(0, 11, 13, 16, 56)+1] = "+"

ARF(myText)

  word       ARF
1    - 50.750000
2    +  2.666667
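To sanity-check the "+" row by hand: N = 60 and f = 5, so v = 60 / 5 = 12; the 0-indexed positions are 0, 11, 13, 16, 56, giving gaps of 11, 2, 3, 40, 4; capping each at v gives 11, 2, 3, 12, 4, which sums to 32, and 32 / 12 ≈ 2.6667, matching the output above.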

So this seems to be working...

Now the example as provided by @nirgrahamuk

library(lorem)

#Generate random text
set.seed(42)
myText = ipsum_words(3000, collapse = FALSE)

myARF = ARF(myText)
head(myARF)

     word       ARF
1        a 14.264000
2       ac 11.760000
3 accumsan 16.583333
4       ad  7.456000
5   aenean  9.060000
6  aliquam  9.190333

I tried this for 3000, 10000 and 100000 words and the result is near instantaneous! Of note, the lorem package has only 176 unique words, so in reality it might be a bit slower in a case where there are many more unique words.

The values I get are sometimes a little different from @nirgrahamuk's, but I think this is because he had to work around the case where the total length is not a multiple of the occurrence count (e.g. 60 / 5 is OK, but 62 / 5 is not). Because I use the formula instead of the graphical implementation, I can work with fractional chunk lengths, even though you can't draw those groups on a piece of paper. In the text they say that if you do it graphically, you have to have groups of the same size +/- 1 to get around this.

Please check whether what I did is correct; I hope it helps!

PJ


Fantastic, the mathematical approach of calculating from a list of word positions, rather than the 'mechanical' approach I was working with, is gold.

That is amazing, thank you so much!
It seems to work very well: as you say, there are differences between our ARFs and the ones from this method, but that must be due to the necessary workaround when the total length is not a multiple of the occurrence count. I will just need to exclude those with ARF < 1.
Pardon the dumb question: what does 0-indexed mean (hence the need for the -1 when applying the which() function)?

This relates to conventions around numbering: some systems count the first element as 0, others (like R) count it as 1. which() returns 1-based positions, so subtracting 1 converts them to the 0-based positions the formula works with.
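For example (just a toy vector to illustrate):

x <- c("b", "a", "b")
which(x == "b")       # 1 3  -> 1-indexed, R's convention
which(x == "b") - 1   # 0 2  -> 0-indexed, as used in the ARF function above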
