Speeding up R code

Thanks for that. I suppose its a sort of a normalisation...
The code doesn't need to change much to support it, but it can make for longer runtime because depending on the data compared to my original parameter choices, there are more divisions to cut the corpus up into (on average). The time to execute my example goes up to over 2min, or approx 30s when run on 4 cores.

library(purrr) #used for iteration
library(tibble)
library(dplyr)
library(tictoc) # only used for benchmarking the time taken for the code to run


#devtools::install_github("gadenbuie/lorem")
# use lorem ipsum words for example text
set.seed(42)
ordered_words_of_text <- lorem::ipsum_words(3000,collapse = FALSE)


num_words <- length(ordered_words_of_text)
unique_words <- length(unique(ordered_words_of_text))

freq_of_unique_words <- map_int(unique(ordered_words_of_text),
                                ~length(ordered_words_of_text[ordered_words_of_text==.]))

names(freq_of_unique_words) <- unique(ordered_words_of_text)


word_positions_list <- map(
  unique(ordered_words_of_text),
  ~ which(ordered_words_of_text == .)
)

new_vec <- function(vlength, vtrue_pos_list) {
  nv <- vector(length = vlength)
  nv[vtrue_pos_list] <- TRUE
  nv
}

realised_word_pos_list <- map(
  word_positions_list,
  ~ new_vec(num_words, .)
)

avg_red_freq <- function(wordvec, num_words, split_length, split_into) {
  map_dbl(
    0:(split_length - 1),
    ~ aggregate(wordvec, by = list((ceiling((1:num_words + .) / split_length) %% split_into) + 1), max) %>%
      colSums() %>%
      tail(n = 1)
  ) %>% sum() / split_length
}


tictoc::tic(msg = "start sequential algorithm")

arf_vec <- map2_dbl(
  realised_word_pos_list,
  freq_of_unique_words,
  ~ avg_red_freq(
    wordvec = .x,
    num_words = num_words,
    split_length = floor(num_words / .y),
    split_into = .y
  )
)

res1 <- tibble(
  word = unique(ordered_words_of_text),
  arf = arf_vec
) %>% arrange(desc(arf))

tictoc::toc()

## do it in parallel
library(doParallel)
library(foreach)

tictoc::tic(msg = "start 4-core parallel algorithm")
cl <- makeCluster(4)
registerDoParallel(cl)
par_res <- foreach(i = 1:length(realised_word_pos_list),
                   .packages = "purrr") %dopar% {
  avg_red_freq(
    wordvec = realised_word_pos_list[[i]],
    num_words = num_words,
    split_length =  floor(num_words / freq_of_unique_words[[i]]),
    split_into = freq_of_unique_words[[i]]
  )
}

res2 <- tibble(
  word = unique(ordered_words_of_text),
  arf = par_res %>% unlist()
) %>% arrange(desc(arf))

tictoc::toc()

all.equal(res1,res2)
2 Likes

I haven't set up qdap yet, to see if there's a function for that. I was surprised not to find one among all the other NLP tools. Just speculating: might a moving average approach work?

Looks like a viable solution, I'd like to try it on the real dataset. However, now I get the error Error: vector memory exhausted (limit reached?), straight after this bit of code:

realised_word_pos_list <- map(
  word_positions_list,
  ~ new_vec(num_words, .)
)

Did you ever get this error?

Are you on macOS perhaps ? I found this ... https://stackoverflow.com/questions/51295402/r-on-macos-error-vector-memory-exhausted-limit-reached.

I used the peakRAM package to estimate the peak ram consumed by my algorithm, on the sample data I provided.

  Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1  first_algo()           121.36                0.2              51.6

I then ran again, with half the corpus size (1500) to see if the memory scales roughly linear for the corpus.

  Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1  first_algo()            76.93                  0              37.1

so perhaps theres a risk if you run it on 100x corpus size, it might need 5Gb memory ? theres a lot of unkowns...

2 Likes

Could be a C stack limit. See https://stackoverflow.com/questions/43181989/how-to-set-the-size-of-the-c-stack-in-r

Yes, I run it on macOS... I'm trying the solution proposed in that link and it did not give me an error immediately like before.

However, the script has been running for 10h+ now and I haven't even inputted the bit on parallelisation. Do you know of a way to check progress of script-running after it has been started?

It might be that your analysis is too ambitious to be executed in 'reasonable' time given the compute you have access to. (pay to run analysis on a cloud service ?)
Is the arf of every token in your corpus of equal value to you, or would you be interested in the lowest / or / highest. because, perhaps you can get further with a heirarchical approach where you bootstrap sample on smaller subsets of corpus, before narrowing focus on what to analyse.

Anyways, here is a parallel approach with logging and batching. It was interesting to me to work on this.

library(purrr) # used for iteration
library(tibble)
library(dplyr)
library(tictoc) # only used for benchmarking the time taken for the code to run


# devtools::install_github("gadenbuie/lorem")
# use lorem ipsum words for example text
set.seed(42)
ordered_words_of_text <- lorem::ipsum_words(3000, collapse = FALSE)


num_words <- length(ordered_words_of_text)
unique_words <- length(unique(ordered_words_of_text))

freq_of_unique_words <- map_int(
  unique(ordered_words_of_text),
  ~ length(ordered_words_of_text[ordered_words_of_text == .])
)

names(freq_of_unique_words) <- unique(ordered_words_of_text)


word_positions_list <- map(
  unique(ordered_words_of_text),
  ~ which(ordered_words_of_text == .)
)

new_vec <- function(vlength, vtrue_pos_list) {
  nv <- vector(length = vlength)
  nv[vtrue_pos_list] <- TRUE
  nv
}

realised_word_pos_list <- map(
  word_positions_list,
  ~ new_vec(num_words, .)
)

avg_red_freq <- function(wordvec, num_words, split_length, split_into) {
  map_dbl(
    0:(split_length - 1),
    ~ aggregate(wordvec, by = list((ceiling((1:num_words + .) / split_length) %% split_into) + 1), max) %>%
      colSums() %>%
      tail(n = 1)
  ) %>% sum() / split_length
}

# make a tibble to contain results
# every batch run will accumulate into this
final_store <- tibble(
  word = NA_character_,
  arf = NA_real_
) %>% filter(!is.na(word))

library(doParallel)
library(foreach)


writeLines(c(""), "log.txt") # clear out any old log


run_as_parallel_with_batch <- function(final_store,
                                       batch_label,
                                       batch_start,
                                       batch_end,
                                       realised_word_pos_list,
                                       ordered_words_of_text,
                                       freq_of_unique_words,
                                       num_words) {
  batch_of_word_pos_list <- realised_word_pos_list[batch_start:batch_end]
  batch_of_unique_words <- unique(ordered_words_of_text)[batch_start:batch_end]
  batch_freq_of_unique_words <- freq_of_unique_words[batch_start:batch_end]
  ## do it in parallel

  tictoc::tic(msg = "start 4-core parallel algorithm")
  sink("log.txt", append = TRUE)
  cat("\n START batch ", batch_label, " at ", format(Sys.time(), "%X %Y-%m-%d"))
  sink()
  cl <- makeCluster(4)
  registerDoParallel(cl)

  avg_red_freq <- function(wordvec, num_words, split_length, split_into) {
    map_dbl(
      0:(split_length - 1),
      ~ aggregate(wordvec, by = list((ceiling((1:num_words + .) / split_length) %% split_into) + 1), max) %>%
        colSums() %>%
        tail(n = 1)
    ) %>% sum() / split_length
  }

  par_res <- foreach(
    i = 1:length(batch_of_word_pos_list),
    .packages = "purrr"
  ) %dopar% {
    outstring <- paste0(
      "\nprocessing ", i, " of ", length(batch_of_word_pos_list),
      " from batch:", batch_label, " ", batch_start, "-", batch_end, " ", batch_of_unique_words[[i]]
    )
    sink("log.txt", append = TRUE)
    cat(outstring)
    sink()
    avg_red_freq(
      wordvec = batch_of_word_pos_list[[i]],
      num_words = num_words,
      split_length = floor(num_words / batch_freq_of_unique_words[[i]]),
      split_into = batch_freq_of_unique_words[[i]]
    )
  }
  stopCluster(cl)
  tictoc::toc()
  sink("log.txt", append = TRUE)
  cat("\n END batch ", batch_label, " at ", format(Sys.time(), "%X %Y-%m-%d"))
  cat("\n")
  sink()
  txt_of_log <- readLines(con = "log.txt")
  walk(txt_of_log, print)

  res2 <- tibble(
    word = batch_of_unique_words,
    arf = par_res %>% unlist()
  ) %>% arrange(desc(arf))

  final_store <- union(final_store, res2)
}

## EXAMPLE OF RUNNING AN ARBITRARY BATCH, I.E. COULD REPEAT RUN THIS MANUALLY with differenct batch start and end parameters
# final_store <- run_as_parallel_with_batch(
#   final_store = final_store,
#   batch_label = 2,
#   batch_start = 1,
#   batch_end = 20,
#   realised_word_pos_list = realised_word_pos_list,
#   ordered_words_of_text = ordered_words_of_text,
#   freq_of_unique_words = freq_of_unique_words,
#   num_words = num_words
# )
# final_store

# EXAMPLE OF SETTING UP purrr pwalk to cover some of these
# my example data of 176 unique words divides into 8 batches length 22
num_tot_batches <- 8
batch_length <- 22
batch_labels <- 1:num_tot_batches
batch_starts <- (batch_labels - 1) * batch_length + 1
batch_ends   <- (batch_labels - 1) * batch_length + batch_length

## param pass demo
pwalk(
  .l = list(
    batch_labels,
    batch_starts,
    batch_ends
  ),
  .f = ~ cat("\n", ..1, " ", ..2, " ", ..3)
)

pwalk(
  .l = list(
    batch_labels,
    batch_starts,
    batch_ends
  ),
  .f = ~ {
    final_store <<- run_as_parallel_with_batch(
      final_store = final_store,
      batch_label = ..1,
      batch_start = ..2,
      batch_end = ..3,
      realised_word_pos_list = realised_word_pos_list,
      ordered_words_of_text = ordered_words_of_text,
      freq_of_unique_words = freq_of_unique_words,
      num_words = num_words
    )
  }
)

final_store %>% arrange(desc(arf))
> final_store %>% arrange(desc(arf))
# A tibble: 176 x 2
   word        arf
   <chr>     <dbl>
 1 auctor     18.5
 2 fusce      18.3
 3 ultrices   18  
 4 ut         16.7
 5 accumsan   16.6
 6 hendrerit  16.0
 7 tincidunt  16.0
 8 magna      15.9
 9 dignissim  15.7
10 vehicula   15.5
# ... with 166 more rows
2 Likes

Hey guys, sorry for the late reply and thanks for the answers.
I've tried running all your solutions, but there's no way my macOS alone manages to deal with them without crashing. I've thought about subsetting the lemmas I'm interested in, but the real issues in time/memory consume seems to rather be the corpus itself, which can't be cut off.

Right now I'm proceding slowly, calculating ARFs for a subset of lemmas at a time, to give my machine some rest.

I've noticed an improvement by reducing all data frames' columns into simple vectors, and looking at the new profile it seems that the only bit slowing the code down is appending. It would be really useful to have a way of counting how many members of a list contain a certain value, rather than asking the code to check each individual list member, then return a 1 or 0 value and then sum.

I guess your attempted approach was to get 100% exact values for ARF from the corpus. Would it be of value to cut the corpus into 10 parts, and process each independently. giving you 10Arfs per term. You could then work with the statistical distribution of a terms Arf to estimate, its 'true' value , in a partially statistical approach. ?

1 Like

Or bootstrap sampling?

HI all,

I could not resist delving into this seeing the title "Speeding up R code" as that's one of the challenges I like the most in R :slight_smile:

I started out implementing like you guys the algorithm as explained on the referenced site. I got a fun time trying to come up with my own solution using a matrix implementation, but still some of the intermediate calculations were slow and not very practical.

Researching more I stumbled across this site that gave a more theoretical explanation. And it took me a few reads to realize that there is just a formula to calculate this ARF based on the distances between words in the text. The visual explanation in the first reference was just to show the concept, but the formula is way more elegant and faster than that approach. So we were all a bit sidetracked I think.

Based on this new information, I came up with this code:

library(dplyr)
library(purrr)

#Supply a vector of words in the order they appear in the text
ARF = function(myText){
  
  #Convert text to numerical factor vector
  myText = as.factor(myText)
  numericText = as.integer(myText)
  
  nTokens = length(myText)
  uniqueWords = nlevels(myText)
  
  #For every unique word, calculate its positions in the text (0-indexed so -1 for R)
  occurences = map(1:uniqueWords, function(x) which(numericText == x) - 1)
  
  #Calculate the ARF with the formula (https://wiki.korpus.cz/doku.php/en:pojmy:arf)
  result = map(1:uniqueWords, function(x){
    pos = occurences[[x]]
    
    occurence = length(pos)
    lengthParts = nTokens / occurence #The length of each part can be fractional
    
    #Calculate the distance between all occurences (inclusing last and first)
    dist = c(pos[-1], nTokens) - pos
    
    #Plug into formula
    sum(sapply(dist, function(x) min(x, lengthParts))) / lengthParts
  })
  
  data.frame(word = levels(myText), ARF = sapply(result, "[[", 1))
  
}

Let's test it on the example given in the first link

myText = rep("-", 60)
myText[c(0, 11, 13, 16, 56)+1] = "+"

ARF(myText)

  word       ARF
1    - 50.750000
2    +  2.666667

So this seems to be working...

Now the example as provided by @nirgrahamuk

library(lorem)

#Generate random text
set.seed(42)
myText = ipsum_words(3000, collapse = FALSE)

myARF = ARF(myText)
head(myARF)

     word       ARF
1        a 14.264000
2       ac 11.760000
3 accumsan 16.583333
4       ad  7.456000
5   aenean  9.060000
6  aliquam  9.190333

I tried this for 3000, 10000 and 100000 words and the result in near instantaneous! Of note, the lorem package has only 176 unique words so in reality that might be a bit slower in a case where there are many more unique words.

The values I get are sometimes a little different from @nirgrahamuk, but I think this is because he tried to workaround the case where the occurrence is not a multiple of the total length (e.g. 60 / 5 is OK, but 62 / 5 is not). Because I use the formula instead of the graphical implementation, I can work with fractional numbers, although you can't draw those groups on a piece of paper. In the text they say if you did it graphically, you have to have groups that are the same size +/-1 so get around this.

Please check if what I did is correct, and hope it helps!

PJ

3 Likes

fantastic, the mathematics to calculate from a list of word positions rather than the 'mechanical' approach I was working with is gold

That is amazing, thank you so much!
It seems to work very well: as you say, there are differences in our ARFs vs the ones from this method, but that must be due to the necessary workaround when occurrence is not a multiple of the total length. I will just need to exclude those with ARF < 1.
Pardon the dumb question: what does 0-indexed mean (hence the need for -1 when applying the which function)?

this relates to conventions around numbering from 0 as first element on up, or from 1 as first element ...

1 Like