Speeding up R code

Hi R experts,

I'm going mental trying to make the following code run faster. I'm quite new to R, so I need targeted help as I'm not sure where to start.
As a bit of context: the code runs on a df with 300k rows. The code splits this df into smaller chunks, each with n rows (n = nrow(df) / a number taken from another df). It then checks whether a value occurs in each chunk, returning 1 or 0; these are summed and the result is stored in a vector called rf. The process is repeated n times, each time splitting the df starting one row lower, until row number n is reached. All values in rf are then averaged and this average is stored in a vector called arf. The whole process is repeated until all rows in the second df are covered.

Here's the code

arf <- vector()
n <- nrow(data_frame_a)
e <- 1

repeat {
  # chunk length for this row of data_frame_b, and the resulting number of chunks
  chunk <- ceiling(as.vector(nrow(data_frame_a) / data_frame_b[e, 2], mode = 'numeric'))
  s <- as.vector(n / chunk, mode = 'numeric') # number of chunks
  r <- rep(1:ceiling(s), each = floor(n / s))[1:n] # chunk id for each row
  rf <- vector()

  a <- 1
  b <- 2
  c <- 0
  d <- 0

  repeat {
    # rebuild the data frame shifted by one row, then split it into chunks
    df1 <- data_frame_a
    df1 <- rbind(df1[a, ], df1[b:(nrow(df1)), ], df1[c:d, ])
    df2 <- split(df1, r)
    u <- 1
    logval <- vector()

    repeat {
      # 1 if the target value occurs in this chunk, 0 otherwise
      logval <- c(logval, ifelse(data_frame_b[e, 1] %in% df2[[u]]$col1, 1, 0))
      u <- u + 1
      if (u == ceiling(s)) {
        break
      }
    }
    rf <- c(rf, sum(logval)) # one reduced frequency per shift
    a <- a + 1
    b <- b + 1
    c <- 1
    d <- d + 1
    if (d == ceiling(chunk)) {
      break
    }
  }
  arf <- c(arf, mean(rf)) # average reduced frequency for this row of data_frame_b
  e <- e + 1
  if (e == nrow(data_frame_b)) {
    break
  }
}

Any help hugely appreciated guys!!

Thank you in advance,

Nils

It's almost 2am here, so I'll have to get back near noon PDT. Short answer: this wants vectorization and a slider-package-style windowing function, combined with purrr::map, to avoid all the creation and destruction of interim objects.
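
To give a flavour of what I mean before I can write a fuller answer, here is a minimal sketch of the vectorised chunk test for one target word and one chunking, using purrr; the names tokens, target and n_chunks are placeholders, not your real objects:

library(purrr)

tokens   <- c("a", "b", "a", "c", "a", "d")   # stand-in for data_frame_a$col2
target   <- "a"                               # stand-in for data_frame_b[e, 1]
n_chunks <- 3

# one chunk id per token, instead of rbind-ing and re-splitting data frames
chunk_id <- ceiling(seq_along(tokens) / (length(tokens) / n_chunks))

# one reduced frequency: how many of the chunks contain the target at all?
sum(map_lgl(split(tokens, chunk_id), ~ target %in% .x))
#> [1] 3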

1 Like

That's already quite useful, thanks for your help. I'll see what I can do and post it here as this might be useful to other users too.

Cheers!

1 Like

It would likely be impractical for you to share data frames with 300k records,
but you could provide a representative sample of such a frame using the sample() function, head(), or some other method. With a smaller frame, make it copy-and-pasteable to this forum via dput(), and then we could profile the code.
You could run the profiler yourself, of course; it's often useful to see which functions account for the greatest proportion of the processing time.
Also, I'm wondering whether your process is parallelisable. My instinct is that it would be, which would mean you could get all your CPU cores involved in calculating your results and reduce your runtime; something like the parallel and foreach packages, perhaps.
Of course, I'd only make such performance-improvement efforts for code that will be run again and again over many days, and potentially by others. If it's one-time code for a one-shot analysis, I'd probably just be content to let the laptop work overnight while I slept.
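
For example, a rough sketch of running the profiler; slow_demo() here is just a stand-in for your repeat loop, not your actual code:

# install.packages("profvis")   # if you don't already have it
library(profvis)

# stand-in for the real loop: any slow code can go inside profvis({ ... })
slow_demo <- function() {
  x <- numeric(0)
  for (i in 1:20000) x <- c(x, mean(runif(500)))  # growing a vector, as rf/arf do
  x
}

profvis({ slow_demo() })   # interactive flame graph in RStudio

# base-R alternative
Rprof("profile.out")
invisible(slow_demo())
Rprof(NULL)
head(summaryRprof("profile.out")$by.self)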

2 Likes

I've made a profile of a sample. If I share the profile as a file perhaps you could help me understand it? (I need to attach the files in four separate posts, sorry.)

Let me know if this is clear enough!
Thanks so much

Where the bars are longest is where the code is spending its time, i.e. where it is 'slowest'.
Could you try to provide your data? I'd rather try to optimise your code myself than dictate things for you to try.

2 Likes

Sure, here is a subset of the two data frames needed (200 rows of data_frame_a and 50 rows of data_frame_b, as dput output below). Just to recap, the goal of the script is:

  1. For each row in data_frame_b: divide data_frame_a into n chunks (n = the value in data_frame_b, column 2).

  2. Check whether the character value in data_frame_b, column 1 is in each of those chunks, assigning 1 to every chunk in which it is found and 0 to every chunk in which it isn't; then sum the 1s and store the result in a vector (rf).

  3. Repeat by splitting data_frame_a into n chunks again, but this time starting from row 2 (so that row 1 is included in the last chunk instead). Repeat this m times (m = the number of rows in a chunk).

  4. Calculate the average of rf and store the value in a vector (arf).

  5. Repeat for each row of data_frame_b.

Here's the data frames:

data_frame_a

structure(list(Text = c("Aliens", "love", "underpants", "Of",
"every", "shape", "and", "size", "But", "there", "are", "no",
"underpants", "in", "space", "So", "here", "'s", "the", "big",
"surprise", "When", "aliens", "fly", "down", "to", "Earth", "They",
"do", "n't", "come", "to", "meet", "YOU", "They", "simply", "want",
"your", "underpants", "I", "'ll", "bet", "you", "never", "knew",
"Their", "spaceships", "'s", "radar", "bleeps", "and", "blinks",
"The", "moment", "that", "it", "sees", "A", "washing", "line",
"of", "underpants", "All", "flapping", "in", "the", "breeze",
"They", "land", "in", "your", "back", "garden", "Though", "they",
"have", "n't", "been", "invited", "Oooooh", "UNDERPANTS", "they",
"chant", "And", "dance", "around", "delighted", "They", "like",
"them", "red", "they", "like", "them", "green", "Or", "orange",
"like", "satsumas", "But", "best", "of", "all", "they", "love",
"the", "sight", "Of", "Granny", "'s", "spotted", "bloomers",
"Mum", "'s", "pink", "frilly", "knickers", "Are", "a", "perfect",
"place", "to", "hide", "And", "Grandpa", "'s", "woolly", "long",
"johns", "Make", "a", "super-whizzy", "slide", "In", "daring",
"competitions", "Held", "up", "by", "just", "one", "peg", "They",
"count", "how", "many", "aliens", "Can", "squeeze", "into", "each",
"leg", "They", "wear", "pants", "on", "their", "feet", "and",
"heads", "And", "other", "silly", "places", "They", "fly", "pants",
"from", "their", "spaceships", "and", "Hold", "Upside-Down-Pant",
"Races", "As", "they", "go", "zinging", "through", "the", "air",
"It", "really", "is", "pants-tastic", "What", "fun", "the", "aliens",
"can", "have", "With", "pingy", "pants", "elastic", "It", "'s",
"not", "your", "neighbour"), col2 = c("alien_NNS", "love_VBP",
"underpants_NNS", "of_IN", "every_DT", "shape_NN", "and_CC",
"size_NN", "but_CC", "there_EX", "be_VBP", "no_DT", "underpants_NNS",
"in_IN", "space_NN", "so_RB", "here_RB", "be_VBZ", "the_DT",
"big_JJ", "surprise_NN", "when_WRB", "alien_NNS", "fly_VBP",
"down_RB", "to_TO", "earth_NNP", "they_PRP", "do_VBP", "not_RB",
"come_VB", "to_TO", "meet_VB", "you_PRP", "they_PRP", "simply_RB",
"want_VB", "you_PRP$", "underpants_NNS", "i_PRP", "will_MD",
"bet_VB", "you_PRP", "never_RB", "know_VBD", "they_PRP$", "spaceship_NNS",
"'s_POS", "radar_NN", "bleep_NNS", "and_CC", "blink_VBZ", "the_DT",
"moment_NN", "that_IN", "it_PRP", "see_VBZ", "a_DT", "wash_VBG",
"line_NN", "of_IN", "underpants_NNS", "all_DT", "flap_VBG", "in_IN",
"the_DT", "breeze_NN", "they_PRP", "land_VBP", "in_IN", "you_PRP$",
"back_JJ", "garden_NN", "though_IN", "they_PRP", "have_VBP",
"not_RB", "be_VBN", "invite_VBN", "oooooh_NNP", "underpants_NNP",
"they_PRP", "chant_VBP", "and_CC", "dance_NN", "around_RB", "delighted_JJ",
"they_PRP", "like_VBP", "they_PRP", "red_JJ", "they_PRP", "like_VBP",
"they_PRP", "green_JJ", "or_CC", "orange_NN", "like_IN", "satsuma_NNS",
"but_CC", "best_JJS", "of_IN", "all_DT", "they_PRP", "love_VBP",
"the_DT", "sight_NN", "of_IN", "granny_NNP", "'s_POS", "spotted_JJ",
"bloomers_NNS", "mum_NNP", "'s_POS", "pink_JJ", "frilly_JJ",
"knickers_NNS", "be_VBP", "a_DT", "perfect_JJ", "place_NN", "to_TO",
"hide_VB", "and_CC", "grandpa_NNP", "'s_POS", "woolly_JJ", "long_JJ",
"john_NNS", "make_VBP", "a_DT", "super-whizzy_JJ", "slide_NN",
"in_IN", "daring_JJ", "competition_NNS", "hold_VBN", "up_RP",
"by_IN", "just_RB", "one_CD", "peg_VB", "they_PRP", "count_VBP",
"how_WRB", "many_JJ", "alien_NNS", "can_MD", "squeeze_VB", "into_IN",
"each_DT", "leg_NN", "they_PRP", "wear_VBP", "pants_NNS", "on_IN",
"they_PRP$", "foot_NNS", "and_CC", "head_NNS", "and_CC", "other_JJ",
"silly_JJ", "place_NNS", "they_PRP", "fly_VBP", "pants_NNS",
"from_IN", "they_PRP$", "spaceship_NNS", "and_CC", "hold_VB",
"upside-down-pant_NNP", "races_NN", "as_IN", "they_PRP", "go_VBP",
"zing_VBG", "through_IN", "the_DT", "air_NN", "it_PRP", "really_RB",
"be_VBZ", "pants-tastic_JJ", "what_WDT", "fun_NN", "the_DT",
"alien_NNS", "can_MD", "have_VB", "with_IN", "pingy_NN", "pants_NNS",
"elastic_JJ", "it_PRP", "be_VBZ", "not_RB", "you_PRP$", "neighbour_NN"
)), class = "data.frame", row.names = c(NA, 200L))

data_frame_b

structure(list(col1 = c("the_DT", "and_CC", "a_DT", "i_PRP",
"to_TO", "he_PRP", "be_VBD", "say_VBD", "it_PRP", "you_PRP",
"be_VBZ", "not_RB", "of_IN", "in_IN", "she_PRP", "be_VBP", "they_PRP",
"on_IN", "he_PRP$", "for_IN", "but_CC", "with_IN", "have_VBD",
"at_IN", "'s_POS", "she_PRP$", "be_VB", "what_WP", "my_PRP$",
"we_PRP", "that_DT", "as_IN", "that_IN", "do_VBP", "can_MD",
"henry_NNP", "would_MD", "then_RB", "this_DT", "all_DT", "will_MD",
"up_RP", "no_DT", "have_VBP", "one_CD", "very_RB", "so_RB", "could_MD",
"when_WRB", "there_EX"), n = c(16254L, 10381L, 8936L, 7168L,
6907L, 5393L, 5014L, 4554L, 4460L, 4431L, 4008L, 3941L, 3746L,
3594L, 3301L, 2691L, 2686L, 2522L, 2513L, 2021L, 1984L, 1859L,
1732L, 1701L, 1667L, 1659L, 1396L, 1376L, 1371L, 1363L, 1277L,
1260L, 1228L, 1201L, 1197L, 1190L, 1166L, 1162L, 1152L, 1125L,
1119L, 1111L, 1039L, 1028L, 985L, 934L, 917L, 905L, 894L, 886L
)), class = "data.frame", row.names = c(NA, 50L))

Thanks!

1 Like

Can you post a reprex with a representative results object?

Not sure I'm following your logic. Is the output what you expect?

suppressPackageStartupMessages(library(dplyr)) 

a <- structure(list(Text = c("Aliens", "love", "underpants", "Of",
"every", "shape", "and", "size", "But", "there", "are", "no",
"underpants", "in", "space", "So", "here", "'s", "the", "big",
"surprise", "When", "aliens", "fly", "down", "to", "Earth", "They",
"do", "n't", "come", "to", "meet", "YOU", "They", "simply", "want",
"your", "underpants", "I", "'ll", "bet", "you", "never", "knew",
"Their", "spaceships", "'s", "radar", "bleeps", "and", "blinks",
"The", "moment", "that", "it", "sees", "A", "washing", "line",
"of", "underpants", "All", "flapping", "in", "the", "breeze",
"They", "land", "in", "your", "back", "garden", "Though", "they",
"have", "n't", "been", "invited", "Oooooh", "UNDERPANTS", "they",
"chant", "And", "dance", "around", "delighted", "They", "like",
"them", "red", "they", "like", "them", "green", "Or", "orange",
"like", "satsumas", "But", "best", "of", "all", "they", "love",
"the", "sight", "Of", "Granny", "'s", "spotted", "bloomers",
"Mum", "'s", "pink", "frilly", "knickers", "Are", "a", "perfect",
"place", "to", "hide", "And", "Grandpa", "'s", "woolly", "long",
"johns", "Make", "a", "super-whizzy", "slide", "In", "daring",
"competitions", "Held", "up", "by", "just", "one", "peg", "They",
"count", "how", "many", "aliens", "Can", "squeeze", "into", "each",
"leg", "They", "wear", "pants", "on", "their", "feet", "and",
"heads", "And", "other", "silly", "places", "They", "fly", "pants",
"from", "their", "spaceships", "and", "Hold", "Upside-Down-Pant",
"Races", "As", "they", "go", "zinging", "through", "the", "air",
"It", "really", "is", "pants-tastic", "What", "fun", "the", "aliens",
"can", "have", "With", "pingy", "pants", "elastic", "It", "'s",
"not", "your", "neighbour"), col2 = c("alien_NNS", "love_VBP",
"underpants_NNS", "of_IN", "every_DT", "shape_NN", "and_CC",
"size_NN", "but_CC", "there_EX", "be_VBP", "no_DT", "underpants_NNS",
"in_IN", "space_NN", "so_RB", "here_RB", "be_VBZ", "the_DT",
"big_JJ", "surprise_NN", "when_WRB", "alien_NNS", "fly_VBP",
"down_RB", "to_TO", "earth_NNP", "they_PRP", "do_VBP", "not_RB",
"come_VB", "to_TO", "meet_VB", "you_PRP", "they_PRP", "simply_RB",
"want_VB", "you_PRP$", "underpants_NNS", "i_PRP", "will_MD",
"bet_VB", "you_PRP", "never_RB", "know_VBD", "they_PRP$", "spaceship_NNS",
"'s_POS", "radar_NN", "bleep_NNS", "and_CC", "blink_VBZ", "the_DT",
"moment_NN", "that_IN", "it_PRP", "see_VBZ", "a_DT", "wash_VBG",
"line_NN", "of_IN", "underpants_NNS", "all_DT", "flap_VBG", "in_IN",
"the_DT", "breeze_NN", "they_PRP", "land_VBP", "in_IN", "you_PRP$",
"back_JJ", "garden_NN", "though_IN", "they_PRP", "have_VBP",
"not_RB", "be_VBN", "invite_VBN", "oooooh_NNP", "underpants_NNP",
"they_PRP", "chant_VBP", "and_CC", "dance_NN", "around_RB", "delighted_JJ",
"they_PRP", "like_VBP", "they_PRP", "red_JJ", "they_PRP", "like_VBP",
"they_PRP", "green_JJ", "or_CC", "orange_NN", "like_IN", "satsuma_NNS",
"but_CC", "best_JJS", "of_IN", "all_DT", "they_PRP", "love_VBP",
"the_DT", "sight_NN", "of_IN", "granny_NNP", "'s_POS", "spotted_JJ",
"bloomers_NNS", "mum_NNP", "'s_POS", "pink_JJ", "frilly_JJ",
"knickers_NNS", "be_VBP", "a_DT", "perfect_JJ", "place_NN", "to_TO",
"hide_VB", "and_CC", "grandpa_NNP", "'s_POS", "woolly_JJ", "long_JJ",
"john_NNS", "make_VBP", "a_DT", "super-whizzy_JJ", "slide_NN",
"in_IN", "daring_JJ", "competition_NNS", "hold_VBN", "up_RP",
"by_IN", "just_RB", "one_CD", "peg_VB", "they_PRP", "count_VBP",
"how_WRB", "many_JJ", "alien_NNS", "can_MD", "squeeze_VB", "into_IN",
"each_DT", "leg_NN", "they_PRP", "wear_VBP", "pants_NNS", "on_IN",
"they_PRP$", "foot_NNS", "and_CC", "head_NNS", "and_CC", "other_JJ",
"silly_JJ", "place_NNS", "they_PRP", "fly_VBP", "pants_NNS",
"from_IN", "they_PRP$", "spaceship_NNS", "and_CC", "hold_VB",
"upside-down-pant_NNP", "races_NN", "as_IN", "they_PRP", "go_VBP",
"zing_VBG", "through_IN", "the_DT", "air_NN", "it_PRP", "really_RB",
"be_VBZ", "pants-tastic_JJ", "what_WDT", "fun_NN", "the_DT",
"alien_NNS", "can_MD", "have_VB", "with_IN", "pingy_NN", "pants_NNS",
"elastic_JJ", "it_PRP", "be_VBZ", "not_RB", "you_PRP$", "neighbour_NN"
)), class = "data.frame", row.names = c(NA, 200L))

# Text is superfluous and can be reconstructed from col2 if needed 
a %>% select(-Text) -> a

b <- structure(list(col1 = c("the_DT", "and_CC", "a_DT", "i_PRP",
"to_TO", "he_PRP", "be_VBD", "say_VBD", "it_PRP", "you_PRP",
"be_VBZ", "not_RB", "of_IN", "in_IN", "she_PRP", "be_VBP", "they_PRP",
"on_IN", "he_PRP$", "for_IN", "but_CC", "with_IN", "have_VBD",
"at_IN", "'s_POS", "she_PRP$", "be_VB", "what_WP", "my_PRP$",
"we_PRP", "that_DT", "as_IN", "that_IN", "do_VBP", "can_MD",
"henry_NNP", "would_MD", "then_RB", "this_DT", "all_DT", "will_MD",
"up_RP", "no_DT", "have_VBP", "one_CD", "very_RB", "so_RB", "could_MD",
"when_WRB", "there_EX"), n = c(16254L, 10381L, 8936L, 7168L,
6907L, 5393L, 5014L, 4554L, 4460L, 4431L, 4008L, 3941L, 3746L,
3594L, 3301L, 2691L, 2686L, 2522L, 2513L, 2021L, 1984L, 1859L,
1732L, 1701L, 1667L, 1659L, 1396L, 1376L, 1371L, 1363L, 1277L,
1260L, 1228L, 1201L, 1197L, 1190L, 1166L, 1162L, 1152L, 1125L,
1119L, 1111L, 1039L, 1028L, 985L, 934L, 917L, 905L, 894L, 886L
)), class = "data.frame", row.names = c(NA, 50L))

# create combined object with only tokens common to both
# first rename columns
a %>% rename(x = col2) -> a
b %>% rename(x = col1) -> b
# c has all the 1 results
inner_join(a,b, by = "x") -> c
# flag them with rf_ = 1
c %>% mutate(rf_ = 1) -> c
# d has all the 0 results
anti_join(a,b, by = "x") -> d
# flag them with rf_ = 0 (rf_ rather than rf, because rf is a function in stats, which is always loaded)
d %>%  mutate(rf_ = 0) -> d

c %>% select(rf_) -> rf_1
d %>% select(rf_) -> rf_2
rbind(rf_1,rf_2) -> rf_df
mean(rf_df$rf_)
#> [1] 0.39

Created on 2020-03-22 by the reprex package (v0.3.0)

Thank you technocrat, but no, my bad... I think I haven't given you a representative data frame at all. I'm trying to do that with reprex now.

What I'm trying to do is to reproduce in R what is explained here very clearly:

Let me see if I can try and reproduce a representative sample.
Thanks again in the meanwhile!

1 Like

There may be a function in the qdap package to do this, but I'll need to install it on my Ubuntu box due to convoluted Java issues with Catalina.

I had a go at implementing this. I generated 3000 words of lorem ipsum example text and ran my basic algorithm on it; it took about 24 seconds on my laptop. I then tried running it on 4 cores of the same laptop and got it down to about 8 seconds (roughly 3x faster). Have a look!

Note: I started with quite a tidyverse-heavy approach, with map_dfr building a tibble, then grouping and summarising to get the max per group, but I found there was significant overhead in that, and it was better to use the base R aggregate() function for the slowest part of the code; that's why it's there.

library(purrr) #used for iteration
library(tibble)
library(dplyr)
library(tictoc) # only used for benchmarking the time taken for the code to run


#devtools::install_github("gadenbuie/lorem")
# use lorem ipsum words for example text
set.seed(42)
ordered_words_of_text <- lorem::ipsum_words(3000,collapse = FALSE)


num_words <- length(ordered_words_of_text)
unique_words <- length(unique(ordered_words_of_text))

split_into <- 100

split_length <- num_words / split_into



word_positions_list <- map(
  unique(ordered_words_of_text),
  ~ which(ordered_words_of_text == .)
)

new_vec <- function(vlength, vtrue_pos_list) {
  nv <- vector(length = vlength)
  nv[vtrue_pos_list] <- TRUE
  nv
}

realised_word_pos_list <- map(
  word_positions_list,
  ~ new_vec(num_words, .)
)



tictoc::tic(msg = "start sequential algorithm")
avg_red_freq <- function(wordvec, num_words, split_length, split_into) {
  map_dbl(
    0:(split_length - 1),
    ~ aggregate(wordvec, by = list((ceiling((1:num_words + .) / split_length) %% split_into) + 1), max) %>%
      colSums() %>%
      tail(n = 1)
  ) %>% sum() / split_length
}

arf_vec <- map_dbl(
  realised_word_pos_list,
  ~ avg_red_freq(
    wordvec = .,
    num_words = num_words,
    split_length = split_length,
    split_into = split_into
  )
)

res1 <- tibble(
  word = unique(ordered_words_of_text),
  arf = arf_vec
) %>% arrange(desc(arf))

tictoc::toc()

## do it in parallel
library(doParallel)
library(foreach)

tictoc::tic(msg = "start 4-core parallel algorithm")
cl <- makeCluster(4)
registerDoParallel(cl)
par_res <- foreach(i = 1:length(realised_word_pos_list), .packages = "purrr") %dopar% {
  avg_red_freq(
    wordvec = realised_word_pos_list[[i]],
    num_words = num_words,
    split_length = split_length,
    split_into = split_into
  )
}

res2 <- tibble(
  word = unique(ordered_words_of_text),
  arf = par_res %>% unlist()
) %>% arrange(desc(arf))

tictoc::toc()

all.equal(res1,res2) # prove the parallel version gave same result, just faster
2 Likes

That's quite impressive!
The only thing is: split_into is supposed to be a different value for every PoS for which avg_red_freq is calculated.
Do I need to pass it in as a variable, or would that mess up the rest of the code (i.e. does the code, as written, rely on a fixed split_into value)?

Can you say more about this? It seems counterintuitive, assuming that you are analysing a single text and want comparative metrics for the words in it. Presumably, with different splits, the measures will lose comparative meaning?
How would you calculate the split_into to use per word?

1 Like

I am actually analysing multiple texts put together in a single corpus, where raw frequencies and simple adjusted frequencies can be misleading. The ARF should instead represent the frequency that a lemma+PoS would have if it were distributed homogeneously in the dataset. I know it's operationally convoluted, but this is the only way to calculate it. An example:

Our data frame has 327623 tokens.
The lemma 'house' (as a noun, hence the need for lemma-PoS, since it could also be a verb, for example) has a raw frequency of 946 in the whole data frame.
To calculate its ARF we first need to calculate all its reduced frequencies. To do that we split the data frame into 946 chunks, so that we get chunks of length 327623 / 946 (rounded up or down, obviously). Then I count how many chunks contain 'house'. It does not matter if a chunk contains more than one occurrence; it still counts as 1 (this is the very principle of reduced frequency: if the word were equally distributed in the data frame, then splitting the data frame into 946 chunks should theoretically yield exactly 1 occurrence per chunk, which is clearly almost never the case).
The count of chunks containing 'house' is only one reduced frequency. We need to calculate as many reduced frequencies for 'house' as there are rows in a chunk: every time we split the data frame into chunks, we start the split from token position i + 1, and each time the reduced frequency will likely differ. We then take the average of all reduced frequencies for 'house' and move on to the next lemma+PoS, repeating the process until all lemmas+PoS are done and we have the ARF for all of them.
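
To make this concrete, here is a tiny toy version of one reduced-frequency count for a single lemma-PoS (made-up tokens, not the real corpus; with the real data the number of chunks would be the raw frequency, e.g. 946 for 'house'):

tokens    <- c("house_N", "the_DT", "dog_N", "house_N", "the_DT",
               "cat_N", "run_V", "house_N", "the_DT", "sun_N")
raw_freq  <- sum(tokens == "house_N")               # 3, so split into 3 chunks
chunk_len <- ceiling(length(tokens) / raw_freq)     # 4 tokens per chunk (rounded up)
chunk_id  <- ceiling(seq_along(tokens) / chunk_len)

# reduced frequency at offset 0: the number of chunks containing the lemma-PoS
sum(vapply(split(tokens, chunk_id), function(ch) "house_N" %in% ch, logical(1)))
#> [1] 2

Note that the raw frequency is 3 but only 2 of the 3 chunks contain the word, which is exactly the information the reduced frequency captures.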

I currently have:

  • A dataset with 327623 rows, each corresponding to a token reduced to its lemma-PoS (e.g. playing > play-VERB, played > play-VERB)
  • A frequency list of lemma-PoS (col1 = the lemma-PoS, e.g. 'house-NOUN', col2 = its raw frequency in the dataset, e.g. 946).

My first code above was trying to:

  • Take the value (n) in row1, col2 of the frequency list -> split the dataset into n chunks (so each chunk has 347 tokens [rounded up from 346.32452431]).
  • Count the chunks containing the lemma-PoS in row1, col1, and store the value (reduced frequency) in a vector for later.
  • Split the data frame again into n chunks, this time starting from the second token in the data frame (i.e. in the first round, chunk 1 included tokens 1:347; now chunk 1 includes tokens 2:348, and token 1 is moved to the bottom of the data frame, into chunk 946; see the small rotation sketch after this list). Calculate the reduced frequency. Repeat 347 times.
  • ARF = mean of all 347 reduced frequencies.
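
In miniature, the rbind shuffling in my original code is just trying to rotate the token order by one position each round; something like this (placeholder letters rather than real tokens):

# rotating the token vector by an offset, instead of rebuilding the data frame
tokens    <- letters[1:10]   # placeholder token vector
chunk_len <- 4

rotate <- function(x, offset) {
  if (offset == 0) return(x)
  c(x[(offset + 1):length(x)], x[1:offset])
}

rotate(tokens, 1)
#> [1] "b" "c" "d" "e" "f" "g" "h" "i" "j" "a"

# the chunk ids stay the same for every offset
ceiling(seq_along(tokens) / chunk_len)
#> [1] 1 1 1 1 2 2 2 2 3 3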

Let me know if there is anything more I could explain.

The code I gave you at the beginning works, but it takes circa 1 minute to calculate 1 ARF, so calculating the ARF of all lemma-PoS would take roughly 12 days.

Maybe you could run the code you began with against my example data, to compare the relative speed of calculation for yourself, and to check whether the metrics agree.
The issue I have is that the decision to divide the total text into n chunks, with n based on the raw frequency, seems arbitrary and not justified by any argument or theory that you have presented... can you fill in this gap for me?
My feeling is that, for a given text (your corpus), you can only get useful information about any two words' respective ARFs if they are calculated in a like-for-like way, and the differing chunk choice would seem to invalidate that. How confident are you that chunking by the raw frequency is merited? The initial document specifying the ARF calculation that you linked to does not hint at this, from my reading of it.
Sorry for any misunderstanding.

Of course, I understand your scepticism! I can point to the relevant literature arguing for this approach:

  1. http://uivty.cs.cas.cz/~savicky/papers/commonness.pdf
  2. http://lrec.elra.info/proceedings/lrec2006/pdf/11_pdf.pdf

See in particular 1), p. 4, where the authors explain why ARF is used (they surely do a much better job of explaining the principle than I could!).

I hope this helps!