Map over a list that gets updated?

Hi RStudio Community,

I do not know if what I want to do is possible without using loops: suppose I have a vector of length 1000. I want to take head(vector, 5) and store the result in the first element of a list, list_result. Then I want to update (or replace) my starting vector with tail(vector, -5) and repeat the step described before, this time storing the result in the second element of the list. This means that at the end, list_result will be a list of 200 elements, each containing a vector of length 5. I don't think there's a way of doing that with the tools purrr provides, since the input needs to be modified at each iteration.

EDIT: I'm thinking of something like a map() function that would take a list and a function, where the function returns a two-element list: the computation you're interested in (in this case the result of head) as the first element, and the updated input (in this case the result of tail) as the second. It would keep doing that until the input is empty and finally return a list containing only the results of the computations. Does that make sense?
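
Something like this hypothetical helper is what I have in mind (the name map_unfold and its interface are entirely made up, just to illustrate; it's written with a loop here, but it's the interface I'm after):

map_unfold <- function(x, step) {
  out <- list()
  while (length(x) > 0) {
    step_result <- step(x)                 # step() returns list(result, updated_input)
    out <- c(out, list(step_result[[1]]))  # keep the computation
    x <- step_result[[2]]                  # carry on with the updated input
  }
  out
}

# for the head()/tail() example:
list_result <- map_unfold(1:1000, function(v) list(head(v, 5), tail(v, -5)))
length(list_result)  # 200
list_result[[1]]     # 1 2 3 4 5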

You could do the following, but it amounts to an obscured for loop and, owing to the constant replacements, is slower than death.


l1 <- vector("list", 1000)
l2 <- vector("list", length(l1) / 5)

library(purrr)
walk(seq_along(l2),  # walk(), not map(): we only care about the side effects
     function(i){
       print(i)
       l2[[i]] <<- head(l1, 5)  # store the current first five elements of l1
       l1 <<- tail(l1, -5)      # drop them from l1 itself before the next pass
     })

A better option would be to leave l1 untouched, and you could approach it with this:

l1 <- vector("list", 1000)

library(purrr)
l2 <- map(seq_len(length(l1) / 5),
          function(i){
            l1[1:5 + (i - 1) * 5]  # elements 1-5, then 6-10, and so on
          })

But if I'm completely honest, this is the type of situation where I'm not sure this is any clearer than a for loop (though the for loop is about 15 times slower than map).

l1 <- vector("list", 1000)
l2 <- vector("list", 200)
for (i in seq_along(l2)){
  l2[[i]] <- l1[1:5 + (i-1) * 5]
}
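
If you want to check that timing claim, something along these lines would do it (microbenchmark is just the package I'd reach for; bench would work as well):

library(purrr)
library(microbenchmark)

l1 <- vector("list", 1000)
microbenchmark(
  purrr_map = map(seq_len(length(l1) / 5),
                  function(i) l1[1:5 + (i - 1) * 5]),
  for_loop = {
    l2 <- vector("list", 200)
    for (i in seq_along(l2)) l2[[i]] <- l1[1:5 + (i - 1) * 5]
  }
)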

Thanks for the suggestions, I didn't think of using the index i as the argument of an anonymous function!

I also thought of using a recursive function:

extract_head = function(vec1, vec2 = NULL){
  
  if(is.null(vec2)){
    vec2 = list(head(vec1, 5))
    vec1 = tail(vec1, -5)
  }
  
  head1 = head(vec1, 5)
  vec2 = list(vec2, head1)  # nests vec2 inside a new list rather than appending to it
  vec1 = tail(vec1, -5)
  
  if(length(vec1) != 0){
    extract_head(vec1 = vec1, vec2 = vec2)
  } else {
  vec2
  }
}

extract_head(vec1 = 1:100)

but I know recursive functions are not fast in R, since it doesn't do tail-call optimization (plus the function above does not work as expected :confused: ; what looks like the call stack is really list(vec2, head1) nesting the accumulated results one level deeper at each call instead of appending to them).

EDIT: When running your examples I only get a list of NULLs returned :thinking:

A slight modification of my function returns what I expected:

extract_head = function(vec1, vec2 = NULL){
  
  if(is.null(vec2)){
    vec2 = list(head(vec1, 5))
    vec1 = tail(vec1, -5)
  }
  
  head1 = head(vec1, 5)
  vec2 = append(vec2, list(head1))  # append (rather than prepend) keeps the chunks in their original order
  vec1 = tail(vec1, -5)
  
  if(length(vec1) != 0){
    extract_head(vec1 = vec1, vec2 = vec2)
  } else {
  vec2
  }
}

extract_head(vec1 = 1:100)

You should have gotten a list of NULLs, because that was all I ever put in the lists. I was only attempting to show that I could get a list with sublists of the correct length. You could replace l1 with

l1 <- map(1:1000, function(i) 1:5 + (i-1) * 5)

That should give you a clear visual of the output.
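
With that l1, each element of l2 from my second snippet becomes a list of five consecutive little vectors:

library(purrr)
l1 <- map(1:1000, function(i) 1:5 + (i - 1) * 5)
l2 <- map(seq_len(length(l1) / 5), function(i) l1[1:5 + (i - 1) * 5])
l2[[1]][[1]]  # 1 2 3 4 5
l2[[1]][[5]]  # 21 22 23 24 25
l2[[2]][[1]]  # 26 27 28 29 30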

Oops, you're right, thanks! I'll think about how to use your solutions for my problem.

Can you share more about what the actual data and the actual computations would look like?

I read in some text with readLines(), and every block of 5 lines was a separate paragraph. I wanted a list where each element was 5 lines of text (one paragraph). The function I shared above did that, but it is not the most efficient way of doing it, since recursive functions in R are quite slow (my text file was rather small, though, so it turned out OK). I haven't tried nutterb's solutions yet, but I'm sure they would be more efficient and the way to go with larger text files.

Then maybe you'll find this sort of approach to be simpler:

x <- c(
  "Lorem ipsum dolor sit amet,",
  "consectetur adipiscing elit.",
  "Fusce nec quam ut tortor",
  "interdum pulvinar id vitae magna.", 
  "Curabitur commodo consequat arcu et lacinia.", 
  "Proin at diam vitae lectus",
  "dignissim auctor nec dictum lectus.",
  "Fusce venenatis eros congue velit feugiat,", 
  "ac aliquam ipsum gravida.",
  "Cras bibendum malesuada est in tempus.",
  "Suspendisse tincidunt, nisi non",
  "finibus consequat, ex nisl",
  "condimentum orci, et dignissim",
  "neque est vitae nulla."
)
split(x, rep(seq_along(x), each = 5, length.out = length(x)))

# $`1`
# [1] "Lorem ipsum dolor sit amet,"                  "consectetur adipiscing elit."                
# [3] "Fusce nec quam ut tortor"                     "interdum pulvinar id vitae magna."           
# [5] "Curabitur commodo consequat arcu et lacinia."
# 
# $`2`
# [1] "Proin at diam vitae lectus"                 "dignissim auctor nec dictum lectus."       
# [3] "Fusce venenatis eros congue velit feugiat," "ac aliquam ipsum gravida."                 
# [5] "Cras bibendum malesuada est in tempus."    
# 
# $`3`
# [1] "Suspendisse tincidunt, nisi non" "finibus consequat, ex nisl"     
# [3] "condimentum orci, et dignissim"  "neque est vitae nulla."    
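
The grouping index passed to split() does all the work there: rep() repeats each index five times and length.out truncates the result to the number of lines, so for the 14 lines above it comes out as

rep(seq_along(x), each = 5, length.out = length(x))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3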

I had assumed (and probably so had Nutterb) that you needed to do something that involved actually updating the data, but it isn't the case here :slight_smile:

Thank you very much, that is indeed much simpler! I was quite certain that there had to be something much easier. However, while solving my issue, I thought about this question and that's why I asked it in more general terms.

This remains an interesting question indeed. Recursion is risky, and I'd go for a simple while loop instead, especially if the number of hypothetical updates cannot easily be known in advance:

res <- list()
while (length(x)) {
  res <- c(res, list(head(x, 5)))  # append the current first five lines as one chunk
  x <- tail(x, -5)                 # then drop them from x
}
res

# [[1]]
# [1] "Lorem ipsum dolor sit amet,"                  "consectetur adipiscing elit."                
# [3] "Fusce nec quam ut tortor"                     "interdum pulvinar id vitae magna."           
# [5] "Curabitur commodo consequat arcu et lacinia."
# 
# [[2]]
# [1] "Proin at diam vitae lectus"                 "dignissim auctor nec dictum lectus."       
# [3] "Fusce venenatis eros congue velit feugiat," "ac aliquam ipsum gravida."                 
# [5] "Cras bibendum malesuada est in tempus."    
# 
# [[3]]
# [1] "Suspendisse tincidunt, nisi non" "finibus consequat, ex nisl"     
# [3] "condimentum orci, et dignissim"  "neque est vitae nulla."  

That is also a very nice solution! I guess in terms of performance, this is the best you could get in pure R?

It really depends on where the performance bottleneck is. It could be reading the file (in which case maybe data.table::fread could help, even though it's meant to read rectangular datasets). If it's really critical, I'd even pre-process the file with a command-line utility like $ split -l 5 bigfile.txt, then parallelize reading into R with something like mclapply(dir(), read_lines), which removes the need to split in R at all... Your mileage may greatly vary.
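
A rough sketch of that last idea, assuming the file was pre-split with split -l 5 bigfile.txt chunk_ so that each chunk_* file holds exactly one paragraph (mclapply() relies on forking, so it only parallelizes on Linux/macOS):

library(parallel)
library(readr)

paragraph_files <- list.files(pattern = "^chunk_")  # one file per paragraph
paragraphs <- mclapply(paragraph_files, read_lines,
                       mc.cores = 4)                # a list of 5-line character vectors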