Hi Nate! I don't have a real answer for you, but maybe I can offer some additional thoughts.
I'm not too surprised by this. If the data was all in one place before and you were already running in parallel, then sending the data up to the cloud could be a pretty expensive operation. How much data are we talking about here? And did you run it on 1 GCP instance, or across multiple (from your image I'd guess 1)? Did you have to send the data up to GCP from your local R session, or was there some kind of upload process you did in advance? It's really hard to debug without details of the exact setup, since there are so many ways to do this.
We also tried adding more RAM to the same VM (up to 256 GB from 128 GB) and the model slowed down from 5 hours to 7 hours.
I am a bit surprised by this, but I don't have any good recommendations for you at this point. Unless you were hitting RAM limits, I don't think increasing the RAM would have done much of anything. The main benefit you were getting from your in-house VM was being able to shard across 12 cores. I'm surprised it slowed down so much.
The modeling code went from 5 hours to 14 hours. It was 14 hours on both the same 128 GB VM and a different in-house 256 GB VM.
Is there any way that VM was also working on something else? Was this the only task it was working on? That's a pretty big difference.
Honestly, I think the only advice I can give you for now is to try to reproduce this with future.apply::future_lapply() or with just pure futures from the future package. Right now it's hard to know where the problem is. It could be furrr, but more likely something is happening in the underlying infrastructure.
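To make that concrete, here is a rough sketch of what such a reproduction could look like. I don't know what your modeling code looks like, so `fit_one()` is a made-up placeholder, and the worker count and input size are arbitrary. Swap in your real function and data, and time the same work through furrr as well to compare.

```r
library(future)
library(future.apply)

# Hypothetical stand-in for the real model-fitting step (an assumption,
# not the original code). Replace with the actual per-shard work.
fit_one <- function(i) {
  x <- rnorm(1e5)
  mean(x) + i
}

# Arbitrary worker count, chosen only for illustration.
plan(multisession, workers = 2)

# Variant 1: future.apply instead of furrr.
t_future_apply <- system.time(
  res_future_apply <- future_lapply(1:8, fit_one, future.seed = TRUE)
)

# Variant 2: plain futures, with no map-style wrapper at all.
t_pure <- system.time({
  fs <- lapply(1:8, function(i) future(fit_one(i), seed = TRUE))
  res_pure <- lapply(fs, value)
})

# Reset to sequential processing and shut down the background workers.
plan(sequential)

print(t_future_apply[["elapsed"]])
print(t_pure[["elapsed"]])
```

If both variants are just as slow as the furrr version on the problem machine, that would point at the infrastructure rather than at furrr.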
The only other thing I see is that some of those configurations ran with multisession, and some ran with multicore. You might try to run the 17-hour one on multisession and see if you get different results. There is a decent chance that could have an effect.
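Something like the following could be used to compare the two backends on the same machine. This is only a sketch: `toy_task()` is a hypothetical placeholder for the real work and the worker count is arbitrary. Note that multicore is not available on Windows and is disabled by default inside RStudio, hence the `supportsMulticore()` check.

```r
library(future)
library(future.apply)

# Hypothetical placeholder workload (an assumption); use the real task here.
toy_task <- function(i) {
  Sys.sleep(0.1)
  i^2
}

# Time the work under multisession (separate background R sessions).
plan(multisession, workers = 2)
t_multisession <- system.time(
  res_multisession <- future_lapply(1:8, toy_task)
)[["elapsed"]]

# Time the same work under multicore (forked processes), where supported.
t_multicore <- NA_real_
if (supportsMulticore()) {
  plan(multicore, workers = 2)
  t_multicore <- system.time(
    res_multicore <- future_lapply(1:8, toy_task)
  )[["elapsed"]]
}

# Reset to sequential processing.
plan(sequential)

print(c(multisession = t_multisession, multicore = t_multicore))
```

With a toy task like this the two timings will be close. The difference should only show up with the real globals and data, since that is where export overhead and copying come into play.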
Here is some information on multicore (forking) vs multisession. You might be in a situation where you are making repeated copies of the shared memory, and that is what is taking so long? Not sure.
Forking an R process can be faster than working with a separate R session running in the background. One reason is that the overhead of exporting large globals to the background session can be greater than when forking, and therefore shared memory, is used. On the other hand, the shared memory is read only, meaning any modifications to shared objects by one of the forked processes (“workers”) will cause a copy by the operating system. This can also happen when the R garbage collector runs in one of the forked processes. - https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html