RAM, parallelization, and speed

We’ve been using future_map (from the furrr package) to run some modeling code in parallel. Our timing runs have produced some odd results (details below). We’re not sure what to try next at this point, so I’d welcome suggestions.

Initially the modeling code ran in 5 hours on an in-house VM with 128 GB of RAM and 12 cores. We then tried the same data and code on Google Cloud Platform with a variety of more powerful setups (both additional cores and additional RAM), and every single run took longer. We also tried adding more RAM to the same VM (from 128 GB up to 256 GB) and the model slowed down from 5 hours to 7 hours. Mystified, we downgraded the VM back to 128 GB and the model went back to taking 5 hours.

A few weeks later our modeling slowed down drastically. Unable to identify a reason, we re-ran the same code and data we’d used for the earlier timing runs (we’d tweaked the modeling code a little and we get new data every month, so we wanted to eliminate those as possible reasons for the slowdown). The modeling code went from 5 hours to 14 hours, and it was 14 hours on both the same 128 GB VM and a different in-house 256 GB VM.

I’m wondering if anyone who knows more about hardware and/or more about future_map can give us some suggestions on what to try next. We’re mystified both by the initial finding that more RAM slowed things down and by the sudden slowdown of the same script on the same machine.

Timing from the initial test runs (shared as a screenshot in the original post):

These are surprising results, for sure. I can't think of a reason why increasing the size of the machine would lead to worse results. In general, though, parallelization can lead to slower results if you have lots and lots of small operations, since you then need a lot of extra bookkeeping to keep everything on track. I had a similar situation happen to me once, where a sequential approach took less time than multisession with 12 cores.
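To make that concrete, here is a minimal sketch (purely hypothetical task sizes, nothing from your workload) where the parallel version often loses, because the export and scheduling overhead dwarfs the per-task work:

```r
library(future.apply)

# Many tiny tasks: each one is far cheaper than the parallel bookkeeping around it
x <- replicate(10000, rnorm(10), simplify = FALSE)

plan(sequential)
system.time(res_seq <- lapply(x, mean))

plan(multisession, workers = 12)
system.time(res_par <- future_lapply(x, mean))  # often slower: exporting + chunking overhead
```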

That is all to say that it's difficult to make any recommendation without understanding in detail what you are doing and how.


Thanks! I know it's hard to troubleshoot a problem this vague. Running it sequentially takes around 50 hours, so there are definitely gains to going parallel in this case.

We have 180 markets, each of which has between 50 and 300 products for which we're building price elasticity models. We run the markets in parallel and then, within each market, build a separate model for each product. Roughly, the structure looks like the sketch below.
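(This is a simplified sketch, not the actual code: markets stands in for a named list of per-market data frames, and fit_product_model for the real elasticity fit.)

```r
library(furrr)
plan(multisession, workers = 12)

# markets: placeholder for a named list of per-market data frames
# fit_product_model(): placeholder for the per-product price elasticity fit
results <- future_map(
  markets,
  function(market_df) {
    products <- split(market_df, market_df$product_id)
    purrr::map(products, fit_product_model)
  },
  .options = furrr_options(seed = TRUE)
)
```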

Hi Nate! I don't have a real answer for you, but maybe I can offer some additional thoughts.

I don't think I'm necessarily too surprised by this. If the data was all in one place before, and you were already running in parallel, then sending the data up to the cloud could be a pretty expensive operation. How much data are we talking here? And are you saying you ran it on 1 GCP instance, or across multiple (from your image I'd guess 1)? Did you have to send the data up to the GCP from your local R session, or was there some kind of upload process you did in advance? It's really hard to debug without details of the exact setup, since there are so many ways to do this.
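One quick way to put a number on the "how much data" question, assuming the modeling data lives in a single object (markets here is just a placeholder name):

```r
# Rough size of what would have to be shipped to remote workers
format(object.size(markets), units = "GB")
```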

We also tried adding more RAM to the same VM (up to 256 GB from 128 GB) and the model slowed down from 5 hours to 7 hours.

I think I am a bit surprised by this, but I don't have any good recommendations for you at this point. Unless you were hitting RAM limits, I don't think that increasing the RAM would have done much of anything. The main benefit you were getting from your in-house VM was being able to shard the work across 12 cores. I'm surprised it slowed down so much.

The modeling code went from 5 hours to 14 hours. It was 14 hours on both the same 128 GB VM and a different in-house 256 GB VM.

Is there any way that VM was also working on something else? Was this the only task it was working on? That's a pretty big difference.

Honestly, I think the only advice I can give you for now is to attempt to reproduce this with future.apply::future_lapply() or with just pure futures from the future package. Right now it's hard to know where the problem is. It could be furrr, but more likely something is happening in the underlying infrastructure.
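Something along these lines would let you compare the three layers directly (fit_market is a placeholder for whatever you run per market):

```r
library(future)
library(future.apply)
plan(multisession, workers = 12)

# fit_market(): placeholder for the per-market modeling step
t_furrr <- system.time(furrr::future_map(markets, fit_market))
t_apply <- system.time(future_lapply(markets, fit_market))
t_bare  <- system.time({
  fs <- lapply(markets, function(m) future(fit_market(m)))
  value(fs)  # resolve the whole list of futures
})
```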

The only other thing I see is that some of those configurations ran multisession, and some ran multicore. You might try to run the 17 hour one on multisession and see if you get different results. There is a decent chance that could have an effect.

Here is some information on multicore (forking) vs multisession. You might be in a situation where you are making repeated copies of the shared memory, and that is what is taking so long? Not sure.

Forking an R process can be faster than working with a separate R session running in the background. One reason is that the overhead of exporting large globals to the background session can be greater than when forking, and therefore shared memory, is used. On the other hand, the shared memory is read-only, meaning any modifications to shared objects by one of the forked processes (“workers”) will cause a copy by the operating system. This can also happen when the R garbage collector runs in one of the forked processes. - A Future for R: A Comprehensive Overview
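If you want to compare the two strategies directly, it's just a matter of changing the plan() call before the timing run (note that multicore forking is only available on Linux/macOS and is disabled inside RStudio by default):

```r
library(future)

# Run A: forked workers, copy-on-write shared memory (Linux/macOS only)
plan(multicore, workers = 12)
# ... timing run ...

# Run B: separate background R sessions; globals get serialized and exported to each worker
plan(multisession, workers = 12)
# ... timing run ...
```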


Hi Nate,

Although this doesn't address the delay caused by increasing the RAM on the 12-core VM, it would be helpful to review the lscpu output from each of those environments to compare the clock speeds. That may explain some of the performance differences; e.g. the 12-core server may be operating at 3.4 GHz while the 40/80/160-core instances run at a much lower clock speed.
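For example, from R you could capture the relevant lscpu fields on each machine (Linux only), plus a crude single-core benchmark as a sanity check:

```r
# CPU model and clock speed of the current machine (Linux only)
system("lscpu | grep -E 'Model name|MHz'")

# Crude single-core speed check to compare across environments
system.time(for (i in 1:5) qr(matrix(rnorm(1e6), 1000)))
```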
