Seeking suggestions on nls parameters to improve fit

I have some data of GrowthRate modelled against Day:

exdata |> 
  sample_n(200000) |> 
  ggplot(aes(x = DAY, y = GrowthRate)) +
  geom_point(alpha = 0.01) +
  theme_minimal()

I fit a power model to this using nls():

mod.nls <- nls(GrowthRate ~ i + I(DAY^power), data = exdata, start = list(power = 1, i = 0))

The predictions look OK:

exdata$PredictionsNLS <- predict(mod.nls)

set.seed(123)
exdata |> 
  sample_n(200000) |> 
  ggplot(aes(x = DAY, y = GrowthRate)) +
  geom_point(alpha = 0.01, color = 'grey') +
  geom_line(aes(x = DAY, y = PredictionsNLS), color = 'steelblue') +
  theme_minimal()

At first glance this looked fine. But the problem shows up when I zoom in on the predictions between days 30 and 365:

set.seed(123)
exdata |> 
  sample_n(200000) |> 
  filter(DAY >= 30 & DAY <= 365) |> 
  ggplot(aes(x = DAY, y = GrowthRate)) +
  geom_point(alpha = 0.01, color = 'grey') +
  geom_line(aes(x = DAY, y = PredictionsNLS), color = 'steelblue') +
  theme_minimal()

My predicted line sits well above the actual values. Ideally it would cut right through the middle of the points. Are there any adjustments I could make to my fit to get a better line through this range of days?

Can you show us the estimated coefficients?

One possibility is that you want to have a multiplicative coefficient in front of the power term.

Also, are you missing an intercept?

Hi, here's the summary:

mod.nls |> summary()

Formula: GrowthRate ~ i + I(DAY^power)

Parameters:
        Estimate Std. Error t value Pr(>|t|)    
power -1.783e-02  3.964e-05  -449.8   <2e-16 ***
i      1.051e-01  2.130e-04   493.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03543 on 1064597 degrees of freedom

Number of iterations to convergence: 13 
Achieved convergence tolerance: 5.383e-06

Following your suggestion, I tried this:

mod.nls.c <- nls(GrowthRate ~ i + c * I(DAY^power), data = exdata, start = list(power = 1, i = 0, c = 1))

Which resulted in error:

Error in nls(GrowthRate ~ i + c * I(DAY^power), data = exdata, start = list(power = 1, :
number of iterations exceeded maximum of 50

Also tried increasing max iter:

mod.nls.c <- nls(GrowthRate ~ i + c * I(DAY^power), 
                 data = exdata, 
                 start = list(power = 1, i = 0, c = 1),
                 control = nls.control(maxiter = 100)
                 )
Error in nls(GrowthRate ~ i + c * I(DAY^power), data = exdata, start = list(power = 1,  : 
  step factor 0.000488281 reduced below 'minFactor' of 0.000976562

It's hard to see the issue clearly because GrowthRate spans orders of magnitude. I think it would be easier to work with log(GrowthRate). Then, plotting the prediction, it will be easier to see what's happening.

Just to check my understanding (sorry if the answer is obvious), what's the value of i? I assume it's not \sqrt{-1}.

nls returned it as just 1.051e-01. I'm not sure I understand! i is an intercept.

Hi, I tried this just now:

mod.nls.c <- nls(log(GrowthRate) ~ i + c * I(DAY^power), 
                 data = exdata, 
                 start = list(power = 1, i = 0, c = 1),
                 control = nls.control(maxiter = 200),
                 trace = T)

But got error:

Error in nls(log(GrowthRate) ~ i + c * I(DAY^power), data = exdata, start = list(power = 1, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562

:confused:

That's not the value of i, it's the coefficient on i. I assume that the value of i is 1.0?

You might try setting the starting value for the intercept to the mean of the data.
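Another way to make the fit less sensitive to starting values, sketched here on simulated data since the real data isn't posted: the model is linear in i and c once power is fixed, and nls() has algorithm = "plinear" for exactly that situation. Only power then needs a starting value. The data frame sim and the true values (1, 0.5, -0.6) below are made up for illustration, not taken from exdata.

```r
# Simulated stand-in for exdata (assumed values, not the real data)
set.seed(123)
sim <- data.frame(DAY = rep(1:365, each = 50))
sim$GrowthRate <- 1 + 0.5 * sim$DAY^(-0.6) + rnorm(nrow(sim), sd = 0.01)

# With algorithm = "plinear" the right-hand side is a matrix whose columns
# are multiplied by linear coefficients that nls() solves for directly.
# Column 1 plays the role of the intercept i, column 2 of c * DAY^power.
fit_pl <- nls(GrowthRate ~ cbind(1, DAY^power),
              data = sim,
              start = list(power = -0.5),   # only the nonlinear parameter
              algorithm = "plinear")

est <- coef(fit_pl)   # power first, then the linear coefficients
est
```

This only helps with convergence. If the power form itself is wrong for the data, the fitted line will still miss in the same places.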

I tried setting the starting value of i to mean(exdata$GrowthRate), which is 1.005258.

mod.nls.c <- nls(GrowthRate ~ i + c * I(DAY^power), 
                 data = exdata, 
                 start = list(power = 1, i = 1.005258, c = 1),
                 control = nls.control(maxiter = 200),
                 trace = T)

Same error message :frowning:

Error in nls(GrowthRate ~ i + c * I(DAY^power), data = exdata, start = list(power = 1, :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562

Is there any other info I could provide here? I'm really curious whether I can get a model with a coefficient to fit.

Looks to me like you're doing everything right. (How's that for unhelpful?)

Can you post a subset of your data for people to play with, using dput()?

Yes, let me do that. I'll need some time, so I'll post a sanitized subset of the data later. Thank you so much for your help thus far.

Experimenting with the start and control parameters is not the direction I'd go.

If you plot log(GrowthRate) vs. DAY, you'll immediately learn a lot about the shape of the relationship.

Hi Arthur, here's a plot of log(GrowthRate):

set.seed(123)
exdata |> 
  sample_n(200000) |> 
  ggplot(aes(x = DAY, y = log(GrowthRate))) +
  geom_point(alpha = 0.01) +
  theme_minimal()

I tried using log(GrowthRate) as a target variable but I got the same error. Any suggestions welcome!

Following up on @arthur.t's suggestion, if you don't have i in your model then the plot you just showed us should be roughly linear. Since that's wildly untrue, it suggests that the functional form in your model isn't right.
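To make the linearity check concrete: for a pure power law c * DAY^power it is log(GrowthRate) against log(DAY) that comes out as a straight line with slope equal to power, and adding an intercept breaks that. Here is a small noise-free sketch with made-up values (c = 0.5, power = -0.6, intercept 1), not the real data:

```r
DAY <- 1:365

pure   <- 0.5 * DAY^(-0.6)       # power law with no intercept
offset <- 1 + 0.5 * DAY^(-0.6)   # same curve, but levelling off at 1

# Straight-line fits on the log-log scale
fit_pure   <- lm(log(pure)   ~ log(DAY))
fit_offset <- lm(log(offset) ~ log(DAY))

slope_pure       <- unname(coef(fit_pure)[2])   # recovers the power
max_resid_pure   <- max(abs(resid(fit_pure)))   # numerically zero
max_resid_offset <- max(abs(resid(fit_offset))) # clearly not zero

c(slope_pure = slope_pure,
  max_resid_pure = max_resid_pure,
  max_resid_offset = max_resid_offset)
```

So a curved log-log plot doesn't by itself rule out a power term; it can also mean there's an additive constant that the log can't remove.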

Out of curiosity, what are you modelling the growth rate of?

I was able to get nls with a coefficient per your initial suggestion to complete with these parameters:

mod.nls.c <- nls(GrowthRate ~ i + c * I(DAY^power), 
                 data = exdata, 
                 start = list(power = -0.01, i = 1.005258, c = 2),
                 control = nls.control(maxiter = 200))


mod.nls.c |> summary()

Formula: GrowthRate ~ i + c * I(DAY^power)

Parameters:
        Estimate Std. Error t value Pr(>|t|)    
power -6.285e-01  9.733e-04  -645.7   <2e-16 ***
i      9.898e-01  6.423e-05 15410.8   <2e-16 ***
c      4.587e-01  7.141e-04   642.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03025 on 1064596 degrees of freedom

Number of iterations to convergence: 25 
Achieved convergence tolerance: 7.798e-06

But the same issue seems to persist, here's the plot and zoomed in plot:

exdata$PredictionsNLSC <- predict(mod.nls.c)
set.seed(123)
exdata |> 
  sample_n(200000) |> 
  ggplot(aes(x = DAY, y = GrowthRate)) +
  geom_point(alpha = 0.01, color = 'grey') +
  geom_line(aes(x = DAY, y = PredictionsNLSC), color = 'steelblue') +
  theme_minimal()

And zoomed in between days 30 and 365:

set.seed(123)
exdata |> 
  sample_n(200000) |> 
  filter(DAY >= 30 & DAY <= 365) |> 
  ggplot(aes(x = DAY, y = GrowthRate)) +
  geom_point(alpha = 0.01, color = 'grey') +
  geom_line(aes(x = DAY, y = PredictionsNLSC), color = 'steelblue') +
  theme_minimal()

So adding a coefficient didn't fix it. Is there anything else I could try? I'll share some data later too.

This is day to day revenue growth for an app.

I think I'm starting to see. Growth isn't even over time, which is probably not surprising for an app. You might think about a more flexible functional form with respect to DAY. Exactly how to do that depends somewhat on the purpose of the estimate.
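One possible version of "more flexible", as a rough sketch rather than a recommendation: a natural cubic spline in log(DAY), fitted with plain lm(). The simulated curve below is an assumption chosen so that it is not a power law, and df = 6 is an arbitrary choice you would want to tune.

```r
library(splines)  # ships with R

# Simulated stand-in for exdata: fast early decay plus a slow tail
set.seed(7)
sim <- data.frame(DAY = rep(1:365, each = 20))
truth <- 1 + 0.4 * exp(-sim$DAY / 10) + 0.05 * exp(-sim$DAY / 150)
sim$GrowthRate <- truth + rnorm(nrow(sim), sd = 0.01)

# Natural spline basis in log(DAY); this is linear in its coefficients,
# so there are no starting values and no convergence failures
fit_ns <- lm(GrowthRate ~ ns(log(DAY), df = 6), data = sim)
sim$PredictionsNS <- predict(fit_ns)

# Average residual in the range that was giving trouble
in_range <- sim$DAY >= 30 & sim$DAY <= 365
mean_resid <- mean(sim$GrowthRate[in_range] - sim$PredictionsNS[in_range])
mean_resid
```

The trade-off is interpretability: you lose the single power parameter, so this fits better for prediction than for describing the decay with one number.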

There might be something odd going on here: the growth rate seems to level off at 1 instead of 0, and that's why the log doesn't accomplish what it's supposed to. Perhaps the growth rate needs to be defined differently, so that zero growth gives a value of 0 rather than 1.
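A minimal sketch of that redefinition, assuming the rate really does level off at exactly 1: subtract 1 to get an "excess" growth rate, average by day so the noise doesn't push it below zero, and then check the log-log relationship. The column name Excess and the simulated values are hypothetical.

```r
# Simulated stand-in for exdata (assumed values)
set.seed(42)
sim <- data.frame(DAY = rep(1:365, each = 200))
sim$GrowthRate <- 1 + 0.5 * sim$DAY^(-0.6) + rnorm(nrow(sim), sd = 0.02)

# Daily mean growth rate, then redefine so that "no growth" is 0
daily <- aggregate(GrowthRate ~ DAY, data = sim, FUN = mean)
daily$Excess <- daily$GrowthRate - 1

# log() needs positive values; dropping non-positive days is a shortcut
# that would bias the fit if many days were affected
ok <- daily$Excess > 0

# If the excess follows c * DAY^power, this is a straight line with
# slope = power and intercept = log(c)
fit_excess <- lm(log(Excess) ~ log(DAY), data = daily[ok, ])
slope <- unname(coef(fit_excess)[2])
slope
```

If the real data levels off at something other than 1 (the nls fit above put i near 0.99), the subtraction would need that value instead, which is where estimating i inside the model comes back in.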

Another desperate option is to keep taking logs of logs until it's approximately linear! :innocent: