How can I fit an S curve to data?

I have a tibble and I want to fit an S curve to it. How can I do it?

library(dplyr)  # for mutate() and row_number() below

tb <- tt <- tibble::tribble(
    ~ value,
    2,
    5,
    18,
    28,
    43,
    61,
    95,
    139,
    245,
    388,
    593,
    978,
    1501,
    2336,
    2922,
    3513,
    4747,
    5823,
    6566,
    7161,
    8042,
    9000,
    10075,
    11364,
    12729,
    13938,
    14991,
    16169
)

tb2 <- mutate(tb,
              rownum = row_number(),
              value2 = value/max(value))

# logistic fit to the rescaled values; glm() will warn about non-integer
# "successes", which is expected when fitting proportions this way
model <- glm(value2 ~ rownum, family = binomial(link = 'logit'), data = tb2)

summary(model)

fitted.results <- predict(model,
                          newdata=tb2,type='response')

tb3 <- cbind(tb2,fitted.results)

tb4 <- mutate(tb3,
              rescaled_prediction = fitted.results * max(value))

plot(tb4$rownum,tb4$value,type="l",col="blue")
lines(tb4$rownum,tb4$rescaled_prediction,col="red")

Thanks @nirgrahamuk.
I mean that the data set is just the first section of an S curve; how can I continue the line so it completes the S curve? (i.e. continue the blue line)

Thanks for the partial reprex; it's missing tt, however. The best way to share the data would be

dput(tb)

and cut and paste the output, so folks don't have to extract the ~value column manually.
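
For illustration (an assumption about the exact output, not something posted in the thread), dput() on a one-column tibble like this prints something roughly like

structure(list(value = c(2, 5, 18, 28, 43, 61, 95, 139, 245,
388, 593, 978, 1501, 2336, 2922, 3513, 4747, 5823, 6566, 7161,
8042, 9000, 10075, 11364, 12729, 13938, 14991, 16169)), row.names = c(NA,
-28L), class = c("tbl_df", "tbl", "data.frame"))

which can be pasted straight into a reprex.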

suppressPackageStartupMessages(library(ggplot2))
dat <- c(2,5,18,28,43,61,95,139,245,388,593,978,1501,2336,2922,3513,4747,5823,6566,7161,8042,9000,10075,11364,12729,13938,14991,16169)
pos <- seq_along(dat)
df <- data.frame(pos, dat)
p <- ggplot(df, aes(pos, dat))
p + geom_line()

Created on 2020-03-18 by the reprex package (v0.3.0)

gives a line plot of the data against sequence number.

Good on @nirgrahamuk for the log likelihood enhancement

Hi Saso, I'm not sure I understand your question.
The blue line is your actual data, and can't be extended in a non-trivial way per se. (You could record more samples? You could use the red line after the blue line stops?)
The red line is the fitted line, and so can be extended by giving it more row numbers... like this.


tb2 <- mutate(tb,
              rownum = row_number(),
              value2 = value/max(value))

model <- glm(value2 ~ rownum,family=binomial(link='logit'),data=tb2)

summary(model)

mtr <- max(tb2$rownum) + 1
# append 10 future row numbers (with NA values) so predict() can extrapolate
tb2_extended <- union_all(tb2,
                          data.frame(value=rep(NA_real_,10),
                                     rownum=mtr:(mtr+9),
                                     value2=rep(NA_real_,10)))
fitted.results <- predict(model,
                          newdata=tb2_extended,type='response')

tb3 <- cbind(tb2_extended,fitted.results)

tb4 <- mutate(tb3,
              rescaled_prediction = fitted.results * max(value,na.rm = TRUE))

plot(tb4$rownum,tb4$value,type="l",col="blue")
lines(tb4$rownum,tb4$rescaled_prediction,col="red")

It's true, I don't have any other data, but if we change the y limit to (0, 30000), can we continue the blue line as the same S curve?

Could you give some background to the context? What does the data represent, and what would the S curve then be used for?
It might be that understanding this informs a modelling decision about how best to proceed.

My data is confirmed coronavirus cases, and this data has an S shape. Now I want to extend the curve according to this data.

Local polynomial regression fitting appears to work well

suppressPackageStartupMessages(library(ggplot2))
dat <- c(2,5,18,28,43,61,95,139,245,388,593,978,1501,2336,2922,3513,4747,5823,6566,7161,8042,9000,10075,11364,12729,13938,14991,16169)
pos <- seq_along(dat)
df <- data.frame(pos, dat)
p <- ggplot(df, aes(pos, dat))
p + geom_line() + geom_point() + geom_smooth(method = "loess")
#> `geom_smooth()` using formula 'y ~ x'

Created on 2020-03-18 by the reprex package (v0.3.0)

You can also look at ARIMA models with the forecast package
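
A minimal sketch of that route, reusing the dat vector from above (auto.arima() picks the model order automatically):

library(forecast)

# fit an ARIMA model to the cumulative counts and project 14 steps ahead
fit <- auto.arima(ts(dat))
plot(forecast(fit, h = 14))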

Thanks @technocrat, but I want to continue the line beyond the data, not just fit a model to the existing data.

Continuing the line requires a choice of model to fit the existing data and project it as a function of time. What model will you use?
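
For example, one standard parametric choice for an S shape is a three-parameter logistic growth curve, which can be fit with nls() and the self-starting SSlogis() model. A rough sketch, reusing dat and pos from above (not code anyone posted here; with only the early part of the curve observed, the fit may fail to converge or give a very uncertain asymptote):

# sketch only: logistic growth curve via nls() with a self-starting model
fit_logis <- try(nls(dat ~ SSlogis(pos, Asym, xmid, scal)))

if (!inherits(fit_logis, "try-error")) {
  new_pos <- seq_len(length(dat) + 30)   # extend 30 steps past the observed data
  pred <- predict(fit_logis, newdata = data.frame(pos = new_pos))
  plot(pos, dat, type = "l", col = "blue",
       xlim = range(new_pos), ylim = range(c(dat, pred)))
  lines(new_pos, pred, col = "red")
}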

@technocrat I use a negative binomial model to fit the data, and I know that in the future my data will tend to an S shape.

OK, so have you tried @nirgrahamuk's suggestion of using the predict() function to extrapolate further data points?

Yes, @technocrat, but I use the glm.nb function to model the data, and also a spline function to get an S curve, and I don't get a good result. glm.nb only fits the model to the existing data, and the data after this point should turn into an S curve.

Can you reprex your code?

@technocrat how can I do that?

To do a reprex, please see the FAQ: What's a reproducible example (`reprex`) and how do I do one?

You need to provide the code you are using to predict the future values in the series.

I think that the data is insufficient to do more than simulate very vague predictions of possible S curves.
A lot more data would be required to pick curves of best fit.

What I'm about to present is probably flawed, and I'd appreciate help with corrections. I figure that the data presented is very early in any possible S curve, i.e. there's no sight yet of the inflection point, so a huge range of possible S curves could fit. It seems to me that the assumption that only the early portion of an S curve has been provided can be expressed by dividing the sample values by some factor representing how small a portion of the S curve has been seen so far. Then we can try to fit S curves to those early data points and see the sorts of curves that fit well.

library(tidyverse)
library(purrr) 
tb <- tibble::tribble(
  ~ value,
  2,
  5,
  18,
  28,
  43,
  61,
  95,
  139,
  245,
  388,
  593,
  978,
  1501,
  2336,
  2922,
  3513,
  4747,
  5823,
  6566,
  7161,
  8042,
  9000,
  10075,
  11364,
  12729,
  13938,
  14991,
  16169
) %>% mutate(
  rownum = row_number()
)


opt_func <- function(param_to_opt) {
  # rescale the observed values as if the eventual plateau were
  # param_to_opt * max(value), then fit a logistic curve to the rescaled data
  tb2 <- mutate(tb,
    value2 = value / (param_to_opt * max(value))
  )

  model <- glm(value2 ~ rownum, family = binomial(link = "logit"),weights=log(tb2$rownum), data = tb2)
  
 
  return( sqrt(mean(model$residuals^2)))
}

ser <- 1:1000 * 10000   # candidate scaling factors: assumed plateau = factor * max(value)
rmse <- map_dbl(ser, ~ opt_func(.))
min_index <- which(rmse == min(rmse))
ser[[min_index]]

plot(x = ser,y=rmse)


rmse[rmse> 0.570192 -0.000001 & rmse <0.570192 +0.000001]
#only 2 values
rmse[rmse> 0.570192 -0.000002 & rmse <0.570192 +0.000002]
#477 values
rmse[rmse> 0.570192 -0.0000015 & rmse <0.570192 +0.0000015]
#43 values, so will plot the 43 'most likely' s curves from the 1000 trialled.


parms_to_plot <- which(rmse> 0.570192 -0.0000015 & rmse <0.570192 +0.0000015)

extrude_length <- 150   # number of future time steps to project

new_prediction <- function(chosen_parm)
{
tb2 <- mutate(tb,
              rownum = row_number(),
              value2 = value / (chosen_parm * max(value))
)


model <- glm(value2 ~ rownum, family = binomial(link = "logit"),weights = log(tb2$rownum), data = tb2)
mtr <- max(tb2$rownum) + 1



tb2_extended <- union_all(tb2,
                          data.frame(value=rep(NA_real_,extrude_length),
                                     rownum=mtr:(mtr+extrude_length -1),
                                     value2=rep(NA_real_,extrude_length)))
fitted.results <- predict(model,
                          newdata=tb2_extended,type='response')

tb3 <- cbind(tb2_extended,fitted.results)

tb4 <- mutate(tb3,
              rescaled_prediction = fitted.results * chosen_parm *max(value,na.rm = TRUE))

select(tb4,
       rescaled_prediction)
}

extruded_plots <- map_dfc(ser[parms_to_plot],
        ~new_prediction(.))

mtr <- max(tb$rownum) + 1
tb2_extended <- union_all(tb,
                          data.frame(value=rep(NA_real_,extrude_length),
                                     rownum=mtr:(mtr+extrude_length -1)))
                                     
data_to_plot<-bind_cols(tb2_extended,extruded_plots)

library(plotly)

p<- plot_ly(data=data_to_plot,
            type="scatter",
            mode="lines",
            x=~rownum,
            y=~value,
            line= list(
              color="#0000FFFF"))

additional_line_names <- setdiff(names(data_to_plot),
                                 c("value","rownum")) 

add_next_trace <- function(name)
{
  p<<-p %>% add_trace(y=as.formula(paste0("~",name)),
               line= list(
                 color="#FF000044",
                 dash = 'dash')
             ) 
}
walk(additional_line_names,
     add_next_trace
     )


p_close <- p %>% layout(xaxis = list(range = c(1, 40)),
                        yaxis = list(range = c(0,60000)))
p_full <- p

In the above I plot() the root mean square error (rmse) of each trialled scaling factor.

I then try to find some number of values that are towards the minimum of that plot. I settle on bounds which represent the 43 smallest values, and I plot the corresponding curves.

There is a zoomed-in view, p_close, and a zoomed-out view, p_full.

As can be seen there, the implied final totals range from about 2% of the earth's population to over 100%, worrying :-/
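
As a rough sanity check (a sketch, not part of the original code), the plateau of each candidate curve is chosen_parm * max(value), so the implied final totals of the plotted curves can be computed and compared with a world population of roughly 7.8 billion:

# implied final totals (asymptotes) of the plotted candidate curves
range(ser[parms_to_plot]) * max(tb$value)
# compare against roughly 7.8e9 people worldwide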

I wouldn't trust these results to give any insight... but it was interesting to think about abstractly.

@nirgrahamuk Thank you for the virtuosity: it's exactly what is called for, in place of the convergence to a predetermined curve that was asked for. I learned a lot.