Statistical forecasting solution

As a newcomer to R and analytics, I'm searching for the following solutions/formulas/models:

I have a number of rows like this:

   date, value
1: date1, 0
2: date2, 1
3: date3, 5
4: date4, 2
5: date5, 2
6: date6, 0

Value can be from 0 to 5.

The frequency (over the current 231 rows) of each value is:

0: 47 = 20.35 %
1: 93 = 40.26 %
2: 67 = 29.00 %
3: 22 = 9.52 %
4: 2 = 0.87 %
5: 0 = 0.00 %
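
A minimal sketch of how such a frequency table can be computed in R; `values` here is just a placeholder toy vector standing in for the real 0-5 column:

# Sketch: frequency table with percentages for a 0-5 column
# (`values` is a placeholder -- replace it with the real data column)
values <- c(0, 1, 5, 2, 2, 0)                  # toy example
counts <- table(factor(values, levels = 0:5))  # counts per value, including unseen ones
round(100 * prop.table(counts), 2)             # percentage share of each value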

Are there any statistical functions (and ways to implement them in R) that can say "it's time for the next 0"?

E.g. "0" hasn't come up for 9 laps, so it "must" come up "now/soon".

Once, 3 zeroes came within 3 laps, and 3 times 2 zeroes came within 2 laps.
The longest stretch without a "0" occurred 3-4 times, lasting 9 to 11 laps.
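
For the "laps since the last 0" and the lengths of the zero-free stretches, a minimal descriptive sketch (again with a placeholder toy vector `values`):

# Sketch: laps since the most recent 0, and lengths of the zero-free stretches
values <- c(0, 1, 5, 2, 2, 0, 3, 1, 1, 0, 2, 4)              # toy example
laps_since_zero <- length(values) - max(which(values == 0))  # laps since last 0
zero_rle <- rle(values == 0)                                 # run-length encode "is zero"
gaps <- zero_rle$lengths[!zero_rle$values]                   # lengths of non-zero stretches
laps_since_zero
table(gaps)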

I want to use the same logic for the other values (1, 2, 3) too - maybe combined.

In a second, more advanced model, the report should also be able to say that, e.g., "0" and "2" are recommended for the next lap.
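
If one still wants a report that flags which values are "overdue" (keeping in mind the caveat in the reply below that, for independent draws, this has no predictive power), the same gap idea extends to all values at once. A rough sketch:

# Sketch: laps since each value 0-5 last appeared; larger = "more overdue"
# (descriptive only -- for independent draws it says nothing about the next lap)
values <- c(0, 1, 5, 2, 2, 0, 3, 1, 1, 0, 2, 4)       # same toy vector as above
laps_since <- sapply(0:5, function(v) {
  idx <- which(values == v)
  if (length(idx) == 0) NA_integer_ else length(values) - max(idx)
})
names(laps_since) <- 0:5
sort(laps_since, decreasing = TRUE)[1:2]              # e.g. the two "most overdue" values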

Any ideas? - Thanks a lot

library(tidyverse)
library(tibble)
# Some random data
# make reproducible
set.seed(42)
my_data1 <- sample(0:5,231, replace = TRUE)
#convert to tibble
my_data1 <- enframe(my_data1)
my_data1
# Visualize it

ggplot(my_data1, aes(value)) + geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


# repeat with different seed

set.seed(137)
my_data2 <- sample(0:5,231, replace = TRUE)
#convert to tibble
my_data2 <- enframe(my_data2)
my_data2
#> # A tibble: 231 x 2
#>     name value
#>    <int> <int>
#>  1     1     2
#>  2     2     1
#>  3     3     4
#>  4     4     5
#>  5     5     2
#>  6     6     4
#>  7     7     2
#>  8     8     2
#>  9     9     3
#> 10    10     1
#> # … with 221 more rows

# Visualize it

ggplot(my_data2, aes(value)) + geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


# Is that normal?
shapiro.test(my_data1$value)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  my_data1$value
#> W = 0.89123, p-value = 7.393e-12
shapiro.test(my_data2$value)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  my_data2$value
#> W = 0.90426, p-value = 5.438e-11

# Compare the runs

run1 <- rle(my_data1$value)
summary(run1)
#>         Length Class  Mode   
#> lengths 186    -none- numeric
#> values  186    -none- numeric
run2 <- rle(my_data2$value)
summary(run2)
#>         Length Class  Mode   
#> lengths 198    -none- numeric
#> values  198    -none- numeric

# roughish measure of correlation
cor(run1$lengths,run2$lengths[1:186])
#> [1] 0.1323134

Created on 2020-01-08 by the reprex package (v0.3.0)

The short answer is that it depends on the process that generates your data, and that process is usually unknown. If the values are generated independently and at random (e.g. uniformly over 0-5, as in the simulated data above), then the probability p(x) of any particular value in the next row is the same regardless of what appeared in the previous row: 1/6 ≈ 0.1667.

Another way of saying this is that

The dice have no memory

This is the number one hardest concept for a newcomer to statistics to internalize.
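
One way to see this empirically: simulate a long independent 0-5 sequence and look at the distribution of the next value conditional on the current one -- every conditional proportion hovers around 1/6. A quick sketch (seed and length are arbitrary):

# Sketch: for an independent uniform 0-5 sequence, the next value does not
# depend on the current one; each conditional proportion is roughly 1/6
set.seed(1)
x <- sample(0:5, 1e5, replace = TRUE)
round(prop.table(table(current = head(x, -1), nxt = tail(x, -1)), margin = 1), 3)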

Thanks a lot - I will dig into your reply over the next few days to understand it. :slight_smile:
