Comparing distributions in R

Good afternoon. Could you please tell me whether R has a function (a test criterion) for comparing the empirical distribution of data with a theoretical (continuous) distribution? Or is it necessary to write such a function yourself?

Hi,

The solution depends on what data you have. Are you comparing numeric values on a continuous scale to a distribution that can be represented by a function, or are you looking at categorical data with discrete values?

PJ

The empirical dataset is comprised of numeric values from a continuous distribution.

Try looking at:

?qqplot

I have been looking for solutions for quite a while, but everything I find and know is built into linear model functions, where you fit a model to data and then test the goodness of fit (with an F-statistic or chi-squared, for example). I have been trying to find a way of providing your own function instead of fitting a model to the data, but I can only find the theory (the mathematics) of doing this and no ready-made functions.

qqplot is also something I read about, but I don't know how to use it with your own distribution provided as a function in R, rather than a distribution fitted to your test data.

OP was looking for the above, which can be achieved like so:

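s <- rnorm(100)                           # sample data (stand-in for the empirical dataset)
p <- seq(0.001, 0.999, length.out = 100)  # probabilities at which quantiles are compared
x1 <- quantile(x = s, probs = p)          # empirical quantiles of the sample
x2 <- qnorm(p = p)                        # theoretical quantiles: standard normal
x3 <- qpois(p, lambda = 2)                # theoretical quantiles: Poisson(2), for contrast
plot(x1, x2)                              # roughly a straight line: the normal fits
plot(x1, x3)                              # stepped pattern: the Poisson does not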

Hope it helps :)
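For the question about using qqplot with your own distribution supplied as a function, here is a minimal sketch rather than a definitive recipe. It assumes the theoretical distribution is available as a quantile function; `my_q` is a hypothetical name, and the exponential distribution with rate 2 is only a stand-in for whatever distribution you have in mind.

```r
set.seed(1)                       # only to make the sketch reproducible
s <- rexp(200, rate = 2)          # stand-in for the empirical data

# Hypothetical user-supplied quantile function of the theoretical distribution
my_q <- function(p) qexp(p, rate = 2)

p <- ppoints(length(s))           # plotting positions strictly inside (0, 1)

# qqplot() sorts both inputs and plots one against the other
qq <- qqplot(x = my_q(p), y = s,
             xlab = "Theoretical quantiles",
             ylab = "Sample quantiles")
abline(a = 0, b = 1, lty = 2)     # points near this line suggest a reasonable fit
```

Using `ppoints()` instead of a hand-made `seq()` avoids the probabilities 0 and 1, where many quantile functions return infinite values.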

Thanks for helping. Nevertheless, I'm more interested in parametric (statistical) tests that give an empirical value of the criterion (a test statistic), which can then be compared with a theoretical (critical) value of the criterion for a predefined probability and degrees of freedom. After comparing the two values, it should be possible to make an unambiguous decision on whether the chosen theoretical distribution fits the empirical data or not.

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/ks.test.html
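As a rough sketch of how the linked `ks.test` can be used here: the second argument can be either the name of a cumulative distribution function or a CDF given as a function, with its parameters passed as further arguments. The sample, the parameter values, and the name `my_cdf` below are assumptions for illustration only.

```r
set.seed(42)                          # only to make the sketch reproducible
s <- rnorm(100, mean = 5, sd = 2)     # stand-in for the empirical data

# Theoretical distribution given by name, with its parameters fixed in advance
res <- ks.test(s, "pnorm", mean = 5, sd = 2)
res$statistic                         # D: largest gap between empirical and theoretical CDF
res$p.value                           # a small p-value is evidence against the fit

# Theoretical distribution given as your own CDF function (hypothetical name);
# an exponential CDF is deliberately a poor match for this sample
my_cdf <- function(q) pexp(q, rate = 1)
res_bad <- ks.test(s, my_cdf)
res_bad$p.value
```

Note that the p-value is only valid when the parameters of the theoretical distribution are specified beforehand and not estimated from the same sample.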

No, and it never will. You can, for example, assume normality, but it remains an assumption. Normality tests exist, but they are sensitive to the number of observations and to outliers.
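To illustrate the sensitivity to the number of observations, here is a small sketch using `shapiro.test` from base R. The choice of a Student t distribution with 5 degrees of freedom as "nearly normal" data, and the two sample sizes, are assumptions made purely for the demonstration.

```r
set.seed(123)                  # only to make the sketch reproducible

# Same mildly heavy-tailed distribution, two very different sample sizes
small <- rt(20, df = 5)
large <- rt(5000, df = 5)      # shapiro.test() accepts at most 5000 values

p_small <- shapiro.test(small)$p.value
p_large <- shapiro.test(large)$p.value

# With few observations the test has little power to detect the deviation;
# with many observations even a modest deviation is flagged
c(n20 = p_small, n5000 = p_large)
```

The data-generating distribution is identical in both cases, so any difference in the verdict comes from the sample size rather than from the data being "more" or "less" normal.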

Look at this, for example:

That doesn't look normal - right?

Well, it is - I created it using this code:

library("tidyverse")
n <- 50
d <- tibble(
  s = rnorm(n = n,
            mean = 0,
            sd = 1),
  p = seq(from = 0.001,
          to = 0.999,
          length.out = n),
  x = qnorm(p = p,
            mean = 0,
            sd = 1),
  y = quantile(x = s, probs = p)
)

pl <- d %>% 
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_abline(slope = 1,
              intercept = 0,
              linetype = "dashed") +
  theme_minimal()

print(pl)