Extract a sample with normal distribution from a dataset

Hi,

the following code generates 50 random numbers drawn from a fixed set:

a<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
sample(a, 50, replace = TRUE)``

Question: how should I modify the code so that the extraction takes place with normal distribution with mean = 9 and standard deviation = 3?

Thanks

M.

Hello,

So it is not entirely clear what you want to do. If you want to create a random generated normal distribution with mean = 9 and sd = 3 for n = 50 you can simply do:

df <- rnorm(n = 50, mean = 9, sd = 3)

In your case, you're saying we need to pick 50 numbers from the above in an almost non random way so we can guarantee it having the properties mean = 9 and sd = 3 while only having 50 values and also being normally distributed. You see the problem is your set of values aren't normally distributed to begin with so we can't simply sample from that and get all the properties.

For example, I fix the set of the first 9 Fibonacci numbers:

(1,1,2,3,5,8,13,21,34)

I would like to generate a sequence of 200 integer numbers by randomly extracting them from that set of 9 elements. However, I would like the distribution (of the 200 numbers drawn) not to be uniform but gaussian: obviously I consider a discrete distribution that looks like the normal distribution. For example, the central element ("5") must have maximum frequency in the sequence of 200 numbers; the other numbers have decreasing frequency: "1" and "34" minimum frequency.

For an uniform distribution I know how:

a<-c(1,1,2,3,5,8,13,21,34)
sample(a, 200, replace = TRUE)

...but I don't know how using a discrete distribution like the one I explained above.

Bye,

M.

It's a bit of an odd request but maybe this is what you're looking for:

a<- c(1,1,2,3,5,8,13,21,34)
a <- unique(a)
qa <- qqnorm(a)

qprob <- dnorm(qa$x)
x <- sample(a, 2000, replace = TRUE, prob=qprob)
hist(x, breaks=0:34)

Created on 2021-01-24 by the reprex package (v0.3.0)

I was thinking about your conditions again. Essentially, we can take rnorm and then simply round those values. The only limitation here is that your set.seed value will matter for the session as you can technically hit 15.6 or something. As you can see in this simple example below we only have values that fall within a however we do not have 1 in the set (which can easily happen given how small their occurrence should be within this distribution shape). This will be iterative depending on how close you need to get to exactly 9 and 3. I hope this helps?

set.seed(2)

df <- rnorm(n = 50, mean = 9, sd = 3)


df <- round(df)

df
#>  [1]  6 10 14  6  9  9 11  8 15  9 10 12  8  6 14  2 12  9 12 10 15  5 14 15  9
#> [26]  2 10  7 11 10 11 10 12  8  7  7  4  6  7  8  8  3  6 15 11 15  8  9  8  5

hist(df)



min(df)
#> [1] 2
max(df)
#> [1] 15

Created on 2021-01-25 by the reprex package (v0.3.0)

It works fine for that particular example.
Thanks,

M.

Great :slight_smile: just mark the post which was the solution for your problem.

Your code works fine, it does just what I wanted.
[Now I just have to figure out how to have a greater or lesser dispersion of the numbers around the average as I like, and also establish the average value]

Thanks,
M.

Hello,

your code works fine for "that" particular example, but it is not generalizable to any starting set. The code proposed by StatSteph instead is applicable to a completely generic starting set and does what I intended. However your code is very useful for me to learn new functions.
Thank you,

Bye
M.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.