About the meaning of moderndive bootstrap 95CI.

Rsky · November 22, 2020, 8:39am

I am learning on this site.

Chapter 8 Bootstrapping and Confidence Intervals | Statistical Inference via Data Science

And an unsure expression came up.
It's about bootstrap.

I didn't understand the "common mistakes" part about 95% of bootstraps.

Here are my thoughts

A bootstrap is the restoration and extraction of a single sample no1.
The average value of the 1000 "re-samples" created by the bootstrap matches the average value of the sample no1 by the central limit theorem.
The standard error of the "re-sample" agrees with the variance of the sample distribution that has been sampled from the population. (An empirical fact?)
The interpretation of "95%" is "the amount of confidence".
The mathematical meaning is in the range of mean to 1.96 sigma.
The wider the range, the more population parameters can be expected to be included.

Neither the 95% nor the 100% percentile includes the true value if the sample you chose happens to be biased, right?

First, is this thoughts correct?

And this description also confused me.

N = US population.
n = 50 sample.
sampling n * 100 from N ? (= 500 people?)
and
100 confidence intercals(=50 people 100 group, each group bootstrapped and make CI ?)
95 out of 100 contain true parameters.

Secondly,

In the first question I understood that 95% is not a clear probability, but a range of distance spread, but isn't this story discussing the probability of the population parameters entering?
It is difficult to understand.

Please let me know if there are any points I'm making mistakes with, or paraphrases or sites that will help me understand.

thank you.

AlexisW · November 24, 2020, 12:50am

I think you are missing something. Let's imagine we have a population, of which you can measure one particular parameter X. Let's assume the real population follows a normal distribution of mean \mu = 10 and variance \sigma = 1. Of course, you don't know these actual values, instead you measure a sample of n=10 individuals. In R code:

mu <- 10
sigma <- 1
n <- 10

set.seed(10)
my_sample <- rnorm(n = n, mean = mu, sd = sqrt(sigma))

In that context, we actually have a formula for the confidence interval at 95%:
CI_{95} = (\bar{x} - t^*\frac{s}{\sqrt{n}} ; \bar{x} + t^*\frac{s}{\sqrt{n}})

We can compute it on R:

mean(my_sample)
sd(my_sample)

ci_low <- mean(my_sample) - 1.96 * sd(my_sample) / sqrt(n)
ci_high <- mean(my_sample) + 1.96 * sd(my_sample) / sqrt(n)

paste0("CI_95: (", round(ci_low,2),"; ", round(ci_high, 2),")")

And, oh, what a surprise, for that particular sample, the CI is (9.08; 9.94), it does NOT contain the true mean of the population! Here, obviously, I cheated and selected that seed on purpose. You can try to rerun these lines without touching the seed, so that you get random samples, you will see that most of them contain the true mean of the population, but not all.

So, staying with that particular sample, let me ask you this:

What is the probability for the CI to contain the true mean of the population?

The answer is obvious: 10 is not in the interval [9.08; 9.94], so P(10 \in [9.08;9.94]) = 0

Alternatively, say you had selected another sample from that same population:

set.seed(1)
my_sample <- rnorm(n = n, mean = mu, sd = sqrt(sigma))

mean(my_sample)
sd(my_sample)

ci_low <- mean(my_sample) - 1.96 * sd(my_sample) / sqrt(n)
ci_high <- mean(my_sample) + 1.96 * sd(my_sample) / sqrt(n)

paste0("CI_95: (", round(ci_low,2),"; ", round(ci_high, 2),")")
#> [1] "CI_95: (9.65; 10.62)"

This time, you can write that P(10 \in [9.65;10.62]) = 1

So, the fact that a given CI for a given sample contains or not the true mean of the population is not a random variable, it is an actual fact of nature, it is 0 or 1.

Other examples could include:

What is the probability that there is water on Earth?

(obviously, it's 1, no one can doubt it)

What is the probability that your first name is John?

In that case, I have no idea. But you know your name, and you know that either it is John, with probability 1, or it's not, and the probability is 0.

So, what's the point of confidence intervals? That's where the frequentist framework comes in play (the alternative, the bayesian framework, actually uses probabilities as subjective measures of confidence, but that's not the one used here). The idea is that if you take lots and lots of random samples from that same population, and for each one compute the 95% CI, on average 95% of the CI will contain the true mean. So here the random variable is the CI itself (or the sample), not the true mean. One practical consequence is that if I'm about to take a sample from the population, before I do anything, I have a 95% chance that my sample will contain the true mean; once I've taken the sample the probabilities collapse and it's either 0 or 1. So, if before I'm taking any sample I decide that I'll do this if the CI contains 0, and that if the CI doesn't, then I have only a 5% chance to make the wrong decision.

Finally, a note on the bootstrap: in my example, I could directly use a formula to compute the CI, but if I didn't know that the true population follows a normal distribution I may not have such a formula. Bootstrapping is just a way, by resampling from the sample, to make a similar inference. Formula and bootstrap are just two ways to get an estimate of the CI.

I think I answered that, to some extent yes, the CI does give an idea of the spread, but it's not its definition (that would be the variance). Rather, it can be seen as a measure of the precision (see in particular its dependence to sample size).

AlexisW · November 24, 2020, 12:53am

And I would suggest an exercise: if you compute the CI for 100 samples, how many times will the true population mean be in the CI? Make a prediction and verify it using my code above except for the set.seed().

Rsky · November 25, 2020, 11:57am

Perhaps my question is clear!
It was difficult to come up with a question and the question was vague, but you understood it perfectly.
Thank you.

So when I make 100 95 CIs, 95 95 CIs contains the true average.
And containing the true mean is 0,1.

count <- 0
for(i in 1:100){
get resample[i]
compute resample 95CI

if(include true 'mean' == 1){
count <- count +1
}else{}

}

count

maybe return 95.

system · December 16, 2020, 11:57am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.