values range for 95% of the subjects

Dallak · April 21, 2022, 6:54am

Dear all,
Thank you for all your support and help that I have received so far from this community.

I'm working on the following dataset, and I would like to calculate the range of values for 95% of the speakers for both columns votvd and votvl.

    speaker      votvd      votvl
1  00-M-f04 0.05381864 0.02563282
2  00-Y-f03 0.05909734 0.02136499
3  00-Y-f02 0.04568184 0.01828234
4  00-M-f01 0.05474888 0.02120949
5  00-M-f06 0.06269178 0.01647195
6  70-Y-f03 0.05463603 0.02231716
7  00-Y-f06 0.05470651 0.01782035
8  70-O-f03 0.05123922 0.01738909
9  00-O-f03 0.04375921 0.01616929
10 70-M-f01 0.04228886 0.01998891
11 00-O-f01 0.04210959 0.01687892
12 00-O-f02 0.04471604 0.02048789
13 70-M-f02 0.03971611 0.01403043
14 70-Y-f02 0.06074638 0.01355691
15 70-O-f04 0.04915699 0.02119257
16 00-O-f05 0.05579494 0.01725491
17 70-Y-f01 0.03735125 0.01577395
18 70-M-f04 0.04616147 0.01901408
19 70-Y-f04 0.04636063 0.01615609
20 00-M-f03 0.05671241 0.02621205
21 70-M-f07 0.05455009 0.01966456
22 70-O-f01 0.05379974 0.02257897
23 00-Y-f01 0.04546661 0.01847809

Here is the data.

data <- structure(list(speaker = c("00-M-f04", "00-Y-f03", "00-Y-f02", "00-M-f01", 
"00-M-f06", "70-Y-f03", "00-Y-f06", "70-O-f03", "00-O-f03", "70-M-f01", 
"00-O-f01", "00-O-f02", "70-M-f02", "70-Y-f02", "70-O-f04", "00-O-f05", 
"70-Y-f01", "70-M-f04", "70-Y-f04", "00-M-f03", "70-M-f07", "70-O-f01", 
"00-Y-f01"), votvd = c(0.0538186361816087, 0.0590973443704265, 
0.0456818407451248, 0.0547488762884262, 0.062691784096462, 0.054636032040423, 
0.0547065128257382, 0.0512392236172749, 0.0437592077504489, 0.0422888589173195, 
0.0421095882310396, 0.0447160447066727, 0.0397161050321998, 0.0607463788135851, 
0.04915699000058, 0.055794941335901, 0.0373512463572469, 0.0461614729033426, 
0.0463606295363043, 0.0567124147450744, 0.0545500851509402, 0.0537997365006125, 
0.0454666136349681), votvl = c(0.0256328208390501, 0.0213649868637071, 
0.0182823350591374, 0.0212094920417251, 0.0164719453186502, 0.0223171564809505, 
0.0178203531852858, 0.0173890929808758, 0.0161692865783799, 0.0199889141195467, 
0.0168789203574063, 0.0204878908105645, 0.0140304290078088, 0.0135569091088139, 
0.0211925748569302, 0.0172549136324653, 0.0157739488880231, 0.0190140833820649, 
0.0161560917047786, 0.02621204948485, 0.0196645571410369, 0.0225789744983796, 
0.0184780850804763)), row.names = c(NA, 23L), class = "data.frame")

More specifically, I want to say something like:
Most speakers (95%) have an overall value between ... and ..., compared to the population mean.

So, if I use something like:

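library(dplyr)       # provides %>%
library(tidyr)       # provides pivot_longer()
library(bayestestR)  # assumed source of ci(), judging by the "ETI" label in the output
data %>%
    pivot_longer(!speaker, names_to = "vot", values_to = "value") -> d1

sapply(d1,ci,ci=0.95)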

$value
95% ETI: [14.25, 60.54]

Is this correct? That is, does this mean that 95% of the speakers have an overall value between 14.25 and 60.54? Or does it mean that 95% of the values fall within this range, with no reference to the percentage of speakers involved in the calculation? Am I missing something?

I want a way to support the first interpretation, please.
Thank you in advance!

yifanliu · April 21, 2022, 8:51am

The ci function is used to calculate intervals at a given level. Note that a classical 95% confidence interval for the mean is built from the standard error, i.e. mean(x)±1.96*sd(x)/sqrt(n). The wider range mean(x)±1.96*sd(x) instead describes where about 95% of the individual values fall, assuming they are roughly normally distributed.

The "ETI" in your output is an equal-tailed interval, i.e. the 2.5% and 97.5% quantiles of the values you passed in. Since you passed in the pooled votvd and votvl values, it is about the percentage of values, not the percentage of speakers.

In general, such a 95% range just summarizes how spread out the data are: a narrow range around the mean means the values are tightly concentrated, while a wide range means they are widely dispersed.

In your case, (14.25, 60.54) means that about 95% of the pooled values lie inside that range and, loosely, that a new value drawn at random from the same speakers would be expected to land inside it about 95% of the time. Its midpoint is 37.395, but that is not necessarily the average value of all speakers, because an equal-tailed interval need not be symmetric around the mean.
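
To make the difference between a confidence interval for the mean and a range covering most individual values concrete, here is a minimal base-R sketch using only the votvd column. The values are copied from the rounded table in the question, so results may differ slightly from the full-precision data, and the variable names are only illustrative. It assumes a normal approximation in both cases.

```r
# votvd values, copied from the rounded table in the question
votvd <- c(0.05381864, 0.05909734, 0.04568184, 0.05474888, 0.06269178,
           0.05463603, 0.05470651, 0.05123922, 0.04375921, 0.04228886,
           0.04210959, 0.04471604, 0.03971611, 0.06074638, 0.04915699,
           0.05579494, 0.03735125, 0.04616147, 0.04636063, 0.05671241,
           0.05455009, 0.05379974, 0.04546661)

n  <- length(votvd)
m  <- mean(votvd)
s  <- sd(votvd)
se <- s / sqrt(n)   # standard error of the mean

# 95% confidence interval for the mean (normal approximation)
ci_mean <- c(lower = m - 1.96 * se, upper = m + 1.96 * se)

# Range expected to hold about 95% of individual values,
# assuming the values are roughly normally distributed
range_values <- c(lower = m - 1.96 * s, upper = m + 1.96 * s)

# Share of the observed speakers whose value falls inside each range
share_in_ci    <- mean(votvd >= ci_mean["lower"] & votvd <= ci_mean["upper"])
share_in_range <- mean(votvd >= range_values["lower"] & votvd <= range_values["upper"])

print(ci_mean)
print(range_values)
print(c(share_in_ci = share_in_ci, share_in_range = share_in_range))
```

The interval for the mean is always the narrower of the two, since it shrinks with sqrt(n), so it will usually contain far fewer than 95% of the speakers.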

nirgrahamuk · April 21, 2022, 9:29am

quantile(d1$value,c(.025,1-.025))
      2.5%      97.5% 
0.01424837 0.06054025 

If you want the empirical bounds of 95% of your values, take the remaining 5% and halve it, i.e. exclude the topmost and bottommost 2.5%, leaving the middle 95%.

For the example as given, 95% of your recorded quantities are between 0.0142 and 0.0605. However, this is not the same as 95% of speakers: each speaker contributed 2 values, so this metric covers 95% of the values, not 95% of the speakers.
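
One way to get a statement about speakers rather than pooled values is to compute the quantiles separately for each column, so that every speaker contributes exactly one value per range. The sketch below assumes that this is the intended reading of "95% of the speakers". The values are copied from the rounded table in the question, and the object names bounds, col_means and inside are only illustrative.

```r
# Values copied from the rounded table in the question (one row per speaker)
data <- data.frame(
  votvd = c(0.05381864, 0.05909734, 0.04568184, 0.05474888, 0.06269178,
            0.05463603, 0.05470651, 0.05123922, 0.04375921, 0.04228886,
            0.04210959, 0.04471604, 0.03971611, 0.06074638, 0.04915699,
            0.05579494, 0.03735125, 0.04616147, 0.04636063, 0.05671241,
            0.05455009, 0.05379974, 0.04546661),
  votvl = c(0.02563282, 0.02136499, 0.01828234, 0.02120949, 0.01647195,
            0.02231716, 0.01782035, 0.01738909, 0.01616929, 0.01998891,
            0.01687892, 0.02048789, 0.01403043, 0.01355691, 0.02119257,
            0.01725491, 0.01577395, 0.01901408, 0.01615609, 0.02621205,
            0.01966456, 0.02257897, 0.01847809)
)

probs <- c(0.025, 0.975)

# One pair of bounds per column: each speaker contributes a single value
bounds <- sapply(data, quantile, probs = probs)
print(bounds)

# Column means, for the "compared to the population mean" part
col_means <- colMeans(data)
print(col_means)

# Share of speakers whose value lies inside the bounds of each column
inside <- sapply(names(data), function(v) {
  mean(data[[v]] >= bounds["2.5%", v] & data[[v]] <= bounds["97.5%", v])
})
print(inside)
```

With only 23 speakers, the default quantile() interpolation places each bound between the two most extreme observations at that end, so in this sample 21 of 23 speakers (about 91%) lie inside each range. Treat the bounds as rough estimates rather than exact 95% limits.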

Dallak · April 23, 2022, 12:56pm

Thank you all for your clarifications. It helps a lot.
