Analysis of Variance using R

Boboye.x · November 8, 2022, 10:56am

I have this small dataset and I am attempting to find the mean, standard deviation and analysis of variance.

Europeans <- c(16, 22, 11, 14, 19, 16, 23, 22, 13, 23, 21, 18, 15, 16, 13, 13)
Americans <- c(20, 16, 16, 18, 21, 7, 21, 19, 9, 19, 19, 12, 15, 16, 20, 30, 15, 23, 17)
Australasians <- c(22, 18, 18, 18, 19, 17, 19, 17, 12, 13, 19, 21, 14, 14, 16, 17, 15, 18, 14, 15, 20, 16, 16, 17)

Here's the hypothesis: people's region of birth is associated with their decisiveness, such that Europeans will score higher on decisiveness than the other two groups.

I want to find the mean and standard deviation for each group. Any help will be much appreciated.

FJCC · November 8, 2022, 2:48pm

Here are two methods for calculating the mean and standard deviation.

Europeans <- c(16, 22, 11, 14, 19, 16, 23, 22, 13, 23, 21, 18, 15, 16, 13, 13)
Americans <- c(20, 16, 16, 18, 21, 7, 21, 19, 9, 19, 19, 12, 15, 16, 20, 30, 15, 23, 17)
Australasians <- c(22, 18, 18, 18, 19, 17, 19, 17, 12, 13, 19, 21, 14, 14, 16, 17, 15, 18, 14, 15, 20, 16, 16, 17)
# Method 1: compute each statistic separately for each vector
AvgE <- mean(Europeans)
SdE <- sd(Europeans)

AvgAm <- mean(Americans)
SdAm <-  sd(Americans)

AvgAu <- mean(Australasians)
SdAu <- sd(Australasians)

AvgE
#> [1] 17.1875
SdE
#> [1] 4.020261

AvgAm
#> [1] 17.52632
SdAm
#> [1] 5.070266

AvgAu
#> [1] 16.875
SdAu
#> [1] 2.507597

# Method 2: stack the data in a long data frame and summarise by group with dplyr
DF <- data.frame(Origin = c(rep("Eur", length(Europeans)),
                            rep("Am", length(Americans)),
                            rep("Aus", length(Australasians))),
                 Deci = c(Europeans, Americans, Australasians)
)
head(DF)
#>   Origin Deci
#> 1    Eur   16
#> 2    Eur   22
#> 3    Eur   11
#> 4    Eur   14
#> 5    Eur   19
#> 6    Eur   16
library(dplyr)

STATS <- DF |> group_by(Origin) |> summarise(Avg = mean(Deci),
                                             Sig = sd(Deci))
STATS
#> # A tibble: 3 × 3
#>   Origin   Avg   Sig
#>   <chr>  <dbl> <dbl>
#> 1 Am      17.5  5.07
#> 2 Aus     16.9  2.51
#> 3 Eur     17.2  4.02

Created on 2022-11-08 with reprex v2.0.2
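For anyone who would rather not load dplyr, here is a minimal base R sketch of the same per-group summary. It is only an illustration, not part of the reply above; the names `groups` and `stats` are made up for this example, and the `N` column is an extra that shows the unequal group sizes.

```r
Europeans <- c(16, 22, 11, 14, 19, 16, 23, 22, 13, 23, 21, 18, 15, 16, 13, 13)
Americans <- c(20, 16, 16, 18, 21, 7, 21, 19, 9, 19, 19, 12, 15, 16, 20, 30, 15, 23, 17)
Australasians <- c(22, 18, 18, 18, 19, 17, 19, 17, 12, 13, 19, 21, 14, 14, 16, 17, 15, 18, 14, 15, 20, 16, 16, 17)

# Collect the vectors in a named list (hypothetical name) so one call covers all groups
groups <- list(Eur = Europeans, Am = Americans, Aus = Australasians)

# One row per group: sample size, mean and standard deviation
stats <- data.frame(
  N   = lengths(groups),
  Avg = sapply(groups, mean),
  Sig = sapply(groups, sd)
)
stats
```

The `Avg` and `Sig` columns should agree with the values printed by the two methods above.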

fcas80 · November 8, 2022, 6:42pm

The original post also asked for an analysis of variance. An analysis of variance can be accomplished with aov() after running a linear regression, but in this example the sample sizes of the three groups are not equal.

How does one run aov with different sample sizes?

The hypothesis that Europeans will score higher on decisiveness than the other two groups sounds like a t-test problem. Europeans can be compared separately with each of the other two groups. For example:

t.test(Europeans, Americans, alternative = "greater", var.equal = FALSE)
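As a rough sketch of that idea, both comparisons can be run and their p-values collected in one place. The Holm correction via p.adjust() is my addition here, included because two tests are run on the same data; the object names (`t_am`, `t_aus`, `p_raw`, `p_adj`) are made up for illustration.

```r
Europeans <- c(16, 22, 11, 14, 19, 16, 23, 22, 13, 23, 21, 18, 15, 16, 13, 13)
Americans <- c(20, 16, 16, 18, 21, 7, 21, 19, 9, 19, 19, 12, 15, 16, 20, 30, 15, 23, 17)
Australasians <- c(22, 18, 18, 18, 19, 17, 19, 17, 12, 13, 19, 21, 14, 14, 16, 17, 15, 18, 14, 15, 20, 16, 16, 17)

# One-sided Welch t-tests: is the European mean greater than the other group's mean?
t_am  <- t.test(Europeans, Americans,     alternative = "greater", var.equal = FALSE)
t_aus <- t.test(Europeans, Australasians, alternative = "greater", var.equal = FALSE)

# Raw p-values, then Holm-adjusted p-values to account for running two tests
p_raw <- c(Eur_vs_Am = t_am$p.value, Eur_vs_Aus = t_aus$p.value)
p_adj <- p.adjust(p_raw, method = "holm")

p_raw
p_adj
```

Note that the Europeans' sample mean (17.19) is below the Americans' (17.53), so the one-sided p-value for that comparison is necessarily above 0.5.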

FactOREO · November 8, 2022, 6:59pm

An AOV can be performed on a data.frame with either the lm() function or the aov() function (either works, since a one-way AOV is an ordinary linear regression fit with a categorical predictor). It works whether or not the sample sizes are equal (it would be pretty useless if it were restricted to equal sizes). Here is an example of how to calculate an AOV in R with your data:

Europeans <- c(16, 22, 11, 14, 19, 16, 23, 22, 13, 23, 21, 18, 15, 16, 13, 13)
Americans <- c(20, 16, 16, 18, 21, 7, 21, 19, 9, 19, 19, 12, 15, 16, 20, 30, 15, 23, 17)
Australasians <- c(22, 18, 18, 18, 19, 17, 19, 17, 12, 13, 19, 21, 14, 14, 16, 17, 15, 18, 14, 15, 20, 16, 16, 17)

Data <- data.frame(
  Continent = as.factor(c(rep('Europeans', length(Europeans)),
                          rep('Americans', length(Americans)),
                          rep('Australasians', length(Australasians)))),
  value = c(Europeans, Americans, Australasians)
)
model_lm <- lm(value ~ Continent, data = Data)
model_aov <- aov(value ~ Continent, data = Data)

summary(model_lm)
#> 
#> Call:
#> lm(formula = value ~ Continent, data = Data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -10.526  -2.357   0.125   2.125  12.474 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)             17.5263     0.8937  19.611   <2e-16 ***
#> ContinentAustralasians  -0.6513     1.1962  -0.544    0.588    
#> ContinentEuropeans      -0.3388     1.3218  -0.256    0.799    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.896 on 56 degrees of freedom
#> Multiple R-squared:  0.005274,   Adjusted R-squared:  -0.03025 
#> F-statistic: 0.1485 on 2 and 56 DF,  p-value: 0.8624
summary.aov(model_lm)
#>             Df Sum Sq Mean Sq F value Pr(>F)
#> Continent    2    4.5   2.253   0.148  0.862
#> Residuals   56  849.8  15.175
summary(model_aov)
#>             Df Sum Sq Mean Sq F value Pr(>F)
#> Continent    2    4.5   2.253   0.148  0.862
#> Residuals   56  849.8  15.175

Created on 2022-11-08 by the reprex package (v2.0.1)
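To see why unequal sample sizes are not a problem, the F statistic in the tables above can be rebuilt by hand. This is only a sketch of the standard one-way decomposition into between-group and within-group sums of squares; the variable names are made up for illustration. Each group's size simply enters as a weight in the between-group sum.

```r
Europeans <- c(16, 22, 11, 14, 19, 16, 23, 22, 13, 23, 21, 18, 15, 16, 13, 13)
Americans <- c(20, 16, 16, 18, 21, 7, 21, 19, 9, 19, 19, 12, 15, 16, 20, 30, 15, 23, 17)
Australasians <- c(22, 18, 18, 18, 19, 17, 19, 17, 12, 13, 19, 21, 14, 14, 16, 17, 15, 18, 14, 15, 20, 16, 16, 17)

groups <- list(Europeans = Europeans, Americans = Americans, Australasians = Australasians)
n     <- lengths(groups)            # group sizes: 16, 19, 24
means <- sapply(groups, mean)       # group means
grand <- mean(unlist(groups))       # grand mean over all 59 observations

# Between-group sum of squares: each squared deviation is weighted by its group size
ss_between <- sum(n * (means - grand)^2)
# Within-group sum of squares: squared deviations from each group's own mean
ss_within  <- sum(sapply(groups, function(x) sum((x - mean(x))^2)))

df_between <- length(groups) - 1        # 2
df_within  <- sum(n) - length(groups)   # 56

f_stat <- (ss_between / df_between) / (ss_within / df_within)
p_val  <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

c(SS_between = ss_between, SS_within = ss_within, F = f_stat, p = p_val)
```

The sums of squares, F value and p-value should match the `Continent` and `Residuals` rows printed by summary(model_aov).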

As you can see, in your sample there is no statistically significant difference in means among the three groups (F = 0.148 on 2 and 56 degrees of freedom, p-value 0.8624). In the coefficient table, the intercept is the mean of the baseline group (Americans, the first factor level). It is statistically significant, but that only says the baseline mean differs from zero. The coefficients for Europeans and Australasians are the differences between their means and the baseline mean, and neither is significant, which is consistent with the non-significant AOV.
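The link between the coefficient table and the group means can be checked directly. The sketch below refits the same model and rebuilds each group mean from the coefficients; `group_means` and `rebuilt` are names made up for this illustration. It relies on R's default treatment contrasts, where the first factor level (here Americans, by alphabetical order) is the baseline.

```r
Europeans <- c(16, 22, 11, 14, 19, 16, 23, 22, 13, 23, 21, 18, 15, 16, 13, 13)
Americans <- c(20, 16, 16, 18, 21, 7, 21, 19, 9, 19, 19, 12, 15, 16, 20, 30, 15, 23, 17)
Australasians <- c(22, 18, 18, 18, 19, 17, 19, 17, 12, 13, 19, 21, 14, 14, 16, 17, 15, 18, 14, 15, 20, 16, 16, 17)

Data <- data.frame(
  Continent = factor(c(rep("Europeans", length(Europeans)),
                       rep("Americans", length(Americans)),
                       rep("Australasians", length(Australasians)))),
  value = c(Europeans, Americans, Australasians)
)

model_lm <- lm(value ~ Continent, data = Data)
b <- coef(model_lm)

# Plain group means, in factor-level order (Americans, Australasians, Europeans)
group_means <- sapply(split(Data$value, Data$Continent), mean)

# Intercept = baseline mean; every other coefficient = difference from the baseline
rebuilt <- c(
  Americans     = b[["(Intercept)"]],
  Australasians = b[["(Intercept)"]] + b[["ContinentAustralasians"]],
  Europeans     = b[["(Intercept)"]] + b[["ContinentEuropeans"]]
)

all.equal(rebuilt, group_means)
```

If you would rather have Europeans as the baseline, so that each coefficient is a difference from the European mean, relevel(Data$Continent, ref = "Europeans") before fitting should do that.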

In conclusion, you cannot say that there is a statistically significant difference in means between your defined groups with the given sample data.

Kind regards

Edit: As a side note, do not run several t-tests as a standard procedure. With enough groups to compare, you increase your likelihood of finding some statistically significant result by chance alone. Use the AOV first (it is an omnibus test, i.e. it tells you whether there is anything to find at all, but not where exactly). If it is significant, state your hypothesis and test those specific cases, instead of testing everything and adjusting your hypothesis to fit the results.
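A minimal sketch of that workflow: read the omnibus p-value off the aov fit and only look at follow-up comparisons if it is below the chosen level. Tukey's HSD is used here purely as one common follow-up. It compares all pairs rather than only a pre-specified hypothesis, so treat it as an illustration; the 0.05 threshold and the name `omnibus_p` are assumptions of this example.

```r
Europeans <- c(16, 22, 11, 14, 19, 16, 23, 22, 13, 23, 21, 18, 15, 16, 13, 13)
Americans <- c(20, 16, 16, 18, 21, 7, 21, 19, 9, 19, 19, 12, 15, 16, 20, 30, 15, 23, 17)
Australasians <- c(22, 18, 18, 18, 19, 17, 19, 17, 12, 13, 19, 21, 14, 14, 16, 17, 15, 18, 14, 15, 20, 16, 16, 17)

Data <- data.frame(
  Continent = factor(c(rep("Europeans", length(Europeans)),
                       rep("Americans", length(Americans)),
                       rep("Australasians", length(Australasians)))),
  value = c(Europeans, Americans, Australasians)
)

model_aov <- aov(value ~ Continent, data = Data)

# p-value of the omnibus F test (first row of the ANOVA table)
omnibus_p <- summary(model_aov)[[1]][["Pr(>F)"]][1]

if (omnibus_p < 0.05) {
  # Only now look at where the differences are, with family-wise error control
  print(TukeyHSD(model_aov))
} else {
  message("Omnibus test not significant (p = ", round(omnibus_p, 4),
          "); no follow-up comparisons are run.")
}
```

With this data the omnibus p-value is about 0.86, so the follow-up branch is skipped.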
