How to create means across subjects within groups?

Hi there!

So, I'm working on this project where the premise is as follows:

There are 10 subjects in a study (or, well, in this simplified dummy example I have created here -- code will be pasted below for reproducibility). These 10 subjects collected two samples per day (once in the morning before a meal, then later at night, after a meal) for two consecutive weeks (14 days total, thus each subject has 28 samples contributed/observations/measurements making 280 samples total in the data set).

Each day they collected a FASTED sample (upon waking) and a FED sample. It is important to note that the meal between collections differed between weeks.

On week 1, donors ate a meal without fish oil, and on week 2, donors ate a meal with a serving size of fish oil. 5 biomarkers of interest were measured.

The end goal of this study is to do the following matched pairs t-tests per biomarker:

  1. AVG_FED_WK1 vs. AVG_FAST_WK1

  2. AVG_FED_WK2 vs. AVG_FAST_WK2.

  3. AVG_FED_WK1 vs AVG_FED_WK2 (aka AVG_FED_NO_FISH_OIL vs. AVG_FED_WITH_FISH_OIL), and

  4. AVG_FAST_WK1 vs AVG_FAST_WK2 (aka AVG_FAST_NO_FISH_OIL vs. AVG_FAST_WITH_FISH_OIL)

Here is the code I produced that is referenced at the top of the post:

# I'm sure there's overall a much cleaner/better way to produce the following code
# but this is just what was immediately obvious to me. 
PARENT_SAMPLE_NAME <- c()
for (number in 1:280){
  
  PARENT_SAMPLE_NAME[number] = paste("PSN", sep = "_", number)
  
}
WEEK <- rep(c(rep("WEEK1", 14), rep("WEEK2", 14)), 10)
FED_FASTED <- rep(c("FASTED","FED"), 140)
FISH_OIL <- rep(c(rep("NO", 14), rep("YES", 14)), 10)
DONOR_ID <- rep(c("S01",
                  "S02",
                  "S03",
                  "S04",
                  "S05",
                  "S06",
                  "S07",
                  "S08",
                  "S09",
                  "S10"),
                28)

SUBJECT_ID <- DONOR_ID[order(DONOR_ID)]

X01 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X02 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X03 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X04 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X05 <- exp(rnorm(n = 280, mean = 0, sd = 1))

newColumn1 <- paste(dummydf$WEEK, sep = "_" , dummydf$FED_FASTED)
newColumn2 <- paste(dummydf$FED_FASTED, sep = "_", dummydf$FISH_OIL)

dummydf <- data.frame(PARENT_SAMPLE_NAME,
                      SUBJECT_ID,
                      WEEK,
                      FED_FASTED,
                      FISH_OIL,
                      newColumn1,
                      newColumn2,
                      X01,
                      X02,
                      X03,
                      X04,
                      X05)
View(dummydf)

Here is the part I'm struggling with. I first need to average each subject’s fed measurements within each week and average all the fed measurements within each week. So for each donor, I will now have four measurements,

AVG_FED_WK1,
AVG_FAST_WK1,
AVG_FED_WK2,
AVG_FAST_WK2 (or however you want to name them).

I don't know how to go about doing this. Intuition tells me there could be some apply() type argument used here, but maybe not. Perhaps there's a more efficient way with dplyr? If my understanding is correct, I need to produce a "collapsed" data set of sorts, where instead of 10 observations of each biomarker per subject, there are only 4 and it is the average of the relevant observations per group.

This would mean that the dataset now shrinks from 280 rows to 70 rows, right? Furthermore, it seems I need to produce new columns within my data set to handle each of the two sets of t-tests (one column for the Week 1: Fed/Fasted and Week 2: Fed/Fasted t-tests, and then another for the FED: Yes/No fish oil and FASTED: Yes/No fish oil t-test).

Perhaps naively, I attempted to do this via using the paste() command and basically just merging the two relevant columns together to produce the group names that I think I need, so now it's just a question of how do I "collapse" this data set and produce a set of averages per group?

I apologize if anything is unclear. Please let me know if so and I'd be happy to provide further clarification. I appreciate you taking the time to read this question and consider its contents. Thank you!

All the best,
-Radon.

Check group_by() and summarize() functions from dplyr package, and you can store the aggregated data in a different data frame.

To calculate the mean and standard deviation of each group, you can use the functions suggested by @ibertchen. Here is an example.

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
PARENT_SAMPLE_NAME <- c()
for (number in 1:280){
  
  PARENT_SAMPLE_NAME[number] = paste("PSN", sep = "_", number)
  
}
WEEK <- rep(c(rep("WEEK1", 14), rep("WEEK2", 14)), 10)
FED_FASTED <- rep(c("FASTED","FED"), 140)
FISH_OIL <- rep(c(rep("NO", 14), rep("YES", 14)), 10)
DONOR_ID <- rep(c("S01",
                  "S02",
                  "S03",
                  "S04",
                  "S05",
                  "S06",
                  "S07",
                  "S08",
                  "S09",
                  "S10"),
                28)

SUBJECT_ID <- DONOR_ID[order(DONOR_ID)]

X01 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X02 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X03 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X04 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X05 <- exp(rnorm(n = 280, mean = 0, sd = 1))

newColumn1 <- paste(WEEK, sep = "_" , FED_FASTED)
newColumn2 <- paste(FED_FASTED, sep = "_", FISH_OIL)

dummydf <- data.frame(PARENT_SAMPLE_NAME,
                      SUBJECT_ID,
                      WEEK,
                      FED_FASTED,
                      FISH_OIL,
                      newColumn1,
                      newColumn2,
                      X01,
                      X02,
                      X03,
                      X04,
                      X05)

#Calculate means for X01
X01Mean <- dummydf |> group_by(FED_FASTED, FISH_OIL) |> 
  summarize(MeanX01 = mean(X01), SDX01 = sd(X01))
#> `summarise()` has grouped output by 'FED_FASTED'. You can override using the
#> `.groups` argument.
X01Mean
#> # A tibble: 4 × 4
#> # Groups:   FED_FASTED [2]
#>   FED_FASTED FISH_OIL MeanX01 SDX01
#>   <chr>      <chr>      <dbl> <dbl>
#> 1 FASTED     NO          2.10  4.31
#> 2 FASTED     YES         1.66  2.29
#> 3 FED        NO          1.55  1.57
#> 4 FED        YES         1.49  1.47

#Do all the markers at once
dummyLong <- dummydf |> pivot_longer(cols = X01:X05, names_to = "Marker")
head(dummyLong)
#> # A tibble: 6 × 9
#>   PARENT_SAMPLE_NAME SUBJEC…¹ WEEK  FED_F…² FISH_…³ newCo…⁴ newCo…⁵ Marker value
#>   <chr>              <chr>    <chr> <chr>   <chr>   <chr>   <chr>   <chr>  <dbl>
#> 1 PSN_1              S01      WEEK1 FASTED  NO      WEEK1_… FASTED… X01    0.238
#> 2 PSN_1              S01      WEEK1 FASTED  NO      WEEK1_… FASTED… X02    2.11 
#> 3 PSN_1              S01      WEEK1 FASTED  NO      WEEK1_… FASTED… X03    0.888
#> 4 PSN_1              S01      WEEK1 FASTED  NO      WEEK1_… FASTED… X04    0.736
#> 5 PSN_1              S01      WEEK1 FASTED  NO      WEEK1_… FASTED… X05    2.53 
#> 6 PSN_2              S01      WEEK1 FED     NO      WEEK1_… FED_NO  X01    0.233
#> # … with abbreviated variable names ¹​SUBJECT_ID, ²​FED_FASTED, ³​FISH_OIL,
#> #   ⁴​newColumn1, ⁵​newColumn2

Means <- dummyLong |> group_by(FED_FASTED, FISH_OIL, Marker) |> 
  summarize(MeanMarker = mean(value), SDMarker = sd(value))
#> `summarise()` has grouped output by 'FED_FASTED', 'FISH_OIL'. You can override
#> using the `.groups` argument.
Means
#> # A tibble: 20 × 5
#> # Groups:   FED_FASTED, FISH_OIL [4]
#>    FED_FASTED FISH_OIL Marker MeanMarker SDMarker
#>    <chr>      <chr>    <chr>       <dbl>    <dbl>
#>  1 FASTED     NO       X01          2.10     4.31
#>  2 FASTED     NO       X02          1.32     1.21
#>  3 FASTED     NO       X03          1.80     2.27
#>  4 FASTED     NO       X04          1.78     1.81
#>  5 FASTED     NO       X05          1.66     2.65
#>  6 FASTED     YES      X01          1.66     2.29
#>  7 FASTED     YES      X02          1.68     1.67
#>  8 FASTED     YES      X03          1.35     1.30
#>  9 FASTED     YES      X04          1.69     2.15
#> 10 FASTED     YES      X05          1.55     1.91
#> 11 FED        NO       X01          1.55     1.57
#> 12 FED        NO       X02          1.61     1.92
#> 13 FED        NO       X03          1.76     1.83
#> 14 FED        NO       X04          1.63     2.65
#> 15 FED        NO       X05          1.76     1.81
#> 16 FED        YES      X01          1.49     1.47
#> 17 FED        YES      X02          1.47     1.42
#> 18 FED        YES      X03          1.44     1.17
#> 19 FED        YES      X04          1.69     2.98
#> 20 FED        YES      X05          1.79     1.84

Created on 2023-01-11 with reprex v2.0.2
However, to calculate the test results, you need to use the raw data, not the calculated means. Also, consider using an ANOVA test before doing multiple t tests.

Thanks for the detailed response! The functions @ibertchen suggested are certainly useful. I've been trying to figure out how to configure them for my problem and was getting stuck. I've also been trying the aggregate() function too, but I really like what you've presented here.

I see you're using this |> operator a lot, is that a new pipe operator? What is the command you are using to generate that? I've never seen it before.

Yes, |> is a pipe operator introduced at R 4.1, I think. You can just type the two characters or set the RStudio option to use it as the result of CTRL + Shift + M under the menu Tools -> Global Options -> Code

That's cool.

I was actually able to figure out the solution to my problem. I played around with the solutions here and wound up getting what I was seeking. I thought I'd share it here just for the sake of completeness. Perhaps others might find it useful.

PARENT_SAMPLE_NAME <- c()
for (number in 1:280){
  PARENT_SAMPLE_NAME[number] = paste("PSN", sep = "_", number)
}
WEEK <- rep(c(rep("WEEK1", 14), rep("WEEK2", 14)), 10)
FED_FASTED <- rep(c("FASTED","FED"), 140)
FISH_OIL <- rep(c(rep("NO", 14), rep("YES", 14)), 10)
DONOR_ID <- rep(c("S01", "S02", "S03", "S04", "S05","S06","S07", "S08","S09", "S10"), 28)

SUBJECT_ID <- DONOR_ID[order(DONOR_ID)]

sampleMetaData <- data.frame(PARENT_SAMPLE_NAME, SUBJECT_ID, WEEK, FED_FASTED, FISH_OIL)

X01 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X02 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X03 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X04 <- exp(rnorm(n = 280, mean = 0, sd = 1))
X05 <- exp(rnorm(n = 280, mean = 0, sd = 1))

batchNormImputedData <- data.frame(PARENT_SAMPLE_NAME, X01, X02, X03, X04, X05)
rm(PARENT_SAMPLE_NAME, SUBJECT_ID, WEEK, FED_FASTED, FISH_OIL, X01, X02, X03, X04, X05, DONOR_ID, number)

# Combine meta data and batch data and removing applicable samples
workingData <- left_join(sampleMetaData, batchNormImputedData, by = "PARENT_SAMPLE_NAME")

# Separating out fed and fast data
fedData <- workingData %>% filter(FED_FASTED == "FED")
fastData <- workingData %>% filter(FED_FASTED == "FASTED")
# rm(workingData)

# Average fed and fast data per donor
avgFedData <- fedData %>% group_by(FED_FASTED, SUBJECT_ID, WEEK) %>%
  summarise(across(X01:X05, mean, .groups = "drop_last"))
fedID <- fedData %>% group_by(SUBJECT_ID, WEEK) %>% filter(row_number() == 1) %>% select(PARENT_SAMPLE_NAME, SUBJECT_ID, WEEK)
avgFedData <- left_join(fedID, avgFedData)

avgFastData <- fastData %>% group_by(FED_FASTED, SUBJECT_ID, WEEK) %>%
  summarise(across(X01:X05, mean, .groups = "drop_last"))
fastID <- fastData %>% group_by(SUBJECT_ID, WEEK) %>% filter(row_number() == 1) %>% select(PARENT_SAMPLE_NAME, SUBJECT_ID, WEEK)
avgFastData <- left_join(fastID, avgFastData)

batchNormImputedData <- rbind(avgFedData, avgFastData)

Still, thank you for taking the time to help me and get me on the path I needed. Cheers!