Select a percentage of rows for each group and calculate a stastistics for a specific column

H everyone, I have a little problem, and I hope you can help me.

I have a dataset:

value <- c(3,4,5,4,3,4,5,4,3,4,3,4,6,5,4,5,6,7,5,4,6,4,6,3,2,6,7,4,7,8,5,6,7,5,5,7,8,7,45,3,4,3,4,6,4,3,2,1,2,3,4)
Sub <- c(rep("sub1", 10), rep("sub2",25), rep("sub3", 16))

dat <-as.data.frame(Sub, value)

It'a only an example, but it is useful to give you an idea of my purposes.

What I need to do is:

  1. to divide for each subject the variable value in 10 part and to calculate a statistics, such as the mean
  2. I should create a dataframe like this:
  p1    p2    p3    p4    p5    p6    p7   p8    p9   p10
Sub1   
Sub2

Sub...

I used a function like this to calculate a percentage of rows to analyze.

subset_percent <- function(x, start=0, end=100) {
  stopifnot(
    is.numeric(start), is.numeric(end),
    start < end, start >= 0, end <= 100
  )
  nr <- nrow(x)
  rows <- 1:nr
  srt <- ceiling(start*nr/100)
  end <-floor(end*nr/100)
  x[srt:end,]
}

My intent was to use dplyr to apply this fu> nction to each group:

newdat <- dat %>% 
  group_by(Sub) %>% 
  summarise(
p10 = mean(subset_percent(dat,0,10)$value),
p20 = mean(subset_percent(dat,10,20)$value),
p30 = mean(subset_percent(dat,20,30)$value),
p40 = mean(subset_percent(dat,30,40)$value),
p50 = mean(subset_percent(dat,40,50)$value),
p60 = mean(subset_percent(dat,50,60)$value),
p70 = mean(subset_percent(dat,60,70)$value),
p80 = mean(subset_percent(dat,70,80)$value),
p90 = mean(subset_percent(dat,80,90)$value),
p100 = mean(subset_percent(dat,90,100)$value))

However the output was not correct, because I got this erroneous result:

Sub     p10   p20   p30   p40   p50   p60   p70   p80   p90  p100
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 sub1    3.8   3.9  4.07   4.4  4.36   4.7  4.83  5.98  5.78  5.39
2 sub2    3.8   3.9  4.07   4.4  4.36   4.7  4.83  5.98  5.78  5.39
3 sub3    3.8   3.9  4.07   4.4  4.36   4.7  4.83  5.98  5.78  5.39

Do you have any advice?

Thank you

Hi,

Welcome to the RStudio community!

Thanks for creating a nice reproducible example, that's making this much easier. If I understand you correctly, you can do what you want to achieve in just a few lines:

library(dplyr)
library(tidyr)

value <- c(3,4,5,4,3,4,5,4,3,4,3,4,6,5,4,5,6,7,5,4,6,4,6,3,2,
          6,7,4,7,8,5,6,7,5,5,7,8,7,45,3,4,3,4,6,4,3,2,1,2,3,4)
Sub <- c(rep("sub1", 10), rep("sub2",25), rep("sub3", 16))

dat <- data.frame(sub = Sub, value = value)

#Set number of groups
nGroups = 10

#Add groups column
dat = dat %>% group_by(sub) %>% 
 mutate(group = 0:(n()-1) %% nGroups + 1)
dat
#> # A tibble: 51 x 3
#> # Groups:   sub [3]
#>    sub   value group
#>    <chr> <dbl> <dbl>
#>  1 sub1      3     1
#>  2 sub1      4     2
#>  3 sub1      5     3
#>  4 sub1      4     4
#>  5 sub1      3     5
#>  6 sub1      4     6
#>  7 sub1      5     7
#>  8 sub1      4     8
#>  9 sub1      3     9
#> 10 sub1      4    10
#> # ... with 41 more rows

#Summarise the groups per subject
dat = dat %>% 
 group_by(sub, group) %>% 
 #Replace 'mean' by any summary function you like
 summarise(value = mean(value), .groups = "drop")
dat
#> # A tibble: 30 x 3
#>    sub   group value
#>    <chr> <dbl> <dbl>
#>  1 sub1      1     3
#>  2 sub1      2     4
#>  3 sub1      3     5
#>  4 sub1      4     4
#>  5 sub1      5     3
#>  6 sub1      6     4
#>  7 sub1      7     5
#>  8 sub1      8     4
#>  9 sub1      9     3
#> 10 sub1     10     4
#> # ... with 20 more rows

#Convert to wide format
dat = dat %>% 
 pivot_wider(sub, names_from = group, values_from = value)
dat
#> # A tibble: 3 x 11
#>   sub     `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
#>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 sub1   3     4     5     4     3      4     5     4       3     4
#> 2 sub2   4.67  4.67  6.33  4.33  3.67   5.5   6.5   5.5     6     6
#> 3 sub3   5     5     4    23.5   3      4     3     4       6     4

Created on 2021-12-12 by the reprex package (v2.0.1)

Here is all the functions combined into one:

dat = dat %>% group_by(sub) %>% 
  mutate(group = 0:(n()-1) %% 10 + 1) %>% 
  group_by(sub, group) %>% 
  summarise(value = mean(value), .groups = "drop") %>% 
  pivot_wider(sub, names_from = group, values_from = value)

We are now creating 10 groups by consecutively taking samples from each subject. Should you rather create random groups (still respecting the subjects of course), you can add one more line

dat = dat %>% group_by(sub) %>% 
  mutate(
    group = 0:(n()-1) %% 10 + 1,
    #Shuffle the groups within subject to make them random
    group = sample(group, n())) %>% 
  group_by(sub, group) %>% 
  summarise(value = mean(value), .groups = "drop") %>% 
  pivot_wider(sub, names_from = group, values_from = value)

The results will now differ depending on the randomness of the groups

Hope this helps,
PJ

1 Like

Dear pieterjanvc,
thank you for your support. Your script works perfectly and it is well written!
I learnt something new.

Bets regards
Francesco

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.