# How get mean by each 5 columns in a big data set?

Hi Commnity i'm have a data frame with many columns. I need to calculate the mean by each 5 column. The first column is the `ID`

For example, first mean are column 1 to 5, the second one is column 6 to 10.

How make this for all 35 columns. I have othe data with 800 columns.

``````library(dplyr)
spectra |> group_by(Wavelength) |>
summarise(media= c(spectrum00001.asd.sco+spectrum00002.asd.sco+
spectrum00003.asd.sco+spectrum00004.asd.sco+
spectrum00001.asd.sco)/5) # this run well but is so difficult make one by one.

spectra |> group_by(Wavelength) |>  # this run wrong
mutate(promedio = rowMeans(across(seq(1, ncol(spectra), by = 4),
.names = "prom_{.col}")))
``````

Tnks!

I would pivot the data to a long format, extract the spectrum number (e.g. 01,02, etc), make groups by integer division of those numbers by 5, and calculate the average of the groups for each wavelength.

``````library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(stringr)
DFlong <- DF |> pivot_longer(cols = -Wavelength, names_to = "Label", values_to = "Value")
DFlong <- DFlong |> mutate(Group = (as.numeric(str_sub(Label,12,13)) -1) %/% 5)
DFlong
#> # A tibble: 75,285 × 4
#>    Wavelength Label                 Value Group
#>         <int> <chr>                 <dbl> <dbl>
#>  1        350 spectrum00001.asd.sco 0.207     0
#>  2        350 spectrum00002.asd.sco 0.209     0
#>  3        350 spectrum00003.asd.sco 0.212     0
#>  4        350 spectrum00004.asd.sco 0.203     0
#>  5        350 spectrum00005.asd.sco 0.205     0
#>  6        350 spectrum00006.asd.sco 0.220     1
#>  7        350 spectrum00007.asd.sco 0.226     1
#>  8        350 spectrum00008.asd.sco 0.225     1
#>  9        350 spectrum00009.asd.sco 0.222     1
#> 10        350 spectrum00010.asd.sco 0.231     1
#> # … with 75,275 more rows
DF_stats <- DFlong |> group_by(Wavelength, Group) |> summarize(Avg = mean(Value))
#> `summarise()` has grouped output by 'Wavelength'. You can override using the
#> `.groups` argument.
DF_stats
#> # A tibble: 15,057 × 3
#> # Groups:   Wavelength [2,151]
#>    Wavelength Group    Avg
#>         <int> <dbl>  <dbl>
#>  1        350     0 0.207
#>  2        350     1 0.225
#>  3        350     2 0.187
#>  4        350     3 0.157
#>  5        350     4 0.147
#>  6        350     5 0.0979
#>  7        350     6 0.254
#>  8        351     0 0.207
#>  9        351     1 0.226
#> 10        351     2 0.191
#> # … with 15,047 more rows
``````

Created on 2023-05-04 with reprex v2.0.2

1 Like

Here's another approach, not as elegant. I created my own matrix of numbers.

c <- vector()
m <- matrix(runif(51) , nrow = 5)
n <- floor(ncol(m) / 5)
for (i in 1:n){
c[i] <- mean(m[,c((5*(i-1) + 1):(5*i))])
}
c

1 Like

Was the solution. Very fast and dinamic stepts. Tnks!

``````[1] 0.4632364      0.5403249