How get mean by each 5 columns in a big data set?

M_AcostaCH · May 4, 2023, 4:48pm

Hi Commnity i'm have a data frame with many columns. I need to calculate the mean by each 5 column. The first column is the ID

For example, first mean are column 1 to 5, the second one is column 6 to 10.

The data link:

How make this for all 35 columns. I have othe data with 800 columns.

library(dplyr)
spectra |> group_by(Wavelength) |> 
  summarise(media= c(spectrum00001.asd.sco+spectrum00002.asd.sco+
                       spectrum00003.asd.sco+spectrum00004.asd.sco+
                       spectrum00001.asd.sco)/5) # this run well but is so difficult make one by one.

 spectra |> group_by(Wavelength) |>  # this run wrong 
    mutate(promedio = rowMeans(across(seq(1, ncol(spectra), by = 4),
                                      .names = "prom_{.col}")))

Tnks!

FJCC · May 4, 2023, 6:16pm

I would pivot the data to a long format, extract the spectrum number (e.g. 01,02, etc), make groups by integer division of those numbers by 5, and calculate the average of the groups for each wavelength.

library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
DF <- read.csv("~/R/Play/Spectra_v4.csv")
DFlong <- DF |> pivot_longer(cols = -Wavelength, names_to = "Label", values_to = "Value")
DFlong <- DFlong |> mutate(Group = (as.numeric(str_sub(Label,12,13)) -1) %/% 5)
DFlong
#> # A tibble: 75,285 × 4
#>    Wavelength Label                 Value Group
#>         <int> <chr>                 <dbl> <dbl>
#>  1        350 spectrum00001.asd.sco 0.207     0
#>  2        350 spectrum00002.asd.sco 0.209     0
#>  3        350 spectrum00003.asd.sco 0.212     0
#>  4        350 spectrum00004.asd.sco 0.203     0
#>  5        350 spectrum00005.asd.sco 0.205     0
#>  6        350 spectrum00006.asd.sco 0.220     1
#>  7        350 spectrum00007.asd.sco 0.226     1
#>  8        350 spectrum00008.asd.sco 0.225     1
#>  9        350 spectrum00009.asd.sco 0.222     1
#> 10        350 spectrum00010.asd.sco 0.231     1
#> # … with 75,275 more rows
DF_stats <- DFlong |> group_by(Wavelength, Group) |> summarize(Avg = mean(Value))
#> `summarise()` has grouped output by 'Wavelength'. You can override using the
#> `.groups` argument.
DF_stats
#> # A tibble: 15,057 × 3
#> # Groups:   Wavelength [2,151]
#>    Wavelength Group    Avg
#>         <int> <dbl>  <dbl>
#>  1        350     0 0.207 
#>  2        350     1 0.225 
#>  3        350     2 0.187 
#>  4        350     3 0.157 
#>  5        350     4 0.147 
#>  6        350     5 0.0979
#>  7        350     6 0.254 
#>  8        351     0 0.207 
#>  9        351     1 0.226 
#> 10        351     2 0.191 
#> # … with 15,047 more rows

^{Created on 2023-05-04 with reprex v2.0.2}

fcas80 · May 4, 2023, 7:07pm

Here's another approach, not as elegant. I created my own matrix of numbers.

c <- vector()
m <- matrix(runif(51) , nrow = 5)
n <- floor(ncol(m) / 5)
for (i in 1:n){
c[i] <- mean(m[,c((5*(i-1) + 1):(5*i))])
}
c

M_AcostaCH · May 4, 2023, 7:43pm

Was the solution. Very fast and dinamic stepts. Tnks!

M_AcostaCH · May 4, 2023, 7:45pm

Im run but the result was this:

[1] 0.4632364      0.5403249

system · May 11, 2023, 7:46pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.