How to average a subset of data based on other columns

rneville11 · March 6, 2019, 1:45am

Hi! I have a question about how to subset data to then achieve an average of a specific group of numbers.
For instance:

block<-c('T1', 'T1', 'T1', 'T1','T2', 'T2', 'T2', 'T2')
plot<-c('1', '1', '1', '1','1', '1', '1', '1')
subplot<-c('NO', 'NI', 'EO', 'EI','NO', 'NI', 'EO', 'EI')
data1<-c(0,2,4,2,3,5,2,4)
data2<-c(1,1,1,7,2,2,3,1)

orig.DF<-data.frame(block, plot, subplot, data1, data2)

So I have block, plot, and subplot as my three levels. I want to have a single average for each plot. So I need to combine the data for each subplot to have a single average value for data1 and data2 that represent a block and plot.
I want a dataframe that looks like this:
block plot data1 data2
1 1 2 2.5
2 1 3.5 2

I can probably use the subset or aggregate functions but I'm not sure how to syntax this!
Please let me know if you need any additional information! Thank you!

andresrcs · March 6, 2019, 1:59am

You can do it with the dplyr package

orig.DF <- data.frame(stringsAsFactors = FALSE,
                      data1 = c(0, 2, 4, 2, 3, 5, 2, 4),
                      data2 = c(1, 1, 1, 7, 2, 2, 3, 1),
                      block = as.factor(c("T1", "T1", "T1", "T1", "T2", "T2", "T2", "T2")),
                      plot = as.factor(c("1", "1", "1", "1", "1", "1", "1", "1")),
                      subplot = as.factor(c("NO", "NI", "EO", "EI", "NO", "NI", "EO", "EI"))
)

library(dplyr)

orig.DF %>% 
    group_by(block, plot) %>% 
    summarise(mean_data1 = mean(data1), mean_data2 = mean(data2))
#> # A tibble: 2 x 4
#> # Groups:   block [2]
#>   block plot  mean_data1 mean_data2
#>   <fct> <fct>      <dbl>      <dbl>
#> 1 T1    1            2          2.5
#> 2 T2    1            3.5        2

^{Created on 2019-03-06 by the reprex package (v0.2.1)}

You can learn how to use dplyr and the rest of the tidyverse by reading this free ebook

rneville11 · March 6, 2019, 2:06am

Thank you! That's exactly what I want. What if instead of just 'data1' and 'data2' I have like 30 different columns, is there a shorter way to do that? Like specifying column (4:34) or something?

andresrcs · March 6, 2019, 2:10am

You can use summarise_if() or summarise_at(), this is an example that applies mean() function to all the numeric variables.

orig.DF %>% 
    group_by(block, plot) %>% 
    summarise_if(is.numeric, mean)
#> # A tibble: 2 x 4
#> # Groups:   block [2]
#>   block plot  data1 data2
#>   <fct> <fct> <dbl> <dbl>
#> 1 T1    1       2     2.5
#> 2 T2    1       3.5   2

rneville11 · March 6, 2019, 2:18am

Perfect! Worked like a charm. Thank you so much.

Eventually what I need to do is also average groups of columns. So say I have data1:data20. And then doing the same thing as above, I then need have average 1:4, 5:8, 9:12, and so on. Any ideas?

andresrcs · March 6, 2019, 2:39am

You can also perform rowwise operations but this is not the preferred method becuse it's "untidy".

orig.DF %>% 
    group_by(block, plot) %>% 
    summarise_if(is.numeric, mean) %>%
    rowwise() %>% 
    mutate(mean_1_2 = mean(c(data1,data2)))
#> Source: local data frame [2 x 5]
#> Groups: <by row>
#> 
#> # A tibble: 2 x 5
#>   block plot  data1 data2 mean_1_2
#>   <fct> <fct> <dbl> <dbl>    <dbl>
#> 1 T1    1       2     2.5     2.25
#> 2 T2    1       3.5   2       2.75

I recommend you to open a new topic about this but providing a relevant REPRoducible EXample (reprex), that way we can better understand the structure of your dataset and provide more suitable help.

If you've never heard of a reprex before, you might want to start by reading this FAQ:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

system · March 13, 2019, 2:39am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.