How to average/mean variables in R based on the level of another variable (and save this as a new variable)?

EDresearcher · May 22, 2018, 7:35pm

I'm very new to R (and coding in general), and I'm using RStudio. I have a question about how to create a new variable that is the average of another variable, computed separately for each level of a third variable.

I am doing a meta-analysis with my dataset, metacomplete, and I'm trying to average effect sizes (variable: selectedES.prepost) into one value per paper (variable: Paper#). Basically, some papers have more than one effect size and I want to average them, so that each paper has only one effect size.
(I will later need to be able to manipulate this new variable.)
I tried a lot of things but can't figure this out!!
My dataset: metacomplete.

The relevant variables I have in this dataset are:

Paper#(the number indicates which paper the entry came from)
selectedES.prepost (numerical variable for my separate effect sizes)
Thank you in advance!

jcblum · May 22, 2018, 11:43pm

Welcome! I'm afraid you'll need to supply some more info in order for helpers to be able to understand your problem (this is pretty common — when you're new to this stuff, it's hard to know how much information is enough!).

The best thing would be if you can make your question into a reproducible example (follow the link for instructions and explanations). To include your data, you'll want to follow one of the methods discussed here.

If you try all that and get stuck, here's a fallback option...

Edit your post and add in some of the code you have tried. It's OK if it doesn't work! It's really helpful to see what you've been attempting. Be sure to format your code as code (it's really hard to read code that isn't formatted properly).
Include sample data:
- If your data set is OK to share, run the following line and paste the output into your post. Again, be sure to format it as code
```
dput(head(metacomplete, 10))
```
- If your data set can't be shared, run this line instead and paste the output into your post (and yes, format as code!) This will still share some information about your data. If it's truly confidential, I'm afraid you'll need to make a fake sample dataset to share.
```
str(metacomplete)
```

EDresearcher · May 23, 2018, 3:29am

Hello,

I’m adding more information to try to make my question/example reproducible

My variable names will be for this example:
Variable 1: Paper
Variable 2: selectedES.prepost
Need to create Variable 3: averaged.ES

Code for the data

#data for Paper
Paper= c(1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9 )

#data for selectedES.prepost
selectedES.prepost=c(0.0048, -0.1420, -0.3044, -1.3024, -0.4052, -0.6066, -0.1961, -1.1187, -0.4585, -0.8251, -0.5328, -1.3623, -0.5450, -0.4982, -0.5714, -0.8793, -0.3677, -0.3976, -0.6136, -0.7047, -0.8580, -0.5024, -0.8018, -0.8927, -0.3106, -0.5893, -0.6677, -1.6663, -1.1769, -0.8384, -0.5632, -0.5237, -0.3458, -0.9957, -0.5331, -0.7413, -0.0311, -0.4936, 0.5422, -0.0340)

#creating a test dataset
mydata <- data.frame(Paper,selectedES.prepost)

What I would like to do:

I would like to average the selectedES.prepost variable within each level of the Paper variable, so that every paper ends up with a single averaged value. For example, for Paper = 1, it should average 0.0048, -0.1420, -0.3044, -1.3024, -0.4052, -0.6066, -0.1961, & -1.1187 to get -0.5088. Since I have nine unique values for Paper (1, 2, 3, 4, 5, 6, 7, 8, 9), I should get nine averages.
The averages should be the following values, shown in the averaged.ES column on the last row of each paper (according to Excel):

Paper selectedES.prepost averaged.ES
1 0.0048
1 -0.1420
1 -0.3044
1 -1.3024
1 -0.4052
1 -0.6066
1 -0.1961
1 -1.1187 -0.5088
2 -0.4585 -0.4585
3 -0.8251
3 -0.5328
3 -1.3623 -0.9067
4 -0.5450
4 -0.4982
4 -0.5714
4 -0.8793
4 -0.3677
4 -0.3976
4 -0.6136
4 -0.7047 -0.5722
5 -0.8580
5 -0.5024
5 -0.8018
5 -0.8927
5 -0.3106
5 -0.5893 -0.6591
6 -0.6677 -0.6677
7 -1.6663
7 -1.1769
7 -0.8384
7 -0.5632
7 -0.5237
7 -0.3458 -0.8524
8 -0.9957
8 -0.5331
8 -0.7413
8 -0.0311 -0.5753
9 -0.4936
9 0.5422
9 -0.0340 0.0049

jcblum · May 23, 2018, 3:54am

Thanks very much! You did great. Here's a tip: you can edit any of your own posts — look for the little gray pencil icon at the bottom of a post and click it to edit.

If you're having trouble with formatting as code, the simplest way is just to select all the text you need to format in the editing box and click the </> button at the top of the editing box. That will get you 90% of the way there (it will format the selected text as generic code, rather than specifically R code, but honestly that's good enough in most cases!)

EDresearcher · May 23, 2018, 4:04am

Thanks so much for your help!! I really appreciate this.

markdly · May 23, 2018, 4:28am

Does this do what you need?

#data for Paper
Paper <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9 )

#data for selectedES.prepost
selectedES.prepost <- c(0.0048, -0.1420, -0.3044, -1.3024, -0.4052, -0.6066, -0.1961, -1.1187, -0.4585, -0.8251, -0.5328, -1.3623, -0.5450, -0.4982, -0.5714, -0.8793, -0.3677, -0.3976, -0.6136, -0.7047, -0.8580, -0.5024, -0.8018, -0.8927, -0.3106, -0.5893, -0.6677, -1.6663, -1.1769, -0.8384, -0.5632, -0.5237, -0.3458, -0.9957, -0.5331, -0.7413, -0.0311, -0.4936, 0.5422, -0.0340)

#creating a test dataset
mydata <- data.frame(Paper, selectedES.prepost)

library(tidyverse)
mydata %>% 
  group_by(Paper) %>% 
  summarise(average = mean(selectedES.prepost))
#> # A tibble: 9 x 2
#>   Paper  average
#>   <dbl>    <dbl>
#> 1     1 -0.509  
#> 2     2 -0.458  
#> 3     3 -0.907  
#> 4     4 -0.572  
#> 5     5 -0.659  
#> 6     6 -0.668  
#> 7     7 -0.852  
#> 8     8 -0.575  
#> 9     9  0.00487

As an aside, in case you're wondering what this looks like as 'raw' text (so that the code shows up nicely), I added this answer to this gist. (You'll need to click on the 'raw' tab to see the unformatted text.)

Created on 2018-05-23 by the reprex package (v0.2.0).

jcblum · May 23, 2018, 4:31am

Here's a tidyverse approach to summarizing your data:

library(tidyverse)

Paper <- c(
  1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4, 4, 
  4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 
  7, 8, 8, 8, 8, 9, 9, 9
)

selectedES.prepost <- c(
  0.0048, -0.1420, -0.3044, -1.3024, -0.4052, -0.6066, -0.1961, 
  -1.1187, -0.4585, -0.8251, -0.5328, -1.3623, -0.5450, -0.4982, 
  -0.5714, -0.8793, -0.3677, -0.3976, -0.6136, -0.7047, -0.8580, 
  -0.5024, -0.8018, -0.8927, -0.3106, -0.5893, -0.6677, -1.6663, 
  -1.1769, -0.8384, -0.5632, -0.5237, -0.3458, -0.9957, -0.5331, 
  -0.7413, -0.0311, -0.4936, 0.5422, -0.0340
)

# creating a test dataset
mydata <- data.frame(Paper, selectedES.prepost)

mean_by_Paper <- mydata %>% 
  group_by(Paper) %>% 
  summarize(averaged.ES = mean(selectedES.prepost))

mean_by_Paper
#> # A tibble: 9 x 2
#>   Paper averaged.ES
#>   <dbl>       <dbl>
#> 1     1    -0.509  
#> 2     2    -0.458  
#> 3     3    -0.907  
#> 4     4    -0.572  
#> 5     5    -0.659  
#> 6     6    -0.668  
#> 7     7    -0.852  
#> 8     8    -0.575  
#> 9     9     0.00487

# You don't have to stop at the mean...
by_Paper <- mydata %>% 
  group_by(Paper) %>% 
  summarize(
    averaged.ES = mean(selectedES.prepost),
    sd.ES = sd(selectedES.prepost),
    n = n()
  )

by_Paper
#> # A tibble: 9 x 4
#>   Paper averaged.ES  sd.ES     n
#>   <dbl>       <dbl>  <dbl> <int>
#> 1     1    -0.509    0.472     8
#> 2     2    -0.458   NA         1
#> 3     3    -0.907    0.421     3
#> 4     4    -0.572    0.166     8
#> 5     5    -0.659    0.230     6
#> 6     6    -0.668   NA         1
#> 7     7    -0.852    0.493     6
#> 8     8    -0.575    0.409     4
#> 9     9     0.00487  0.519     3

Created on 2018-05-22 by the reprex package (v0.2.0).

Does that help?

jcblum · May 23, 2018, 4:40am

Ha! Our posts crossed in the ether!

@EDresearcher, if you notice that @markdly and I spelled summarise()/summarize() differently, don't be confused: the author of the dplyr package (where this function comes from) happens to have specifically designed the function to work with either spelling.
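A quick way to check this for yourself is sketched below. It is only an illustration: it uses a small subset of the sample data from earlier in the thread (Papers 2, 3 and 9) to stay short, and the object names `british` and `american` are made up for this example.

```r
library(dplyr)

# Subset of the thread's sample data (Papers 2, 3 and 9 only), to keep this short
mydata <- data.frame(
  Paper = c(2, 3, 3, 3, 9, 9, 9),
  selectedES.prepost = c(-0.4585, -0.8251, -0.5328, -1.3623,
                         -0.4936, 0.5422, -0.0340)
)

# Same pipeline twice, differing only in the spelling of the last function
british <- mydata %>%
  group_by(Paper) %>%
  summarise(averaged.ES = mean(selectedES.prepost))

american <- mydata %>%
  group_by(Paper) %>%
  summarize(averaged.ES = mean(selectedES.prepost))

# Compare the two results
all.equal(british, american)
```

If the two spellings really are interchangeable, the comparison should report no differences.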

markdly · May 23, 2018, 4:48am

Yes, I think I should delete my answer! @jcblum's is much more complete (and also formatted properly!).

EDresearcher · May 23, 2018, 5:12pm

Wow! This was amazing!! Thanks so much to both of you!

I am still getting an error, though, if I try to refer to these variables in my original full dataset.
My full dataset is called metacomplete, and does contain the variables Paper and selectedES.prepost

My code

meta.paper <- data.frame(metacomplete$Paper, metacomplete$selectedES.prepost)
#this created a new dataset, meta.paper, and named the variables "metacomplete.Paper" & "metacomplete.selectedES.prepost"
group_by(meta.paper$metacomplete.Paper) %>%
summarize(meta.paper$averaged.ES = mean(meta.paper$metacomplete.selectedES.prepost))

Then I got this error message:

Error: unexpected '=' in:
"group_by(meta.paper$metacomplete.Paper) %>%
summarize(meta.paper$averaged.ES ="

However, I can still use this code, when I create the variables outside my full dataset. I just don't know how to use this method for variables in my dataset.

Thanks again though!! This was extremely helpful!!

tbradley · May 23, 2018, 5:28pm

You should not use the $ inside of your dplyr code. The code you posted in your last example should look like this:

meta.paper %>%
  group_by(metacomplete.Paper) %>%
  summarize(averaged.ES = mean(metacomplete.selectedES.prepost))

Taking it one step further, you could do it like this:

metacomplete %>%
  select(Paper, selectedES.prepost) %>%
  group_by(Paper) %>%
  summarize(averaged.ES = mean(selectedES.prepost))

That way, you do not need to create the intermediate data frame at the beginning.

jcblum · May 23, 2018, 5:46pm

dplyr uses some tricks so that you don’t have to specify your data frame variables using the $ syntax (and in fact, it gets confused if you do, as you’ve seen!)

A nice thing about using dplyr is that it keeps your data transformations in the context of the data frame, and therefore in context of each other. When you operate on individual columns of a data frame that you’ve pulled out using $, you’ve actually copied the data in that variable out into a separate object that has lost its relationship with other variables in the data frame. When people are starting out, this pattern of work tends to result in lots of intermediate objects littering your environment and can lead to confusing bugs.
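To make that concrete, here is a minimal sketch of what happens when a column is pulled out with `$`. It assumes a small subset of the sample data from earlier in the thread (Papers 2, 3 and 9), and `es` is just an illustrative name.

```r
# Subset of the thread's sample data (Papers 2, 3 and 9 only)
mydata <- data.frame(
  Paper = c(2, 3, 3, 3, 9, 9, 9),
  selectedES.prepost = c(-0.4585, -0.8251, -0.5328, -1.3623,
                         -0.4936, 0.5422, -0.0340)
)

# Pulling a column out with $ gives a plain numeric vector, not a data frame
es <- mydata$selectedES.prepost
is.data.frame(es)
is.numeric(es)

# Changing the extracted copy does not change the data frame it came from
es[1] <- 99
mydata$selectedES.prepost[1]

# The vector no longer knows which Paper each value belongs to,
# so mean() can only give one overall number rather than one per paper
mean(es)
```

Here `es` and `mydata` drift apart as soon as one of them is modified, which is the kind of situation that can lead to the confusing bugs mentioned above.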

But I know R’s “there’s more than one way to do it” ethos can be a lot to take in when you’re getting started! I recommend checking out this chapter of R for Data Science to start internalizing dplyr’s grammar.

And of course there’s the dplyr documentation site, and the tidyverse learning resources page.
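As one illustration of the "more than one way to do it" point, the same per-paper means can also be computed without dplyr. The sketch below uses base R's `aggregate()` instead of the `group_by()`/`summarize()` approach from earlier in the thread; it assumes a small subset of the sample data (Papers 2, 3 and 9), and `by_paper_base` is a made-up name.

```r
# Subset of the thread's sample data (Papers 2, 3 and 9 only)
mydata <- data.frame(
  Paper = c(2, 3, 3, 3, 9, 9, 9),
  selectedES.prepost = c(-0.4585, -0.8251, -0.5328, -1.3623,
                         -0.4936, 0.5422, -0.0340)
)

# Formula interface: "mean of selectedES.prepost for each value of Paper"
by_paper_base <- aggregate(selectedES.prepost ~ Paper, data = mydata, FUN = mean)

# aggregate() keeps the original column name, so rename it to match the thread
names(by_paper_base)[2] <- "averaged.ES"

by_paper_base
```

Neither approach is more correct than the other; the dplyr version tends to read more naturally once several steps are chained together.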

And keep asking questions!

EDresearcher · May 23, 2018, 5:57pm

Oh, this is great!

Sorry that I am asking so many questions (this is my first time using R for an analysis or doing any coding at all).

Using the method that @tbradley described, am I able to save averaged.ES as a variable in the meta.paper dataset? When I use this code, it doesn't save averaged.ES.

I'm learning so much already! I'm really impressed by the helpfulness and high-quality, well-informed responses I am getting. Thank you all!
I definitely will try to contribute myself to this forum after I gain more experience and learn more about R. Great community!

jcblum · May 23, 2018, 6:23pm

You have permission to ask as many questions as you need to — that’s one of the things this community is for, after all.

As you observed, @tbradley’s code takes the metacomplete data frame and passes it from function to function, making changes along the way — but the final result isn’t saved anywhere (it just gets printed as output). This is a safer thing to do when giving advice than to hand somebody code that might overwrite an object in their environment when you don’t know if they’ll understand that’s going to happen!

If you wanted to overwrite old metacomplete with the result of the summarization, you’d do this:

metacomplete <- metacomplete %>%
  select(Paper, selectedES.prepost) %>%
  group_by(Paper) %>%
  summarize(averaged.ES = mean(selectedES.prepost))

But I strongly suspect that in this case, you’ll prefer to save the result to a new dataframe, which you do like this:

metacomplete_summary <- metacomplete %>%
  select(Paper, selectedES.prepost) %>%
  group_by(Paper) %>%
  summarize(averaged.ES = mean(selectedES.prepost))

Can you sort of see what’s going on now? I think of that last bit of code as being like telling R: “Please define metacomplete_summary as the result of (deep breath): starting with metacomplete, selecting the Paper and selectedES.prepost columns, grouping the rows by the value of Paper, then calculating a mean for each group and saving it in a new column called averaged.ES”

(yes, I like to imagine that I say my pleases-and-thank-yous when I talk to R)
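If the goal is instead to keep every original row and have averaged.ES sitting next to it as a new column, one possible sketch is to swap `summarize()` for `mutate()`. This is a different technique from the code above, not a correction of it. It assumes a small subset of the sample data from earlier in the thread (Papers 2, 3 and 9), and `mydata_with_avg` is just an illustrative name.

```r
library(dplyr)

# Subset of the thread's sample data (Papers 2, 3 and 9 only)
mydata <- data.frame(
  Paper = c(2, 3, 3, 3, 9, 9, 9),
  selectedES.prepost = c(-0.4585, -0.8251, -0.5328, -1.3623,
                         -0.4936, 0.5422, -0.0340)
)

# mutate() adds a column without collapsing rows, so every effect size
# keeps its own row and gains the mean of its paper alongside it
mydata_with_avg <- mydata %>%
  group_by(Paper) %>%
  mutate(averaged.ES = mean(selectedES.prepost)) %>%
  ungroup()  # drop the grouping so later steps are not grouped by accident

mydata_with_avg
```

With `summarize()` you get one row per paper; with `mutate()` you get the same number of rows you started with, and the per-paper mean is repeated on each row of that paper. Which one is preferable depends on what the later analysis needs.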

EDresearcher · May 23, 2018, 6:56pm

Okay, yes!!

Thanks so much