# Normalizing with group and overall means using dplyr

Hi,
I'm trying to 'normalize' (best word I can come up with) some data using the ratio of the overall mean to the group mean which is then mutliplied by all the values. I'm trying to use dplyr, but having trouble knowing how to use both the group mean (using group_by ) and the overall mean (ungrouped) in the same formula.

For example:
set.seed(5)
df=data.frame(plate=c(rep(1:4,5),rep(5,3)), var1=rnorm(23),var2=rnorm(23),var3=rnorm(23))
df<-df[order(df\$plate),]
df

The values in each column (e.g., var1, var2) would be multiplied by the (overall mean / plate mean). So, for var1, the overall mean is

mean(df\$var1) #-0.1000353

# means by plate

aggregate(var1~plate, df, mean)

# plate var1

#1 1 -0.2185788
#2 2 -0.2843883
#3 3 -0.2661941
#4 4 -0.3530699
#5 5 1.1034477

So where plate==1, each var1 value would by multiplied by (-0.1000353 / -0.2185788 ), where plate==2, each value would be multiplied by (-0.1000353 /-0.2843883 ), and so on. This would be true for each variable. In theory I'd like this to work for mutliple variables, mutliple plates, with varying numbers of rows per plate (always more one row).

Thank you for any hints/help you can provide. I've had a difficult time trying to find relevant examples.
Nicolle

Hi Nicolle,

The basic strategy that I used to solve your problem (I think I did!) was to use `group_by` and `mutate` together to create the mean columns. Often, people think they can only use `group_by` with `summarize`. Really, `group_by` can be used with other functions that you want to work within each group.

For the case of looking at `var1`, here's what I came up with...

``````df %>%
mutate(mean_var1 = mean(var1)) %>%
group_by(plate) %>%
mutate(mean_var1_plate = mean(var1)) %>%
ungroup() %>%
mutate(var1_normalized = var1 * mean_var1/mean_var1_plate) %>%
select(plate, var1_normalized)
``````
``````# A tibble: 23 × 2
plate var1_normalized
<dbl>           <dbl>
1     1         -0.385
2     1          0.783
3     1         -0.131
4     1         -0.494
5     1         -0.273
6     2          0.487
7     2         -0.212
8     2          0.0486
9     2         -0.0554
10     2         -0.768
# … with 13 more rows
``````

Basically, the first mutate creates a column that's has the same value in every row that is the same as your `mean(df\$var1)`. Next we group by the plate and create another variable that is the mean of `var1` within that plate. This is the same as your `aggregate(var1~plate, df, mean)`. In my code both of these values are stored in `mean_var1` and `mean_var1_plate`. From there you can `ungroup` the data and use another mutate to calculate the normalized values of `var1`.

For the bigger question of how to do this for all of your variables, we take the same strategy, except we gather the five variable columns into two columns and add a nesting layer...

``````df %>%
mutate(observation = 1:nrow(.)) %>%
pivot_longer(-c(plate, observation), names_to="var", values_to="value") %>%
group_by(var) %>%
mutate(mean_var = mean(value)) %>%
group_by(plate, var) %>%
mutate(mean_var_plate = mean(value)) %>%
group_by(var) %>%
mutate(normalized = value * mean_var/mean_var_plate) %>%
ungroup() %>%
select(plate, observation, var, normalized) %>%
pivot_wider(id_cols=c(plate, observation), names_from=var, values_from=normalized) %>%
select(plate, starts_with("var"))
``````
``````# A tibble: 23 × 4
plate    var1    var2   var3
<dbl>   <dbl>   <dbl>  <dbl>
1     1 -0.385   0.286  -5.87
2     1  0.783   0.607   9.68
3     1 -0.131   0.450  -0.742
4     1 -0.494   0.386  -0.424
5     1 -0.273  -0.0578 -3.83
6     2  0.487   0.469  -1.16
7     2 -0.212  -0.376   0.474
8     2  0.0486  1.27   -2.33
9     2 -0.0554 -0.578   0.588
10     2 -0.768   0.888   1.25
# … with 13 more rows
``````

Aside from the `pivot_longer` the main difference was that I needed to add a column `observation` to `df` so that when we do the `pivot_wider` at the end each row has a unique name. Otherwise, `pivot_wider` complains. If this still doesn't make sense, I'd encourage you to run these two pipelines line by line and look at how the output changes as you go along.

Let me know if anything is still unclear!
Pat

1 Like

I am still learning mutate(across()) and took this as a challenge. Getting the overall mean and the mean for each plate group was easy, but connecting each of var1 to var3 with the corresponding overall mean and plate mean was tricky.

``````library(tidyverse)

set.seed(5)
df=data.frame(plate=c(rep(1:4,5),rep(5,3)), var1=rnorm(23),var2=rnorm(23),var3=rnorm(23))
df<-df[order(df\$plate),]

df %>% mutate(across(var1:var3, mean, .names = "overall_{.col}")) %>%
group_by(plate) %>%
mutate(across(var1:var3, mean, .names = "plate_{.col}")) %>%
ungroup() %>%
mutate(across(var1:var3, ~ .x *
get(glue::glue("overall_{cur_column()}")) /
get(glue::glue("plate_{cur_column()}")),
.names = "norm_{.col}")) %>%
select(plate, norm_var1:norm_var3)
#> # A tibble: 23 × 4
#>    plate norm_var1 norm_var2 norm_var3
#>    <dbl>     <dbl>     <dbl>     <dbl>
#>  1     1   -0.385     0.286     -5.87
#>  2     1    0.783     0.607      9.68
#>  3     1   -0.131     0.450     -0.742
#>  4     1   -0.494     0.386     -0.424
#>  5     1   -0.273    -0.0578    -3.83
#>  6     2    0.487     0.469     -1.16
#>  7     2   -0.212    -0.376      0.474
#>  8     2    0.0486    1.27      -2.33
#>  9     2   -0.0554   -0.578      0.588
#> 10     2   -0.768     0.888      1.25
#> # … with 13 more rows
``````

Created on 2021-08-17 by the reprex package (v2.0.1)

3 Likes

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.