# Creating a new variable based on he mean or other mathematical function of existing variables?

Hi here, I would like to know how to create a new variable that is based on the average/mean of a set of existing variables in my dataset. My data looks like this:

amb1 amb2 amb3
4 7 5
4 4 5
5 4 5
5 6 6
.........................

How do I create a new variable column based on the mean values of amb1, amb2, and amb3? I suppose I could simply type the entire formula for the mean in the following code, but I am sure there is a simpler way of using some kind of function command?

mutate(mscstart, amb =(amb1 + amb2 + amb3)/3)

Thanks!

You can use `rowMeans()` from base:
https://stat.ethz.ch/R-manual/R-devel/RHOME/library/base/html/colSums.html

Or `rowwise()` from dplyr:
https://dplyr.tidyverse.org/reference/rowwise.html

I am sorry, I am relatively new to R and am still struggling with the code based on the links you provided. I tried the following based on rowmeans but it does not work. I can't seem to select the variables I need which are amb1 to amb3. I forgot to mention that these columns are part of a larger dataset with other variables. Hence, I do have to select the specific variables to calculate their mean.

mscstart <-
mutate(mscstart,
amb = rowMeans(select(mscstart, starts_with("amb"))

amb1 amb2 amb3
4 7 5
4 4 5
5 4 5
5 6 6

``````library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

mscstart <- read.table(text = 'amb1 amb2 amb3
4 7 5
4 4 5
5 4 5
5 6 6',

mscstart %>%
mutate(amb = rowMeans(x = select(.data = .,
starts_with(match = "amb"))))
#>   amb1 amb2 amb3      amb
#> 1    4    7    5 5.333333
#> 2    4    4    5 4.333333
#> 3    5    4    5 4.666667
#> 4    5    6    6 5.666667

# you can also do this, in case all columns start with "amb"
mscstart %>% mutate(amb = rowMeans(.))
#>   amb1 amb2 amb3      amb
#> 1    4    7    5 5.333333
#> 2    4    4    5 4.333333
#> 3    5    4    5 4.666667
#> 4    5    6    6 5.666667
``````

Created on 2019-05-23 by the reprex package (v0.3.0)

You were on the right track, just missed 2 brackets. Add two `)` at the end, and bingo!

Hope this helps.

Thanks so much. Your code certainly works when the data is entered manually. However, I am still having issues when I am adding the new created "amb" variable it to an existing dataset containing other variables(See attached)

I specifically ran the following code:

``````mscstart %>%
mutate(amb = rowMeans(x = select(.data =mscstart,
starts_with(match = "amb"))))
``````

Interestingly, "amb" appears in the list of variables in the tibble (see the very last variable that appears in the output) that appears in the console:

``````" A tibble: 127 x 88
sn gender   age race  marital educ  generation `repeat` eatout cuisine clubjoin
<dbl> <chr>  <dbl> <chr> <chr>   <chr> <chr>      <chr>     <dbl> <chr>   <chr>
1     1 F         73 Chin~ .........
..............
#   dish13 <dbl>, dish14 <dbl>, music <chr>, freeg <chr>, setmenu <chr>,
#   vmonlr <dbl>, gender.d <dbl>, race.d <dbl>, amb <dbl>"
``````

However, when I proceed to check the the dataset using either `view(mscstart)` OR `mscstart\$amb`, I get the following message:

"NULL
Warning message:
Unknown or uninitialised column: 'amb'."

Do you know what is wrong?

Thanks again.

It's not stored yet.

Try this:

``````mscstart <- mscstart %>%
mutate(amb = rowMeans(x = select(.data =mscstart,
starts_with(match = "amb"))))
``````

Then try `mscstart\$amb`.

On a separate note, please place your codes inside a pair of triple backticks (```). I edited your code, I hope you won't mind.

1 Like

It works! Thanks! 2 more questions please:
(1) How to I select specific variables that do not begin with the same string? e.g. What if I wanted to calculate the mean based on amb1, amb2, and sercli1 and sercli2?

(2) How do I round off the calculated means to let's say 1 decimal place?

Thanks again!

Here are two possible solutions to your questions.

``````library(datapasta)
library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang

# (1) How to I select specific variables that do not begin with the same
# string? e.g. What if I wanted to calculate the mean based on amb1, amb2,
# and sercli1 and sercli2?

# (2) How do I round off the calculated means to let's say 1 decimal place?

DF <- data.frame(
amb1 = c(2, 3, 2, 4, 6, 7, 3, 5),
amb2 = c(3, 2, 4, 1, 3, 5, 6, 4),
sercli1 = c(4, 6, 3, 5, 5, 4, 4, 6),
sercli2 = c(3, 3, 4, 6, 6, 6, 3, 4))

DF %>%
dplyr::mutate(means = rowMeans(
# this is where you can drop a select
dplyr::select(.,
# with a dplyr::one_of()
dplyr::one_of(
c('amb1','amb2','sercli1'))),
# remove any missing
na.rm = TRUE),
# round the mean value
means = base::round(means, digits = 2))
#>   amb1 amb2 sercli1 sercli2 means
#> 1    2    3       4       3  3.00
#> 2    3    2       6       3  3.67
#> 3    2    4       3       4  3.00
#> 4    4    1       5       6  3.33
#> 5    6    3       5       6  4.67
#> 6    7    5       4       6  5.33
#> 7    3    6       4       3  4.33
#> 8    5    4       6       4  5.00
``````

Created on 2019-05-23 by the reprex package (v0.3.0)

Hope this works!

Thanks for the advice. I managed to run the dplyr code successfully. However, I then subsequently tried to calculate the descriptive statistics for the newly created variable ("sercli") using the following code in the "pastecs" package:

mscstart %>%
stat.desc (sercli)

(where mscstart is my imported data set)

but I got this error message instead:

Any idea what I am doing wrong? Is there an alternative way to get the descriptive statistics for a selected variable?

Thanks!

``````# install.packages("pastecs")
library(pastecs)
library(tidyverse)
``````

Get data from package (I don't have your data).

``````marphy <- pastecs::marphy %>% as_tibble()
``````

Reproduce this problem (i.e. with new data)

``````marphy %>% pastecs::stat.desc(Salinity)
``````

We actually need the other pipe `%\$%` from `magrittr`. This is usually a good backup if the traditional pipe `%>%` isn't working.

``````library(magrittr)
marphy %\$% pastecs::stat.desc(Salinity)
#>      nbr.val     nbr.null       nbr.na          min          max
#> 6.800000e+01 0.000000e+00 0.000000e+00 3.811600e+01 3.844800e+01
#>        range          sum       median         mean      SE.mean
#> 3.320000e-01 2.604111e+03 3.828350e+01 3.829575e+01 1.207984e-02
#> CI.mean.0.95          var      std.dev     coef.var
#> 2.411146e-02 9.922728e-03 9.961289e-02 2.601147e-03
``````

Also check out `skmir` package (also needs `rlang`)

``````library(skimr)
library(rlang)
skimr::skim(.data = marphy)
#> Skim summary statistics
#>  n obs: 68
#>  n variables: 4
#>
#> ── Variable type:numeric ────────────────────────────────────────────────
#>      variable missing complete  n  mean    sd    p0   p25   p50   p75
#>       Density       0       68 68 28.96 0.076 28.81 28.89 28.97 29.03
#>  Fluorescence       0       68 68  1.17 0.28   0.72  0.97  1.06  1.45
#>      Salinity       0       68 68 38.3  0.1   38.12 38.21 38.28 38.4
#>   Temperature       0       68 68 13.01 0.081 12.82 12.93 13.03 13.07
#>   p100     hist
#>  29.06 ▃▃▂▂▅▃▆▇
#>   1.69 ▂▃▇▂▂▂▂▃
#>  38.45 ▃▆▃▅▅▂▇▆
#>  13.15 ▂▂▃▁▅▇▆▁
``````

It provides loads of great summary stats.

Hope this helps!

Created on 2019-05-24 by the reprex package (v0.3.0)

1 Like

Works really well. Thanks for introducing me to these great packages.

hi there, i have been playing around with my dataset in R more and have more queries if you don't mind. As before, I have named my dataset "mscstart).

I tried creating a new variable (e.g. vmon1r) based on the recode of an existing variable in my data. For instance, the formula I used was "vmon1r = 8 - vmon1" where vmon1 is the existing variable. Based on the previous examples, I used the following code:

mscstart <- mscstart %>%
mutate(mscstart,
vmonlr = 8 - vmon1)

I checked the my data using "view(mscstart)" and saw that an additional column was indeed created. I even see it in the summary I generated using "summary(mscdata). However, when I wanted to get the details of this new variable using "mscstart\$vmon1r" , I received the following warning message:

"Warning message:
Unknown or uninitialised column: 'vmon1r'."

Any idea what is going on here?

You have a typo: `vmonlr` -> `vmon1r`

1 Like

Oops. Thanks for pointing out my error.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.