Creating a new variable based on he mean or other mathematical function of existing variables?

HQuaza · May 22, 2019, 3:20pm

Hi here, I would like to know how to create a new variable that is based on the average/mean of a set of existing variables in my dataset. My data looks like this:

amb1 amb2 amb3
4 7 5
4 4 5
5 4 5
5 6 6
.........................

How do I create a new variable column based on the mean values of amb1, amb2, and amb3? I suppose I could simply type the entire formula for the mean in the following code, but I am sure there is a simpler way of using some kind of function command?

mutate(mscstart, amb =(amb1 + amb2 + amb3)/3)

Thanks!

martin.R · May 22, 2019, 3:54pm

You can use rowMeans() from base:
https://stat.ethz.ch/R-manual/R-devel/RHOME/library/base/html/colSums.html

Or rowwise() from dplyr:
https://dplyr.tidyverse.org/reference/rowwise.html

HQuaza · May 23, 2019, 3:40am

I am sorry, I am relatively new to R and am still struggling with the code based on the links you provided. I tried the following based on rowmeans but it does not work. I can't seem to select the variables I need which are amb1 to amb3. I forgot to mention that these columns are part of a larger dataset with other variables. Hence, I do have to select the specific variables to calculate their mean.

mscstart <-
mutate(mscstart,
amb = rowMeans(select(mscstart, starts_with("amb"))

amb1 amb2 amb3
4 7 5
4 4 5
5 4 5
5 6 6

Yarnabrina · May 23, 2019, 4:12am

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

mscstart <- read.table(text = 'amb1 amb2 amb3
4 7 5
4 4 5
5 4 5
5 6 6',
                       header = TRUE)

mscstart %>%
  mutate(amb = rowMeans(x = select(.data = .,
                                   starts_with(match = "amb"))))
#>   amb1 amb2 amb3      amb
#> 1    4    7    5 5.333333
#> 2    4    4    5 4.333333
#> 3    5    4    5 4.666667
#> 4    5    6    6 5.666667

# you can also do this, in case all columns start with "amb"
mscstart %>% mutate(amb = rowMeans(.))
#>   amb1 amb2 amb3      amb
#> 1    4    7    5 5.333333
#> 2    4    4    5 4.333333
#> 3    5    4    5 4.666667
#> 4    5    6    6 5.666667

^{Created on 2019-05-23 by the reprex package (v0.3.0)}

You were on the right track, just missed 2 brackets. Add two ) at the end, and bingo!

Hope this helps.

HQuaza · May 24, 2019, 3:10am

Thanks so much. Your code certainly works when the data is entered manually. However, I am still having issues when I am adding the new created "amb" variable it to an existing dataset containing other variables(See attached)

I specifically ran the following code:

mscstart %>%
           mutate(amb = rowMeans(x = select(.data =mscstart,
                                            starts_with(match = "amb"))))

Interestingly, "amb" appears in the list of variables in the tibble (see the very last variable that appears in the output) that appears in the console:

" A tibble: 127 x 88
      sn gender   age race  marital educ  generation `repeat` eatout cuisine clubjoin
   <dbl> <chr>  <dbl> <chr> <chr>   <chr> <chr>      <chr>     <dbl> <chr>   <chr>   
 1     1 F         73 Chin~ .........
..............
#   dish13 <dbl>, dish14 <dbl>, music <chr>, freeg <chr>, setmenu <chr>,
#   vmonlr <dbl>, gender.d <dbl>, race.d <dbl>, amb <dbl>"

However, when I proceed to check the the dataset using either view(mscstart) OR mscstart$amb, I get the following message:

"NULL
Warning message:
Unknown or uninitialised column: 'amb'."

Do you know what is wrong?

Thanks again.

Yarnabrina · May 24, 2019, 3:13am

It's not stored yet.

Try this:

mscstart <- mscstart %>%
mutate(amb = rowMeans(x = select(.data =mscstart,
starts_with(match = "amb"))))

Then try mscstart$amb.

On a separate note, please place your codes inside a pair of triple backticks (```). I edited your code, I hope you won't mind.

HQuaza · May 24, 2019, 3:56am

It works! Thanks! 2 more questions please:
(1) How to I select specific variables that do not begin with the same string? e.g. What if I wanted to calculate the mean based on amb1, amb2, and sercli1 and sercli2?

(2) How do I round off the calculated means to let's say 1 decimal place?

Thanks again!

mjfrigaard · May 24, 2019, 5:31am

Here are two possible solutions to your questions.

library(datapasta)
library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang

# (1) How to I select specific variables that do not begin with the same 
# string? e.g. What if I wanted to calculate the mean based on amb1, amb2, 
# and sercli1 and sercli2?

# (2) How do I round off the calculated means to let's say 1 decimal place?

DF <- data.frame(
        amb1 = c(2, 3, 2, 4, 6, 7, 3, 5),
        amb2 = c(3, 2, 4, 1, 3, 5, 6, 4),
     sercli1 = c(4, 6, 3, 5, 5, 4, 4, 6),
     sercli2 = c(3, 3, 4, 6, 6, 6, 3, 4))

DF %>% 
    dplyr::mutate(means = rowMeans(
        # this is where you can drop a select
                              dplyr::select(., 
                                            # with a dplyr::one_of()
                                            dplyr::one_of(
                                                # then add the columns
                                  c('amb1','amb2','sercli1'))), 
                              # remove any missing
                              na.rm = TRUE),
        # round the mean value
                  means = base::round(means, digits = 2))
#>   amb1 amb2 sercli1 sercli2 means
#> 1    2    3       4       3  3.00
#> 2    3    2       6       3  3.67
#> 3    2    4       3       4  3.00
#> 4    4    1       5       6  3.33
#> 5    6    3       5       6  4.67
#> 6    7    5       4       6  5.33
#> 7    3    6       4       3  4.33
#> 8    5    4       6       4  5.00

Created on 2019-05-23 by the reprex package (v0.3.0)

Hope this works!

HQuaza · May 24, 2019, 9:14am

Thanks for the advice. I managed to run the dplyr code successfully. However, I then subsequently tried to calculate the descriptive statistics for the newly created variable ("sercli") using the following code in the "pastecs" package:

mscstart %>%
stat.desc (sercli)

(where mscstart is my imported data set)

but I got this error message instead:
"Error in stat.desc(., sercli) : object 'sercli' not found"

Any idea what I am doing wrong? Is there an alternative way to get the descriptive statistics for a selected variable?

Thanks!

mjfrigaard · May 24, 2019, 9:30am

# install.packages("pastecs")
library(pastecs)
library(tidyverse)

Get data from package (I don't have your data).

marphy <- pastecs::marphy %>% as_tibble()

Reproduce this problem (i.e. with new data)

marphy %>% pastecs::stat.desc(Salinity)
#> Error in pastecs::stat.desc(., Salinity): object 'Salinity' not found

We actually need the other pipe %$% from magrittr. This is usually a good backup if the traditional pipe %>% isn't working.

library(magrittr)
marphy %$% pastecs::stat.desc(Salinity)
#>      nbr.val     nbr.null       nbr.na          min          max 
#> 6.800000e+01 0.000000e+00 0.000000e+00 3.811600e+01 3.844800e+01 
#>        range          sum       median         mean      SE.mean 
#> 3.320000e-01 2.604111e+03 3.828350e+01 3.829575e+01 1.207984e-02 
#> CI.mean.0.95          var      std.dev     coef.var 
#> 2.411146e-02 9.922728e-03 9.961289e-02 2.601147e-03

Also check out skmir package (also needs rlang)

library(skimr)
library(rlang)
skimr::skim(.data = marphy)
#> Skim summary statistics
#>  n obs: 68 
#>  n variables: 4 
#> 
#> ── Variable type:numeric ────────────────────────────────────────────────
#>      variable missing complete  n  mean    sd    p0   p25   p50   p75
#>       Density       0       68 68 28.96 0.076 28.81 28.89 28.97 29.03
#>  Fluorescence       0       68 68  1.17 0.28   0.72  0.97  1.06  1.45
#>      Salinity       0       68 68 38.3  0.1   38.12 38.21 38.28 38.4 
#>   Temperature       0       68 68 13.01 0.081 12.82 12.93 13.03 13.07
#>   p100     hist
#>  29.06 ▃▃▂▂▅▃▆▇
#>   1.69 ▂▃▇▂▂▂▂▃
#>  38.45 ▃▆▃▅▅▂▇▆
#>  13.15 ▂▂▃▁▅▇▆▁

It provides loads of great summary stats.

Hope this helps!

^{Created on 2019-05-24 by the reprex package (v0.3.0)}

HQuaza · May 24, 2019, 2:45pm

Works really well. Thanks for introducing me to these great packages.

HQuaza · May 27, 2019, 7:18am

hi there, i have been playing around with my dataset in R more and have more queries if you don't mind. As before, I have named my dataset "mscstart).

I tried creating a new variable (e.g. vmon1r) based on the recode of an existing variable in my data. For instance, the formula I used was "vmon1r = 8 - vmon1" where vmon1 is the existing variable. Based on the previous examples, I used the following code:

mscstart <- mscstart %>%
mutate(mscstart,
vmonlr = 8 - vmon1)

I checked the my data using "view(mscstart)" and saw that an additional column was indeed created. I even see it in the summary I generated using "summary(mscdata). However, when I wanted to get the details of this new variable using "mscstart$vmon1r" , I received the following warning message:

"Warning message:
Unknown or uninitialised column: 'vmon1r'."

Any idea what is going on here?

mishabalyasin · May 27, 2019, 7:39am

You have a typo: vmonlr -> vmon1r

HQuaza · May 31, 2019, 8:14am

Oops. Thanks for pointing out my error.

system · June 21, 2019, 8:14am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.