2 groups - multiple variables

tidyverse
rstudio
statistics

#1

Greetings,

My question is perhaps overly simple, which is probably why I haven't found an answer yet. I'm a novice using R to replace SPSS, and I just want to run some basic statistics. My biggest question at the moment is how to code a t-test of Group (healthy vs. patient) on 11 behavioral variables. Do I have to code each t-test separately, or is there a way to do them all at once? I've spent a lot of time on this and ended up coding each one separately, but that is neither ideal nor practical for future work. Any help or suggestions would be greatly appreciated.

Cheers,
Jason


#2

Hi! Could you please turn this into a self-contained reprex (short for reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

install.packages("reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

What to do if you run into clipboard problems

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ, linked to below.


#3

Typically, you use an *apply() function (in base R) or one of the map_*() functions (tidyverse) to automate running the same function over a bunch of different inputs. The specifics of how you might do this depend a lot on how your data are arranged, however. Here's an example of what this might look like, but if your data are organized differently, it may not be directly applicable.

library(tidyverse)

# Create some example data
set.seed(42) # to make the example reproducible
study_data <- data.frame(
  group = factor(c(rep("healthy", 50), rep("patient", 50))),
  responseA = c(rnorm(50, mean = 20, sd = 2), rnorm(50, mean = 22, sd = 3)),
  responseB = c(rnorm(50, mean = 12, sd = 1), rnorm(50, mean = 22, sd = 1)),
  responseC = c(rnorm(50, mean = 2.5, sd = 1), rnorm(50, mean = 3.5, sd = 1)),
  responseD = c(rnorm(50, mean = 18, sd = 2), rnorm(50, mean = 18, sd = 3)),
  responseE = c(rnorm(50, mean = 0.54, sd = 1), rnorm(50, mean = 0.21, sd = 1)),
  responseF = c(rnorm(50, mean = 86.2, sd = 2), rnorm(50, mean = 74.3, sd = 1)),
  responseG = c(rnorm(50, mean = 4, sd = 2), rnorm(50, mean = 8, sd = 3))
)

[Image: 13151_t_test]
Aren't the results nice when I make up the data? :wink:

# All the t-tests, base R style
t_tests <- lapply(
  study_data[, -1], # apply the function to every variable *other than* the first one (group)
  function(x) { t.test(x ~ group, data = study_data) }
)

# All the t-tests, tidyverse style
t_tests_tidy <- map(
  select(study_data, -group),
  ~ t.test(.x ~ group, data = study_data)
)

# Same basic results, either way
t_tests$responseB
#> 
#>  Welch Two Sample t-test
#> 
#> data:  x by group
#> t = -55.861, df = 97.783, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -10.487324  -9.767745
#> sample estimates:
#> mean in group healthy mean in group patient 
#>              11.84875              21.97628
t_tests_tidy$responseB
#> 
#>  Welch Two Sample t-test
#> 
#> data:  .x by group
#> t = -55.861, df = 97.783, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -10.487324  -9.767745
#> sample estimates:
#> mean in group healthy mean in group patient 
#>              11.84875              21.97628

# Pull out specific test stats, etc.
t_tests$responseB$statistic
#>         t 
#> -55.86131

Created on 2018-08-24 by the reprex package (v0.2.0).
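
If you want the numbers from every test side by side rather than one printout at a time, something like the following might work. This is only a sketch: the helper name `summarise_test` and the two-variable simulated data set are made up for illustration, and it relies on the fact that `t.test()` returns a list-like "htest" object with components such as `statistic`, `parameter`, `p.value`, and `conf.int`.

```r
# Simulated data for illustration only (a cut-down version of the example above)
set.seed(42)
study_data <- data.frame(
  group = factor(rep(c("healthy", "patient"), each = 50)),
  responseA = c(rnorm(50, mean = 20, sd = 2), rnorm(50, mean = 22, sd = 3)),
  responseB = c(rnorm(50, mean = 12, sd = 1), rnorm(50, mean = 22, sd = 1))
)

# One Welch t-test per response variable
t_tests <- lapply(
  study_data[, -1],
  function(x) t.test(x ~ group, data = study_data)
)

# Hypothetical helper: turn one htest object into a one-row data frame
summarise_test <- function(tt) {
  data.frame(
    t       = unname(tt$statistic),
    df      = unname(tt$parameter),
    p_value = tt$p.value,
    ci_low  = tt$conf.int[1],
    ci_high = tt$conf.int[2]
  )
}

# Stack the one-row summaries into a single table, one row per variable
t_table <- do.call(rbind, lapply(t_tests, summarise_test))
t_table
```

The table has one row per response variable, which makes it easier to scan, sort, or export than a list of printed tests.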

If this doesn't apply directly to how your data are organized, it will help a lot if you can do as @tbradley suggests and post a reproducible example with sample data and code.

Edited to add: Since this was fake data and about looping, I didn't do any sort of adjustment for all these comparisons I'm making. Anybody reading this in the future: please think hard about your choice of alpha (and maybe the suitability of this specific — or general — approach) before you merrily run umpty-million t-tests.


#4

In continuation of @jcblum's comment - Please remember to adjust your p-values if you do end up throwing t-tests at the data, see ?p.adjust :slightly_smiling_face:
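
A minimal sketch of what that adjustment could look like, assuming you have one t-test per variable as in the example above. The data frame `dat` and its columns are simulated stand-ins, and Holm and Benjamini-Hochberg are shown only as two of the methods `p.adjust()` offers, not as a recommendation for any particular study.

```r
# Simulated data for illustration only: two null variables and one real difference
set.seed(1)
dat <- data.frame(
  group = factor(rep(c("healthy", "patient"), each = 30)),
  v1 = rnorm(60),
  v2 = rnorm(60),
  v3 = c(rnorm(30, mean = 0), rnorm(30, mean = 1))
)

# Every column except the grouping variable is treated as a response
resp <- setdiff(names(dat), "group")

# Run each t-test and keep only the raw p-value
p_raw <- sapply(dat[resp], function(x) t.test(x ~ group, data = dat)$p.value)

# Adjust for multiple comparisons
p_holm <- p.adjust(p_raw, method = "holm")  # controls the family-wise error rate
p_fdr  <- p.adjust(p_raw, method = "BH")    # controls the false discovery rate

data.frame(p_raw, p_holm, p_fdr)
```

Adjusted p-values are never smaller than the raw ones, so compare the adjusted column against your chosen alpha.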


#5

This is very helpful. Thank you.
However, what if my grouping variable is not the first variable? For example, in my data, 'ID' is the first variable and 'group' is the second. How would I choose variable 'x' as a grouping variable in your example?

Cheers,
Jason


#6

Thanks for the response and help. As you can see, my question is extremely basic. I am just looking for a way to simplify writing out each t.test if I have more than a couple of variables. @jcblum had a very helpful answer which would work great, but my grouping variable is not the first one in the list. Along these same lines, is coding an ANOVA as straightforward as a single t.test? I haven't found that answer yet either.

library(tidyverse)
library(reprex)
cars_dat <- mtcars

t.test(mpg ~ am, data = cars_dat)
t.test(disp ~ am, data = cars_dat)
t.test(hp ~ am, data = cars_dat)
t.test(drat ~ am, data = cars_dat)
t.test(wt ~ am, data = cars_dat)
t.test(qsec ~ am, data = cars_dat)

Cheers,
Jason


#7

Thanks for posting a reprex for discussion! As I said, the details depend on how the data is organized, so it’s useful to all be talking about the same example. In the case you proposed, not only is the predictor variable not the first one, you also only want to use a subset of the variables as responses. In that case, you might do something like this:

library(tidyverse)

cars_dat <- mtcars

resp_vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")

t_tests_base <- lapply(
  cars_dat[resp_vars],
  function(x) { t.test(x ~ am, data = cars_dat) }
)

t_tests_base$disp
#> 
#>  Welch Two Sample t-test
#> 
#> data:  x by am
#> t = 4.1977, df = 29.258, p-value = 0.00023
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>   75.32779 218.36857
#> sample estimates:
#> mean in group 0 mean in group 1 
#>        290.3789        143.5308

t_tests_tidy <- map(
  select(cars_dat, all_of(resp_vars)), # all_of() selects columns named in a character vector
  ~ t.test(.x ~ am, data = cars_dat)
)

t_tests_tidy$disp
#> 
#>  Welch Two Sample t-test
#> 
#> data:  .x by am
#> t = 4.1977, df = 29.258, p-value = 0.00023
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>   75.32779 218.36857
#> sample estimates:
#> mean in group 0 mean in group 1 
#>        290.3789        143.5308

Created on 2018-08-26 by the reprex package (v0.2.0).
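
For the layout described earlier in the thread ('ID' first, 'group' second, everything else a response), a variation on the same idea is to exclude columns by name instead of listing every response. This is a sketch on simulated data; the column names `ID`, `group`, `score1`, and `score2` are assumptions standing in for the real ones.

```r
# Simulated data for illustration only: ID first, grouping variable second
set.seed(123)
dat <- data.frame(
  ID     = 1:40,
  group  = factor(rep(c("healthy", "patient"), each = 20)),
  score1 = rnorm(40, mean = 10),
  score2 = rnorm(40, mean = 5)
)

# Everything that is neither the ID nor the grouping variable is a response
resp_vars <- setdiff(names(dat), c("ID", "group"))

# The position of 'group' no longer matters, because it is referred to by name
t_tests <- lapply(
  dat[resp_vars],
  function(x) t.test(x ~ group, data = dat)
)

names(t_tests)
```

Selecting by name rather than by position also keeps the code working if columns are later added or reordered.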

While you can treat R as a command-line controlled system for executing statistical procedures, where your goal is just to look up and run the “right” series of commands for a given task, you will get further faster if you start thinking about it as a language for statistical computing. Meaning that you familiarize yourself with the syntax that lets you express how to do things in R (for instance “take this data frame and select only these n columns from it”) so that you eventually gain the power to solve whatever challenges your data throws at you.

If you just want to translate your existing way of doing things into R terms (not a bad starting point when there is work to be done!), this book might help:


There’s also a DataCamp course from the same author:

If you want to start learning how to use R more fluently on its own terms, this thread has lots of good resources: What's your favorite intro to R?

As for ANOVA — is it as simple? If we’re talking how to run a bunch of ANOVAs on a data frame of variables, yes the same code pattern applies. If we’re talking about how to code the model itself, there are several ways to do it. ANOVA is an interesting case because there is an important difference between the sums of squares calculation preferred by pure statisticians and the one that has become conventional in biostats circles (driven partly by what SAS and SPSS decided to bake in). Base R, having been written by members of one community, defaults to a method that doesn’t please the other community. For discussion and approaches to coding One-Way ANOVA (you didn’t say what kind of model you were looking for), see:
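
To illustrate the "same code pattern" point, here is a rough sketch of several one-way ANOVAs on `mtcars`, using `cyl` as a three-level grouping factor (my choice for illustration, not something from the question). It uses base R's `aov()`, which computes sequential sums of squares, plus `oneway.test()`, which by default does not assume equal variances. For a one-way design with a single factor the sums-of-squares debate does not change the answer; it matters once a model has several terms.

```r
cars_dat <- mtcars
cars_dat$cyl <- factor(cars_dat$cyl)  # 4, 6, or 8 cylinders: three groups

resp_vars <- c("mpg", "disp", "hp")

# Classic one-way ANOVA for each response
anovas <- lapply(
  cars_dat[resp_vars],
  function(x) aov(x ~ cyl, data = cars_dat)
)
summary(anovas$mpg)

# Welch-type one-way test (no equal-variance assumption) for each response
welch_tests <- lapply(
  cars_dat[resp_vars],
  function(x) oneway.test(x ~ cyl, data = cars_dat)
)
welch_tests$mpg$p.value
```

As with the t-tests, running many of these calls for some thought about multiple comparisons.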


#8

Thanks for the detailed response. I really appreciate the advice to start thinking about R as a comprehensive language, rather than mere syntax, which is how I used SPSS. Unfortunately, I'm currently caught between work-to-finish and tackling new analyses with R. So I'll mostly be posting about specific stats questions in the near future. That said, I am really looking forward to learning how to use R more comprehensively and efficiently in my entire workflow.

Regarding ANOVA, I was initially just thinking of testing more than two groups. I had no idea there was a debate over how sums of squares are calculated, so now I have something else to look into.


#9

Been there! I think it’s rare that people have the luxury of un-pressured time to invest in skills. Sometimes expediency wins the battle of the day, and that’s perfectly okay. I find that just knowing that you’re eventually aiming beyond that helps with spotting the moments where you can squeeze in some longer term skill-building.

Good luck, and I look forward to your next question! (Obligatory link to some key guidelines: FAQ: Tips for writing R-related questions and FAQ: Tips for Introducing Non-Programming-Problem Discussions) :grinning: