Iteration with aov/t.test

Dear experts of R

I am really struggling against all the packages (purrr, tidyverse (summarise, mutate..))
It is impossible to me to write a code to perform iterative analysis, and I posted more than one post but nobody has known how to solve it more than the tidy part (at least with my database and map function/summarize). I've tried so much stuff that I am goint to reduce the actions

Let say I have "h"

tibble [1,683 x 51] (S3: tbl_df/tbl/data.frame)
 $ paciente    : num [1:1683] 6430 6430 6430 6494 6494 ...
  ..- attr(*, "format.spss")= chr "F5.0"
 $ time        : Factor w/ 3 levels "00","01","66": 1 3 2 1 3 2 1 3 2 1 ...
 $ peso1       : num [1:1683] 115.4 110 114.5 76.2 73.6 ...
 $ cintura1    : num [1:1683] 123 118 NA 106 104 ...
 $ tasis2_e    : num [1:1683] 139 147 NA 129 136 NA 136 138 NA 138 ...
 $ tadias2_e   : num [1:1683] 78 85 NA 63 71 NA 71 72 NA 76 ...

The object "h" has 3 levels of time (factor), with an ID (patient). The goal is compare values across time. The "h" object comes from

h <- BBDD_PPLUS %>% 
  select(paciente, matches("_v00|_v01|_v66")) %>% 
  pivot_longer(-paciente) %>% 
  separate(name, into=c("name", "time"), sep="_v") %>% 
  pivot_wider(names_from= name, values_from=value)

I have tried with map function creating a function containing aov but several error came up, so I incline towards:

 h %>% 
+     dplyr::summarise(across(output = ~aov(.x ~ time), .cols = everything()))

It is not working at all, because when I perform

identical(h, k)

TRUE

I have also tried

BBDD_PPLUS %>% 
    select(paciente, matches("_v00|_v01|_v66")) %>% 
    pivot_longer(-paciente) %>% 
    separate(name, into=c("name", "time"), sep="_v") %>% 
    pivot_wider(names_from= name, values_from=value) %>% 
    group_by(time) %>% mutate(time = factor(time)) %>% 
    dplyr::filter(time == "00" | time ==  "01" | time ==  "66") %>%
    as.data.frame() %>% 
    map(.x = ., .f = ~aov(.x ~ time))

Error in model.frame.default(formula = .x ~ time, drop.unused.levels = TRUE) : variable lengths differ (found for 'time')

BBDD_PPLUS %>% 
    select(paciente, matches("_v00|_v01|_v66")) %>% 
    pivot_longer(-paciente) %>% 
    separate(name, into=c("name", "time"), sep="_v") %>% 
    pivot_wider(names_from= name, values_from=value) %>% 
    group_by(time) %>% mutate(time = factor(time)) %>% 
    dplyr::filter(time == "00" | time ==  "01" | time ==  "66") %>%
    as.data.frame() %>% 
    map(.x = is.numeric, .f = ~aov(.x ~ time))

Error: .x must be a vector, not a primitive function

As a precursor to finding the best code to solve your problem, I want to offer you help in providing the forum with example data, that forum users can easily access, if they are inclined to look at your issue.

In general you want to find a way to communicate a small representative dataset in a forum friendly way. I will show you an example of how to do this using the built in iris dataset, and on the assumption that you are confident that your h dataset is a reasonable starting point from which to do multiple aov's this should help you share a version of h with us.

ok, consider iris, it is 150 records, 50 from each species, and with 4 additional columns of info.
lets say we only want Species and 2 columns, we use select to pick them. Lets say we dont want 150 rows, but we do want 20 rows of each Species factor type, we can use group_by and slice_sample functions. finally we can convert the data.frame into a forum friendly copy and pasteable representation to share.
e.g.

library(tidyverse)

iris %>% select(Species,Sepal.Length,Sepal.Width) %>% 
  group_by(Species) %>% 
  slice_sample(n = 20,replace=TRUE) %>% ungroup() %>% dput()

resulting in us being able to share to the forum

mydf <- structure(list(Species = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L), .Label = c("setosa", "versicolor", "virginica"
), class = "factor"), Sepal.Length = c(5.7, 5.4, 4.6, 4.4, 5.3, 
5.1, 5, 5.1, 4.6, 5.1, 5.1, 4.9, 4.8, 5, 4.9, 4.8, 5, 5.4, 4.9, 
5.2, 6, 6, 7, 6.6, 5.5, 5.9, 5.9, 5.7, 6.1, 5.5, 5.5, 6.9, 5.6, 
6.2, 5.7, 5.7, 6.2, 5, 5.6, 6, 6.4, 6.4, 7.4, 6, 6.8, 6.5, 6.9, 
6.3, 5.8, 6.4, 6.7, 6.7, 6.3, 6.3, 6.5, 6.2, 6.7, 6.7, 7.2, 7.2
), Sepal.Width = c(4.4, 3.9, 3.2, 3, 3.7, 3.3, 3.3, 3.7, 3.6, 
3.5, 3.3, 3, 3, 3, 3.1, 3, 3.6, 3.7, 3.6, 3.4, 2.2, 2.9, 3.2, 
2.9, 2.3, 3.2, 3, 2.6, 2.8, 2.3, 2.3, 3.1, 2.9, 2.2, 3, 2.6, 
2.2, 2.3, 2.5, 2.7, 3.1, 2.8, 2.8, 2.2, 3.2, 3, 3.2, 2.9, 2.8, 
2.7, 3, 2.5, 2.7, 2.9, 3.2, 3.4, 3.3, 3.3, 3, 3)), row.names = c(NA, 
-60L), class = c("tbl_df", "tbl", "data.frame"))

Therefore perhaps try:


h %>%
  select(
    paciente,
    time,
    peso1,
    cintura1,
    tasis2_e,
    tadias2_e
  ) %>%
  group_by(time) %>%
  slice_sample(n = 20, replace = TRUE) %>%
  ungroup() %>%
  dput()

and share this output with us

First of all thank you, because I've never seen how to paste my data frame to make it available to other users and then make easier help me out

h<-structure(list(time = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L), .Label = c("00", "01", "66"), class = "factor"), 
    peso1 = c(95.9, NA, 103.4, NA, 88.9, 125.9, NA, 93.4, 77.3, 
    NA, 100.7, 88, 69.7, NA, NA, 74.6, 69.4, 86.8, 83.1, NA, 
    85, NA, 74.6, NA, 92.6, NA, 75, 82.8, 93, 114.7, 95.9, NA, 
    86.8, NA, 59.5, 99, 94.5, 92.9, NA, 74.3, 77.8, 102.6, 116, 
    NA, 105.5, NA, 63.2, NA, 70.3, 90.2, 89, NA, 79, 75.2, 116, 
    79, 80.9, NA, 76.3, NA), cintura1 = c(113, NA, 119, NA, 108, 
    131.8, NA, 110, 104, NA, 116.2, 103, 102.5, NA, NA, 101.5, 
    95, 107, 110, NA, 103.5, NA, 100, NA, 115, NA, 98, 108, 110, 
    120, 113, NA, 118, NA, 88, NA, 120, 114, NA, 104, 100, 120, 
    122, NA, 117, NA, 88, NA, 98, 112.5, 113, NA, 108, 105.5, 
    122, 103, 106.8, NA, 104.5, NA), tasis2_e = c(131, NA, 137, 
    NA, 138, 161, NA, 138, 146, NA, 117, 115, 176, NA, NA, 146, 
    131, 145, 146, NA, 114, NA, 121, NA, 138, NA, 137, 115, 152, 
    100, 128, NA, 135, NA, 122, NA, 124, 154, NA, 123, 105, 139, 
    144, NA, 145, NA, 130, NA, 111, 107, NA, NA, 122, 145, 144, 
    131, 138, NA, 145, NA)), row.names = c(NA, -60L), class = c("tbl_df", 
"tbl", "data.frame"))

Then, I have numerical variables (NAs present), including the ID variable (paciente). I have the time variable as a factor, to compare the quantitative columns (peso1 (weight), cintura1 (waist), tasis2_e (systolic pressure)..) along time. The time variable has 3 factors:

  • 00: basal measurement
  • 66: + 6 months
    -01: 12 months

Then I tried several actions to perform iteratively (preferrably with map functions) statistical analysis comparing variables within time.

t.test- iterative

h %>% filter(time == "00" | time== "01") %>% map2_dbl(.x = .x$time = 00, .y == .y$time = 01, .f = t.test( , na.rm = TRUE))

I don't know how to make the syntax work, I've read all the suggerence about placeholders with map and tidyverse packages, but I do not find an example like mine, and sticking to the theory is not working. How to tell that I need to execute from column3 to column50 t.test between all the combinations (00vs 01, 00 vs 66, 66 vs01)

I am trying with aov as well, thiniking that not having to name the levels of time here is going to be easy to make it work

h %>% 
map_dbl(is.numeric, .f = ~aov(.x ~ time))

Error: .x must be a vector, not a primitive function

However it is working with more simple functions such a mean

h %>% map_dfr(is.numeric, .f = ~ mean(.x, na.rm = TRUE))

Thanks in advance for all the effort in helping me out

lets see what we think of t.test before we move onto aov.
I made an assumption that you only want to keep the p.value result of the t.test, let me know if thats incorrect.

library(tidyverse)
#it seems when you have na's in one field, you have them in other fields, so there doesnt seem a purpose to keeping them...
h_no_na <- na.omit(h)

#what times are in the data ?
(time_contents <- unique(h_no_na$time))

#make pairs of times 
(combinations_of_time <- combn(x = time_contents,
                               m=2,
                               simplify = FALSE))
# come up with a naming convention for them
(test_names <- map_chr(combinations_of_time,
                       ~paste0(as.character(.x),collapse="_compare_")))

#for each combination of time, filter for that time on a left frame, and filter for the other for a right frame and compare across all other variables
# that are not time or Id (paciente)




(my_ttest_results <- map_dfr(combinations_of_time,
    ~{
      left_frame <- h_no_na %>% filter(time == .x[[1]])
      right_frame <- h_no_na %>% filter(time==.x[[2]])
      varnames <- setdiff(names(left_frame),c("time","paciente"))
      cat("Now processing time combination ", as.character(.x),"\n")
      results_for_all_vars <- map_dbl(varnames,
                                  ~{
                                    cat("Now processing variable ", .x,"\n")
                                    t.test(x=left_frame[[.x]],
                                           y=right_frame[[.x]])$p.value
                                  }) %>% set_names(paste0("pvalue_for_",varnames))
      
    }))

my_ttest_results$test_names <- test_names
my_ttest_results

Thanks in advance

  • Well, I want to keep the other statistics as well, not only pvalue
  • Not sure why you omit na's for the operations but if I do not take out the NA's is giving me "Not enough observations" warning
  • On the other hand, something must be wrong. I have done calculus manually and p.value I get for peso 66 vs 01 is 0.5621. It does not match with the result obtained from the code (either with cintura and tasis). Not sure what is wrong about the syntax here but is not working well

Thank you so much for the help. This code is hard for me to understand.

In the below I retain more than only the pvalue, I retain the full t.test result, without knowing what you do or dont want from this object into a data.frame, I wont go further than that.
I omitted NA's because you have entire rows with nothing but NA's. therefore what is the value of these rows ?
regardless in the below I include them. I don't know what this means for your analysis. Furthermore, omitting the NA's on the sample you provided does not result in the warning you shared, this makes it clear that your own data rather the sample you provided to me is being evaluated (and I don't have access to that to analyse it). The most natural explanation would be is that you have some column so heavily poisoned with NA's that omitting NA's from the data.frame gets rid of more than gets eliminated in the column restricted sample you shared.

I don't know whether your manual p.value calculation was limited to the data in the sample you provided or to your true data...
If you manually calculate the pvalue for

t.test(1:10, y = c(7:20, 200))

do you get 0.1245 ?

rewritten solution :

#what times are in the data ?
(time_contents <- unique(h$time))

#make pairs of times 
(combinations_of_time <- combn(x = time_contents,
                               m=2,
                               simplify = FALSE))
# come up with a naming convention for them
(test_names <- map_chr(combinations_of_time,
                       ~paste0("ttest_",as.character(.x),collapse="_compare_")))

#for each combination of time, filter for that time on a left frame, and filter for the other for a right frame and compare across all other variables
# that are not time or Id (paciente)




(my_ttest_results <- map(combinations_of_time,
                             ~{
                               left_frame <- h %>% filter(time == .x[[1]])
                               right_frame <- h %>% filter(time==.x[[2]])
                               varnames <- setdiff(names(left_frame),c("time","paciente"))
                               cat("Now processing time combination ", as.character(.x),"\n")
                               results_for_all_vars <- map(varnames,
                                                               ~{
                                                                 cat("Now processing variable ", .x,"\n")
                                                                 t.test(x=left_frame[[.x]],
                                                                        y=right_frame[[.x]])
                                                               }) %>% set_names(paste0("ttest_results_for_",varnames))
                               
                             }))

names(my_ttest_results) <- test_names
my_ttest_results

Perhaps I haven't explained myself correctly

  • "In the below I retain more than only the pvalue, I retain the full t.test result, without knowing what you do or dont want from this object into a data.frame, I wont go further than that". Perfect
  • About the NA's. Not important to the analysis but YES for the context. The clinical interventions must be quantified along time, and the NA's gives information about this. Totally agree on the pure mathematical aspect of the analysis
  • I was saying that NOT omitting NA's in the previous is giving the error "Not enough observations" WITH my true database. Nothing to add.
  • About the results of comparing manually the t.test against each other. I am repeating again with the data I provided (NOT with my true database) just the way I posted
h<-structure(list(time = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L), .Label = c("00", "01", "66"), class = "factor"), 
    peso1 = c(95.9, NA, 103.4, NA, 88.9, 125.9, NA, 93.4, 77.3, 
    NA, 100.7, 88, 69.7, NA, NA, 74.6, 69.4, 86.8, 83.1, NA, 
    85, NA, 74.6, NA, 92.6, NA, 75, 82.8, 93, 114.7, 95.9, NA, 
    86.8, NA, 59.5, 99, 94.5, 92.9, NA, 74.3, 77.8, 102.6, 116, 
    NA, 105.5, NA, 63.2, NA, 70.3, 90.2, 89, NA, 79, 75.2, 116, 
    79, 80.9, NA, 76.3, NA), cintura1 = c(113, NA, 119, NA, 108, 
    131.8, NA, 110, 104, NA, 116.2, 103, 102.5, NA, NA, 101.5, 
    95, 107, 110, NA, 103.5, NA, 100, NA, 115, NA, 98, 108, 110, 
    120, 113, NA, 118, NA, 88, NA, 120, 114, NA, 104, 100, 120, 
    122, NA, 117, NA, 88, NA, 98, 112.5, 113, NA, 108, 105.5, 
    122, 103, 106.8, NA, 104.5, NA), tasis2_e = c(131, NA, 137, 
    NA, 138, 161, NA, 138, 146, NA, 117, 115, 176, NA, NA, 146, 
    131, 145, 146, NA, 114, NA, 121, NA, 138, NA, 137, 115, 152, 
    100, 128, NA, 135, NA, 122, NA, 124, 154, NA, 123, 105, 139, 
    144, NA, 145, NA, 130, NA, 111, 107, NA, NA, 122, 145, 144, 
    131, 138, NA, 145, NA)), row.names = c(NA, -60L), class = c("tbl_df", 
"tbl", "data.frame"))

And then manually execute the test

h_00<-h %>% dplyr::filter(time =="00")
h_01<- h%>% dplyr::filter(time == "01")
h_66<-  h %>% dplyr::filter(time == "66")

t.test(h_01$peso1, h_66$peso1)
t.test(h_01$cintura1, h_66$cintura1)
t.test(h_01$tasis2_e, h_66$tasis2_e)

t.test(h_00$peso1, h_66$peso1)
t.test(h_00$cintura1, h_66$cintura1)
t.test(h_00$tasis2_e, h_66$tasis2_e)

t.test(h_00$peso1, h_01$peso1)
t.test(h_00$cintura1, h_01$cintura1)
 t.test(h_00$tasis2_e, h_01$tasis2_e)

I do not understand why the results are different.
Doing it, just tasis2_e matches both p-results (your method and manually with the code I posted)
I must be missing sthg out.

About this

t.test(1:10, y = c(7:20, 200))

This is correct

I get alignment between my code and your 'manual' calculation. He's peso1 for example

h_00<-h %>% dplyr::filter(time =="00")
h_01<- h%>% dplyr::filter(time == "01")
h_66<-  h %>% dplyr::filter(time == "66")

t.test(h_01$peso1, h_66$peso1)$p.value
my_ttest_results$ttest_01_compare_ttest_66$ttest_results_for_peso1$p.value

t.test(h_00$peso1, h_66$peso1)$p.value
my_ttest_results$ttest_00_compare_ttest_66$ttest_results_for_peso1$p.value

t.test(h_00$peso1, h_01$peso1)$p.value
my_ttest_results$ttest_00_compare_ttest_01$ttest_results_for_peso1$p.value

> t.test(h_01$peso1, h_66$peso1)$p.value
[1] 0.9960743
> my_ttest_results$ttest_01_compare_ttest_66$ttest_results_for_peso1$p.value
[1] 0.9960743
> 
> t.test(h_00$peso1, h_66$peso1)$p.value
[1] 0.7753007
> my_ttest_results$ttest_00_compare_ttest_66$ttest_results_for_peso1$p.value
[1] 0.7753007
> 
> t.test(h_00$peso1, h_01$peso1)$p.value
[1] 0.7493917
> my_ttest_results$ttest_00_compare_ttest_01$ttest_results_for_peso1$p.value
[1] 0.7493917

OT
As a quick followup to @nirgrahamuk 's code, at the very simplest, a handy way to supply sample data is to use the dput() function. See ?dput. If you have a very large data set then something like head(dput(myfile), 100) will likely supply enough data for us to work with.

It is right what you say. I must have typed sthg wrong
Once again, thank you so much for the help

I guess that making it with aov or creating difference valuable d_peso_v66 = peso1_v66 -peso1_v00 can be executed just changint the function

I am not sure if it is poisoned with too many NA's but with my original database if I do not omit NA's I get:

Error in t.test.default(x = left_frame[[.x]], y = right_frame[[.x]]) :
not enough 'x' observations

On the other hand if I omit NA's but I run it wiht the argument "paired = TRUE" because I have time measurements of the same patient, the what I am getting is

Error in complete.cases(x, y) : not all arguments have the same length

I guess that removing NA's is making my observations unequal

I don't really get why you did not need to define

3rd_frame <- h %>% filter(time == .x[[3]])

In fact, gives error if you do it. But there is a combination that is performed. I dont get why it works without being defined

that stuff is called left_frame and right_frame in my code and combinations_of_time handles the ... well, you know !

I get how the stuff is called, and what is referred to that. I just wonder why is possible to define just [[1]] and [[2]] and NOT [[3]]

Combinations of time is a list of objects each with exactly two parts. If they had three parts then selecting the third would be possible.

1 Like

If i understand correctly, which I dont think I do, the below should help you.

Are you trying to compare all of the continuous variables based on time?

library(tidyverse)

h_tidy <- h %>% 
  na.omit() %>% 
  pivot_longer(cols = -time) 

h_tidy %>%
  group_split(name) %>% 
  map(~{
    aov(value ~ time, data = .x) %>% 
      broom::tidy() %>% 
      mutate(name = unique(.x$name))
  }) %>% 
  bind_rows()