How do I collapse (join) data by columns?

Hi all, I am a beginner with R and this is my second post in RSTudio community. I want to concentrate the data from two variables in R Studio column-wise.

My dataframe looks like this
|Subject|Group|Variable @time 1|@time 2|@time3…|
| A | 1 | --- | --- | --- |
| B | 1 | --- | --- | --- |
| C | 2 | --- | --- | --- |
| D | 2 | --- | --- | --- |

I want to collapse the time points so the average of 1 and 2 are together, 3 and 4 are together etc
I also want to find a mean for each group of subjects at each time point

Does anyone have any ideas of where to start? I thought the mutate function would work or full_join but I have got into a bit of confusion

Sorry about my formatting of this question. I really am very new to R Studio

Thank you so much for your help

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

1 Like

If I have done this correctly here is a reprex:

SUBJECT SYMPTOMATIC PROG.11 PROG.10 PROG.9 PROG.8 PROG.7
AA YES 76.27 137.103 143.811 361.197
AB YES 187.995 294.134 300.879 313.079 211.899
AD YES 165.59 193.45 130.65 113.41
AF YES 303.014 233.919 115.764 272.676 261.404
AG YES 78.008 272.676 147.447 123.496 166.615
AH YES 264.179 586.99 165.701 95.292 399.361

In the symptomatic column, some are 'NO' towards the bottom.

The PROG.X are repeated measures of one variable across different time points. As there is quite a bit of missing data, I am trying to collapse them together as averages of two time points.

After that I will try to produce an average for subjects in each group for each time point.

This is my first time dealing with repeated measures data on R Studio or similar so I am really confused, apologies.

I really am not sure about the minimal runnable code...

That is not actually copy/paste friendly, and it is not clear to me what your desired output is, maybe a sample of the desired output would be useful to understand your issue but in the meanwhile, I'm going to make a reprex of what I understand from what you have said so we can go from there.

library(dplyr)

sample_df <- data.frame(stringsAsFactors=FALSE,
       SUBJECT = c("AA", "AB", "AD", "AF", "AG", "AH"),
   SYMPTOMATIC = c("YES", "YES", "YES", "NO", "NO", "NO"),
       PROG.11 = c(76.27, 187.995, NA, 303.014, 78.008, 264.179),
       PROG.10 = c(137.103, 294.134, 165.59, 233.919, 272.676, 586.99),
        PROG.9 = c(NA, 300.879, 193.45, 115.764, 147.447, 165.701),
        PROG.8 = c(143.811, 313.079, 130.65, 272.676, 123.496, 95.292),
        PROG.7 = c(361.197, 211.899, 113.41, 261.404, 166.615, 399.361)
)

sample_df %>% 
    group_by(SYMPTOMATIC) %>% 
    summarise_at(vars(starts_with("PROG")), mean, na.rm = TRUE) # This calculates the mean of all variables by SYMPTOMATIC
#> # A tibble: 2 x 6
#>   SYMPTOMATIC PROG.11 PROG.10 PROG.9 PROG.8 PROG.7
#>   <chr>         <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1 NO             215.    365.   143.   164.   276.
#> 2 YES            132.    199.   247.   196.   229.

Created on 2019-12-10 by the reprex package (v0.3.0.9000)

Oh dear! I produced it using datapasta but I think I misunderstood how to paste the reprex on here

PMS <- tibble::tribble(
~SUBJECT, ~SYMPTOMATIC, ~PROG.11, ~PROG.10, ~PROG.9, ~PROG.8, ~PROG.7,
"AA", "YES", 76.27, 137.103, NA, 143.811, 361.197,
"AB", "YES", 187.995, 294.134, 300.879, 313.079, 211.899,
"AD", "YES", NA, 165.59, 193.45, 130.65, 113.41,
"AF", "YES", 303.014, 233.919, 115.764, 272.676, 261.404,
"AG", "YES", 78.008, 272.676, 147.447, 123.496, 166.615,
"AH", "YES", 264.179, 586.99, 165.701, 95.292, 399.361
)
head(PMS)

Perhaps that is more useful? I don't know how to see what it will look like until I reply.

So, I am not trying to group by symptomatic yes/ no. I am trying to group the prog. columns so that the average of prog.11 + prog.10 are together as a new column prog.11_10.

Ok, that didn't work with the reprex, let me try again

Grouping in this manner doesn't seem tidy and I don't understand what the grouping criteria would be, so if you want to create the groupings manually you could do something like this

library(tidyverse)

sample_df <- data.frame(stringsAsFactors=FALSE,
       SUBJECT = c("AA", "AB", "AD", "AF", "AG", "AH"),
   SYMPTOMATIC = c("YES", "YES", "YES", "YES", "YES", "YES"),
       PROG.11 = c(76.27, 187.995, NA, 303.014, 78.008, 264.179),
       PROG.10 = c(137.103, 294.134, 165.59, 233.919, 272.676, 586.99),
        PROG.9 = c(NA, 300.879, 193.45, 115.764, 147.447, 165.701),
        PROG.8 = c(143.811, 313.079, 130.65, 272.676, 123.496, 95.292),
        PROG.7 = c(361.197, 211.899, 113.41, 261.404, 166.615, 399.361)
)

sample_df %>% 
    rowwise() %>% 
    mutate(PROG.11_10 = mean(c(PROG.11, PROG.10), na.rm = TRUE))
#> Source: local data frame [6 x 8]
#> Groups: <by row>
#> 
#> # A tibble: 6 x 8
#>   SUBJECT SYMPTOMATIC PROG.11 PROG.10 PROG.9 PROG.8 PROG.7 PROG.11_10
#>   <chr>   <chr>         <dbl>   <dbl>  <dbl>  <dbl>  <dbl>      <dbl>
#> 1 AA      YES            76.3    137.    NA   144.    361.       107.
#> 2 AB      YES           188.     294.   301.  313.    212.       241.
#> 3 AD      YES            NA      166.   193.  131.    113.       166.
#> 4 AF      YES           303.     234.   116.  273.    261.       268.
#> 5 AG      YES            78.0    273.   147.  123.    167.       175.
#> 6 AH      YES           264.     587.   166.   95.3   399.       426.

Although I think it would be better if you work with your data in a long format and try to define a grouping criteria, this would be your sample data in long format.

sample_df %>% 
    pivot_longer(starts_with("PROG"),
                 names_to = "TIME",
                 values_to = "PROG",
                 names_pattern = "(\\d+)") %>%
    mutate(TIME = as.numeric(TIME)) %>% 
    arrange(SUBJECT, SYMPTOMATIC, TIME)
#> # A tibble: 30 x 4
#>    SUBJECT SYMPTOMATIC  TIME  PROG
#>    <chr>   <chr>       <dbl> <dbl>
#>  1 AA      YES             7 361. 
#>  2 AA      YES             8 144. 
#>  3 AA      YES             9  NA  
#>  4 AA      YES            10 137. 
#>  5 AA      YES            11  76.3
#>  6 AB      YES             7 212. 
#>  7 AB      YES             8 313. 
#>  8 AB      YES             9 301. 
#>  9 AB      YES            10 294. 
#> 10 AB      YES            11 188. 
#> # … with 20 more rows

Ah you might be right about the long format. Perhaps I need to go back to the beginning and rethink.

I put it in the wide format as I thought it would be better to analyse with ANOVA later (although not sure I will get there at this point).

You can easily go back and forth between long and wide format as needed, for example, I'm going to use an arbitrary grouping criteria on the long format data, calculate the means and then change it back to wide.

sample_df %>% 
    pivot_longer(starts_with("PROG"),
                 names_to = "TIME",
                 values_to = "PROG",
                 names_pattern = "(\\d+)") %>%
    mutate(TIME = as.numeric(TIME),
           TIME_GROUP = if_else(TIME <= 9, 1, 2)) %>% # Grouping criteria
    group_by(SUBJECT, TIME_GROUP) %>%
    summarise(MEAN_PROG = mean(PROG, na.rm = TRUE)) %>% 
    pivot_wider(id_cols = "SUBJECT",
                names_from = "TIME_GROUP",
                values_from = "MEAN_PROG",
                names_prefix = "PROG_")
#> # A tibble: 6 x 3
#> # Groups:   SUBJECT [6]
#>   SUBJECT PROG_1 PROG_2
#>   <chr>    <dbl>  <dbl>
#> 1 AA        253.   107.
#> 2 AB        275.   241.
#> 3 AD        146.   166.
#> 4 AF        217.   268.
#> 5 AG        146.   175.
#> 6 AH        220.   426.
1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.