How to subset all rows measured using different measurement tools from a data frame for the longitudinal study

yalem · June 10, 2020, 12:57pm

I have a dataset consisting of repeated measures(4 waves), one data point per row. These data are from a longitudinal study, and measured two different tools based on the study subjects age; therefore, at the moment, not every ID has assessed using one measuring tool at wave one.
the data looks like

subject_id  wave AgeinMonths         BITSEA           CBCL
1           1 wave1          10              7           <NA>
2           1 wave2          38 Not applicable             66
3           1 wave3          56 Not applicable             73
4           1 wave4          76 Not applicable             87
5           2 wave1          38 Not applicable             57
6           2 wave2          46 Not applicable             63
7           2 wave3          66 Not applicable             77
8           3 wave1          23             27           <NA>
9           3 wave2          41 Not applicable             38
10          4 wave1          35             28           <NA>
11          4 wave2          56 Not applicable             70
12          4 wave3          76 Not applicable             77
13          4 wave4          98 Not applicable             97
14          5 wave1          36 Not applicable             66
15          5 wave2          54 Not applicable             45
16          5 wave3          74 Not applicable             67
17          5 wave4          92 Not applicable             68
18          6 wave1          38 Not applicable             54
19          6 wave2          58 Not applicable             63
20          7 wave1          11             23 Not applicable
21          7 wave2          36 Not applicable             45
22          7 wave3          49 Not applicable             51
23          7 wave4          69 Not applicable             65
24          8 wave1          17             26 Not applicable
25          8 wave2          37 Not applicable             45
26          8 wave3          55 Not applicable             57
27          9 wave1          36 Not applicable             56
28         10 wave1          36 Not applicable             45
29         10 wave2          54 Not applicable             55
30         10 wave3          72 Not applicable             65
31         10 wave4          90 Not applicable             71

So,
subject 1,4,7 has four repeated measures BITSEA at wave 1 and 2,3 and 4 CBCL
subject 5 & 7 have four repeated measures CBCL at all waves

I want to be able to extract data points involved in all repeated measures. I wish to obtain those with four repeated measures using BITSEA if subjects are between 9 and 35 months else CBCL at wave 1.
The background of my question is to create a data frame for those who have measured in all waves subsetting subjects measured at wave 1 using either BITSEA or CBCL.
Any thoughts, please?

DavoWW · June 10, 2020, 1:28pm

Hi @yalem,
You need to first get your data into a nice "tidy" format.
If I understand correctly, the first few lines of your data file should look like this:

index subject_id wave AgeinMonths BITSEA CBCL
1 1 wave1 10 7 NA
2 1 wave2 38 NA 66
3 1 wave3 56 NA 73
4 1 wave4 76 NA 87
5 2 wave1 38 NA 57
6 2 wave2 46 NA 63
7 2 wave3 66 NA 77
8 3 wave1 23 27 NA
9 3 wave2 41 NA 38
10 4 wave1 35 28
11 4 wave2 56 NA 70

Then you can read this data with

read.table()

HTH

yalem · June 10, 2020, 1:53pm

Thanks the first col is index and not related to my data analysis.

mfherman · June 10, 2020, 4:38pm

Could you post your data in a copy and paste friendly format? You could use dput() to do this.

yalem · June 11, 2020, 12:00am

Thank you , here is the data in a copy and paste format.

structure(list(
  subject_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
                 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 10L, 10L, 10L, 10L), 
   wave = c("wave1", "wave2", "wave3", 
             "wave4", "wave1", "wave2", "wave3", "wave1", "wave2", "wave1", 
            "wave2", "wave3", "wave4", "wave1", "wave2", "wave3", "wave4",                                                                         
            "wave1", "wave2", "wave1", "wave2", "wave3", "wave4", "wave1",                                
            "wave2", "wave3", "wave1", "wave1", "wave2", "wave3", "wave4"),                                                                            
  AgeinMonths = c(10L, 38L, 56L, 76L, 38L, 46L, 66L, 23L, 41L,35L, 56L, 76L, 98L, 36L,
                  54L, 74L, 92L, 38L, 58L, 11L, 36L, 49L, 69L, 17L, 37L, 55L, 36L, 36L, 54L, 72L, 90L), 
    BITSEA = c("7","Not applicable", "Not applicable", "Not applicable", "Not applicable", 
               "Not applicable", "Not applicable", "27", "Not applicable", "28", 
               "Not applicable", "Not applicable", "Not applicable", "Not applicable", 
               "Not applicable", "Not applicable", "Not applicable", "Not applicable", 
               "Not applicable", "23", "Not applicable", "Not applicable", "Not applicable", 
               "26", "Not applicable", "Not applicable", "Not applicable", "Not applicable", 
               "Not applicable", "Not applicable", "Not applicable"),                                
 CBCL = c(NA, "66", "73", "87", "57", "63", "77", NA, "38", NA, "70", "77",  "97", "66", "45", "67", "68", "54", "63",
"Not applicable", "45",  "51", "65", "Not applicable", "45", "57", "56", "45", "55", "65,  "71"))

mfherman · June 11, 2020, 1:01pm

Can you post your expected outcome? I'm not able to follow from your description what you want the data to look like?

yalem · June 11, 2020, 2:45pm

The expected outcome could be

subject_id Wave           Age        BITSEA  CBCL   n
    1       1               10            7   NA    1
    1       2               38            NA  66    1
    1       3               56            NA  86    1
    1       4               76            NA  74    1
    5       1               36            NA  56    1
    5       2               44            NA  74    1
    5       3               88            NA  80    1
    5       4               120           NA  98    1

So, the expected data out would be counting the study subject measured in all four waves, subdivided into individuals whose first measure was BITSEA and those whose first measure was CBLBITSEA or CBCL.

DavoWW · June 12, 2020, 3:46am

Hi @yalem,
After a bit of editing, I think this code should get you closer to your desired output:

subject_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
               6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 10L, 10L, 10L, 10L) 
wave = c("wave1", "wave2", "wave3", 
         "wave4", "wave1", "wave2", "wave3", "wave1", "wave2", "wave1", 
         "wave2", "wave3", "wave4", "wave1", "wave2", "wave3", "wave4",                                                                         
         "wave1", "wave2", "wave1", "wave2", "wave3", "wave4", "wave1",                                
         "wave2", "wave3", "wave1", "wave1", "wave2", "wave3", "wave4")                                                                            
AgeinMonths = c(10L, 38L, 56L, 76L, 38L, 46L, 66L, 23L, 41L,35L, 56L, 76L, 98L, 36L,
                54L, 74L, 92L, 38L, 58L, 11L, 36L, 49L, 69L, 17L, 37L, 55L, 36L, 
                36L, 54L, 72L, 90L) 
BITSEA = c("7","Not applicable", "Not applicable", "Not applicable", "Not applicable", 
           "Not applicable", "Not applicable", "27", "Not applicable", "28", 
           "Not applicable", "Not applicable", "Not applicable", "Not applicable", 
           "Not applicable", "Not applicable", "Not applicable", "Not applicable", 
           "Not applicable", "23", "Not applicable", "Not applicable", "Not applicable", 
           "26", "Not applicable", "Not applicable", "Not applicable", "Not applicable", 
           "Not applicable", "Not applicable", "Not applicable")                                
CBCL = c("Not applicable", "66", "73", "87", "57", "63", "77", "Not applicable", "38", 
         "Not applicable", "70", "77",  "97", "66", "45", "67", "68", "54", "63", 
         "Not applicable", "45",  "51", "65", "Not applicable", "45", "57", "56", 
         "45", "55", "65",  "71")

df <- data.frame(subject_id, wave, AgeinMonths, BITSEA, CBCL, stringsAsFactors=FALSE)
df

library(tidyverse)

df %>% 
  mutate(wave = as.factor(str_sub(wave, start=5, end=5)),
         BITSEA = as.integer(ifelse(BITSEA=="Not applicable", NA, BITSEA)),
         CBCL = as.integer(ifelse(CBCL=="Not applicable", NA, CBCL))) -> df2

str(df2)

# Find number of waves recorded per subject
df2 %>% 
  group_by(subject_id) %>%
  summarise(total_wave = n()) -> temp1.df

# Find what type of wave was recorded first
df2 %>% 
  group_by(subject_id) %>% 
  filter(wave == 1) %>% 
  summarise(first_wave = ifelse(is.na(BITSEA), "CBCL", "BITSEA")) -> temp2.df

# Combine the new information
left_join(df2, temp1.df, by="subject_id") %>% 
  left_join(., temp2.df, by="subject_id") -> full.df 
 
# Filter on certain criteria
full.df %>% 
  filter(total_wave == 4)

full.df %>% 
  filter(total_wave == 4) %>% 
  filter(first_wave=="BITSEA")
  
full.df %>% 
  filter(total_wave == 4) %>% 
  filter(first_wave=="CBCL")

HTH

yalem · June 12, 2020, 5:27am

Thank you so much! This is what I want to do.
Much appreciated.

system · July 3, 2020, 5:27am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.