R: duplicate values row-wise based on group then select rows that match a criteria - unusual wrangling problem

Andrius · March 25, 2021, 8:43pm

Hi Guys, new here and fairly fresh to R.

I have data set that needs cleaning.
ID column is identifies unique subjects.
visit_index is a visit number (total of 3 visits per subject).

What I need is:

to get measurement and adherence duplicated across their corresponding subjects, by ID
to only keep "exercise" intervention row and another row that has visit_index = minus 1 of the exercise visit_index.

So we end up with two rows per subject, with measurements duplicated.
Actual dataset is bigger - over 100 variables. So to duplicate values I'd like to pass in a list of variables or a range(s) of columns.

I got this step but could not progress further.

# This grabs the value of visit_index for exercise, adds into new column
df2 <- df2 %>% 
  group_by(ID) %>%
  mutate(
    visit_exercise =
      ifelse(intervention == "exercise", visit_index, NA)
  )

Input data and desired output:

# Example data:
df2 <- read.table(text=
"visit_index	ID	intervention	adherence	measurement
0	01JV	baseline	66.1	24.5
1	01JV	exercise	NA	NA
2	01JV	detrain	NA	NA
0	02AM	baseline	52.0	21.3
1	02AM	detrain	NA	NA
2	02AM	exercise	NA	NA
0	03JW	baseline	83.7	23.6
1	03JW	detrain	NA	NA
2	03JW	exercise	NA	NA
", header=TRUE) 


# desired output:
df3 <- read.table(text=
                    "visit_index	ID	intervention	adherence	measurement
0	01JV	baseline	66.1	24.5
1	01JV	exercise	66.1	24.5
1	02AM	detrain	52.0	21.3
2	02AM	exercise	52.0	21.3
1	03JW	detrain	83.7	23.6
2	03JW	exercise	83.7	23.6
", header=TRUE)

technocrat · March 25, 2021, 9:31pm

I can't connect the stated logic with the df3 exemplar as regards intervention, so this may not be what is desired.

df2 <- read.table(text=
                    "visit_index    ID  intervention    adherence   measurement
0   01JV    baseline    66.1    24.5
1   01JV    exercise    NA  NA
2   01JV    detrain NA  NA
0   02AM    baseline    52.0    21.3
1   02AM    detrain NA  NA
2   02AM    exercise    NA  NA
0   03JW    baseline    83.7    23.6
1   03JW    detrain NA  NA
2   03JW    exercise    NA  NA
", header=TRUE) 

suppressPackageStartupMessages({
  library(dplyr)
})

df2 %>% mutate(adherence = case_when(
  !is.na(adherence) ~ adherence,
  is.na(adherence) ~ lag(adherence),
  is.na(lag(adherence)) ~ lag(adherence),
)) %>% mutate(adherence = case_when(
  !is.na(adherence) ~ adherence,
  is.na(adherence) ~ lag(adherence),
  is.na(lag(adherence)) ~ lag(adherence),
)) %>%
  mutate(measurement = case_when(
    !is.na(measurement) ~ measurement,
    is.na(measurement) ~ lag(measurement),
    is.na(lag(measurement)) ~ lag(measurement),
  )) %>%
  mutate(measurement = case_when(
    !is.na(measurement) ~ measurement,
    is.na(measurement) ~ lag(measurement),
    is.na(lag(measurement)) ~ lag(measurement),
  )) %>% 
  filter(intervention != "exercise")
#>   visit_index   ID intervention adherence measurement
#> 1           0 01JV     baseline      66.1        24.5
#> 2           2 01JV      detrain      66.1        24.5
#> 3           0 02AM     baseline      52.0        21.3
#> 4           1 02AM      detrain      52.0        21.3
#> 5           0 03JW     baseline      83.7        23.6
#> 6           1 03JW      detrain      83.7        23.6

nirgrahamuk · March 25, 2021, 9:38pm

another option


#to get measurement and adherence duplicated across their corresponding subjects, by ID
(df2_a <- group_by(df2,ID) %>%
  mutate(across(c(adherence,
                  measurement),
                min,na.rm=TRUE)) %>%
    ungroup())

# to only keep "exercise" intervention row and another row that has visit_index = minus 1 of the exercise visit_index.
(df2_b <-filter(df2_a,
                intervention=="exercise") %>%
    mutate(vi_to_match = visit_index - 1L))

(df2_c <-distinct(df2_b,ID,vi_to_match) %>% 
    left_join(df2_a, 
              by=c("vi_to_match"="visit_index",
                                "ID"="ID")) %>%
    rename("visit_index"="vi_to_match"))

#combine
(df3x <- bind_rows(df2_b,df2_c) %>% 
    select(-vi_to_match) %>% 
    arrange(ID,visit_index))

mhakanda · March 25, 2021, 9:46pm

Welcome to the community. You can try this way

df2 %>%
  group_by(ID) %>%
  mutate(adherence = sum(adherence,na.rm = TRUE), measurement = sum(measurement,na.rm = TRUE)) %>%
  mutate(dd = as.numeric(intervention=="exercise")*visit_index-1) %>%
  filter(intervention == "exercise" | (visit_index==max(dd))) %>%
  mutate(dd=NULL) %>%
  ungroup()
#> # A tibble: 6 x 5
#>   visit_index ID    intervention adherence measurement
#>         <int> <chr> <chr>            <dbl>       <dbl>
#> 1           0 01JV  baseline          66.1        24.5
#> 2           1 01JV  exercise          66.1        24.5
#> 3           1 02AM  detrain           52          21.3
#> 4           2 02AM  exercise          52          21.3
#> 5           1 03JW  detrain           83.7        23.6
#> 6           2 03JW  exercise          83.7        23.6

Andrius · March 28, 2021, 12:01am

Thank you mhakanda, that was the most elegant solution!

system · April 4, 2021, 12:01am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.