Populate variable with condition based on total of each level

latsyrhccm · April 23, 2020, 3:24pm

I was looking for help with a problem I'm currently stuck on in a project (reprex after the text).

Basically what I'm trying to do is populate a variable with a criteria of levels, based on the number of times a patient recorded data for the week in order to explore quality of recording.

The criteria for levels is as follows:

3+ readings/week == "4",
3 readings/week == "3",
2 readings/week == "2",
1 reading/week == "1",
NA == "0"

I first created a new variable for weeks by using the week() function of lubridate, which gave me week numbers based on where the dates were in the year. Ideally I'd like to allocate week numbers in ascending order (1-n), starting at 1 from the first date the patient recorded to the last date.

Been thinking of using a for loop but currently using case_when. The problem I'm currently encountering is setting the condition to check the frequency of levels for each patient id in order to then assign the criteria.

Any help would be beneficial as I've been stuck on it for most of today, many thanks (reprex below).

library(lubridate)
#> Warning: package 'lubridate' was built under R version 3.5.3
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.5.3
#> Warning: package 'ggplot2' was built under R version 3.5.3
#> Warning: package 'tibble' was built under R version 3.5.3
#> Warning: package 'tidyr' was built under R version 3.5.3
#> Warning: package 'readr' was built under R version 3.5.3
#> Warning: package 'purrr' was built under R version 3.5.3
#> Warning: package 'dplyr' was built under R version 3.5.3
#> Warning: package 'stringr' was built under R version 3.5.3
#> Warning: package 'forcats' was built under R version 3.5.3

##Variables##

patientid <- c("-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646")

date <- c("2018-08-06", "2018-08-07", "2018-08-07", "2018-08-07", "2018-08-15", "2018-08-15", "2018-08-15", "2018-08-20", "2018-08-20",
          "2018-08-20", "2018-08-27", "2018-08-27", "2018-08-27", "2018-09-03", "2018-09-03", "2018-09-03")

week <- week(date)

adherence <- ""

test.df <- data.frame(patientid, date, week, adherence) #test df with variables above

##Dataframe and attempt##

table(test.df$week) #See frequency of each
#> 
#> 32 33 34 35 36 
#>  4  3  3  3  3

test.df <- test.df %>% #Dataframe
  mutate(
    patientid = as.factor(patientid),
    date = as.Date(date),
    week = as.factor(week))

adherence <- test.df %>% #Attempt to create if/else/else if loop to populate adherence
  mutate(week = 
           if(count(week) > 3){adherence == "4"})
#> Error in UseMethod("summarise_"): no applicable method for 'summarise_' applied to an object of class "factor"

siddharthprabhu · April 23, 2020, 4:43pm

It seems summarize() and case_when() might do the job here. Let me know if this isn't in line with your expected output.

library(tidyverse)
library(lubridate, warn.conflicts = FALSE)

##Variables##

patientid <- c("-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646")

date <- c("2018-08-06", "2018-08-07", "2018-08-07", "2018-08-07", "2018-08-15", "2018-08-15", "2018-08-15", 
          "2018-08-20", "2018-08-20", "2018-08-20", "2018-08-27", "2018-08-27", "2018-08-27", "2018-09-03", 
          "2018-09-03", "2018-09-03")

test.df <- data.frame(patientid, date)

##Dataframe and attempt##

test.df %>% 
  mutate(date = as_date(date), week = week(date)) %>% 
  group_by(patientid, week) %>% 
  count() %>% 
  mutate(adherence = case_when(n > 3 ~ 4L, 
                               n %in% c(1L, 2L, 3L) ~ n,
                               TRUE ~ 0L))
#> # A tibble: 5 x 4
#> # Groups:   patientid, week [5]
#>   patientid    week     n adherence
#>   <fct>       <dbl> <int>     <int>
#> 1 -2147483646    32     4         4
#> 2 -2147483646    33     3         3
#> 3 -2147483646    34     3         3
#> 4 -2147483646    35     3         3
#> 5 -2147483646    36     3         3

^{Created on 2020-04-23 by the reprex package (v0.3.0)}

latsyrhccm · April 23, 2020, 5:35pm

Thanks, this works for the test dataframe but unfortunately not on my main dataframe. My main one has different patient entries that occurred on the same date, so it accounts for those when it tallies them up into n.

If there was a way to apply above to duplicate dates across different patient ids it would fix it. I've been trying to figure this out by changing the grouping of the dataframe, but couldn't find a way to distinguish duplicates to the function.

Thanks for the help.

siddharthprabhu · April 23, 2020, 6:01pm

Do you mean that if a single patient has multiple recordings on the same date, it should be treated as just one recording? We can handle that by using n_distinct() instead of n().

library(tidyverse)
library(lubridate, warn.conflicts = FALSE)

patientid <- c("-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646")

date <- c("2018-08-06", "2018-08-07", "2018-08-07", "2018-08-07", "2018-08-15", "2018-08-15", "2018-08-15", 
          "2018-08-20", "2018-08-20", "2018-08-20", "2018-08-27", "2018-08-27", "2018-08-27", "2018-09-03", 
          "2018-09-03", "2018-09-03")

test.df <- data.frame(patientid, date)

test.df %>% 
  mutate(date = as_date(date), week = week(date)) %>% 
  group_by(patientid, week) %>% 
  summarize(n = n_distinct(date)) %>% 
  mutate(adherence = case_when(n > 3 ~ 4L, 
                               n %in% c(1L, 2L, 3L) ~ n,
                               TRUE ~ 0L))
#> # A tibble: 5 x 4
#> # Groups:   patientid [1]
#>   patientid    week     n adherence
#>   <fct>       <dbl> <int>     <int>
#> 1 -2147483646    32     2         2
#> 2 -2147483646    33     1         1
#> 3 -2147483646    34     1         1
#> 4 -2147483646    35     1         1
#> 5 -2147483646    36     1         1

^{Created on 2020-04-23 by the reprex package (v0.3.0)}

latsyrhccm · April 23, 2020, 7:39pm

Lack of explanation on my part - what I'm looking to do is use this code basically to aggregate adherence like this example reprex:

# Turn this
week n
32   2
32   1
33   1
33   1

# Into this
patientid week n adherence
1234      32   2 3
1234      33   1 2

I want to do this for each patient id. Each patient id shares dates and weeks with other patient ids, so I want to aggregate on a patient-by-patient basis.

siddharthprabhu · April 24, 2020, 5:59am

OK, let me make sure I understand clearly before supplying more code. Please correct me if my interpretation isn't right.

The variable n should contain the number of observations (readings) recorded by one patientid in a given week. If the same patientid has more than one reading on the same day, it should treated as a separate observation (and thus added to the n count).
adherence should be calculated based on the value of n for each combination of patientid and week using the following criteria:
3+ n per week == "4",
3 n per week == "3",
2 n per week == "2",
1 n per week == "1",
NA == "0"

I'm asking for this clarification because from your last post it seems as though adherence is simply computed by summing n for each patientid and week and not by using the rule set above.

latsyrhccm · April 24, 2020, 10:25am

That's it 100%. Thanks for sticking with this I appreciate it a lot.

siddharthprabhu · April 24, 2020, 10:59am

No problem. Can you try this code on your main data frame? n will be computed for each combination of patientid and week, so that should address the issue quoted below.

test.df %>% 
  mutate(date = as_date(date), week = week(date)) %>% 
  count(patientid, week) %>% 
  mutate(adherence = case_when(n > 3 ~ 4L, 
                               n %in% c(1L, 2L, 3L) ~ n,
                               TRUE ~ 0L))

latsyrhccm · April 24, 2020, 11:26am

That's it mate! Thanks very much for your help, means a great deal.

If I wanted date to display too, would arrange()/group_by() do that? As when I add date into group_by() it returns just the same column as count(week).

siddharthprabhu · April 24, 2020, 11:49am

That won't work because the individual dates are being aggregated by count().

If you want to retain the original structure, we'll need to perform a grouped mutate() instead. This will result in multiple rows for each combination of patientid and week with each row having the same values for n and adherence.

library(tidyverse)
library(lubridate, warn.conflicts = FALSE)

patientid <- c("-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", "-2147483646", 
               "-2147483646", "-2147483646", "-2147483646", "-2147483646")

date <- c("2018-08-06", "2018-08-07", "2018-08-07", "2018-08-07", "2018-08-15", "2018-08-15", "2018-08-15", 
          "2018-08-20", "2018-08-20", "2018-08-20", "2018-08-27", "2018-08-27", "2018-08-27", "2018-09-03", 
          "2018-09-03", "2018-09-03")

test.df <- data.frame(patientid, date)

test.df %>% 
  mutate(date = as_date(date), week = week(date)) %>% 
  group_by(patientid, week) %>% 
  mutate(n = n(), 
         adherence = case_when(n > 3 ~ 4L, 
                               n %in% c(1L, 2L, 3L) ~ n,
                               TRUE ~ 0L))
#> # A tibble: 16 x 5
#> # Groups:   patientid, week [5]
#>    patientid   date        week     n adherence
#>    <fct>       <date>     <dbl> <int>     <int>
#>  1 -2147483646 2018-08-06    32     4         4
#>  2 -2147483646 2018-08-07    32     4         4
#>  3 -2147483646 2018-08-07    32     4         4
#>  4 -2147483646 2018-08-07    32     4         4
#>  5 -2147483646 2018-08-15    33     3         3
#>  6 -2147483646 2018-08-15    33     3         3
#>  7 -2147483646 2018-08-15    33     3         3
#>  8 -2147483646 2018-08-20    34     3         3
#>  9 -2147483646 2018-08-20    34     3         3
#> 10 -2147483646 2018-08-20    34     3         3
#> 11 -2147483646 2018-08-27    35     3         3
#> 12 -2147483646 2018-08-27    35     3         3
#> 13 -2147483646 2018-08-27    35     3         3
#> 14 -2147483646 2018-09-03    36     3         3
#> 15 -2147483646 2018-09-03    36     3         3
#> 16 -2147483646 2018-09-03    36     3         3

^{Created on 2020-04-24 by the reprex package (v0.3.0)}

latsyrhccm · April 24, 2020, 12:55pm

Cool thanks for the elaboration!

Ideally that's what I wanted to achieve, much appreciated for going out of your way to help me with this.

Just one last thing. Now that I have adherence added to a table with patientid and weeks, I had an idea that I could try and merge it with the original dataframe based on patientid and week.

By merging them by matching variables arranged in the same order, could that potentially give me my final goal of having dates included? About to try this so more just spitting out ideas.

siddharthprabhu · April 24, 2020, 1:23pm

The grouped mutate answer I provided above contains the date variable. Merging the summarized table with your original data frame will give you the same result.

latsyrhccm · April 24, 2020, 1:31pm

Ah yes you're right. That's basically it man thank you for all the help.

All the best!

siddharthprabhu · April 24, 2020, 1:45pm

Cool! If I solved your problem, please consider marking the relevant post as a solution.

system · May 1, 2020, 1:45pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.