Counting first occurrences of a disease in R

I am trying to count the first occurrence of a disease (say myocardial infarction (MI) "heart attack") but I am having a hard time implementing this in R (base or tidyverse). Any help is appreciated.

n_id <- 5 # five individuals
n_time <- 4 # four time pints
id <- rep(1:n_id, each = n_time)
time <- rep(1:n_time,times = n_id)
MI <- c(0,0,1,1,
        0,1,1,1,
        0,0,0,1,
        0,0,0,0,
        0,0,0,0)
dsn <- data.frame(id, time, MI)
dsn
#>    id time MI
#> 1   1    1  0
#> 2   1    2  0
#> 3   1    3  1
#> 4   1    4  1
#> 5   2    1  0
#> 6   2    2  1
#> 7   2    3  1
#> 8   2    4  1
#> 9   3    1  0
#> 10  3    2  0
#> 11  3    3  0
#> 12  3    4  1
#> 13  4    1  0
#> 14  4    2  0
#> 15  4    3  0
#> 16  4    4  0
#> 17  5    1  0
#> 18  5    2  0
#> 19  5    3  0
#> 20  5    4  0

#I want to count the first occurrence of MI but I am not sure how to do so
#Perhaps, I first need to set the second and later occurrences to missing
#and then count, but this seems inefficient

MI2 <- c(0,0,1,NA,
        0,1,NA,NA,
        0,0,0,1,
        0,0,0,0,
        0,0,0,0)
dsn2 <- data.frame(id, time, MI, MI2)
dsn2
#>    id time MI MI2
#> 1   1    1  0   0
#> 2   1    2  0   0
#> 3   1    3  1   1
#> 4   1    4  1  NA
#> 5   2    1  0   0
#> 6   2    2  1   1
#> 7   2    3  1  NA
#> 8   2    4  1  NA
#> 9   3    1  0   0
#> 10  3    2  0   0
#> 11  3    3  0   0
#> 12  3    4  1   1
#> 13  4    1  0   0
#> 14  4    2  0   0
#> 15  4    3  0   0
#> 16  4    4  0   0
#> 17  5    1  0   0
#> 18  5    2  0   0
#> 19  5    3  0   0
#> 20  5    4  0   0
incidence <- sum(dsn2$MI2, na.rm = TRUE)/n_id
incidence
#> [1] 0.6

Only three people out of 5 were diagnosed with MI and so the incidence is = 3/5 =0.6

#I am trying to find a way to get to the 0.6 without having to count manually
#Any help is appreciated

count of 'first' occurance, is logically equivalent to count of 'any' occurence for the id ?
if so then ...

dsn2 <- dsn %>% group_by(id) %>%
                        summarise(MI_max = max(MI))


incidence <- mean(dsn2$MI_max)

#perhaps simpler syntax
dsn2 <- dsn %>% 
  select(id,MI) %>% 
  group_by(id) %>%
  summarise_all(max)

incidence <- mean(dsn2$MI_max)
1 Like

Thank you. This works great. I realize I wasn't clear in my example. These methods work great in general but I wanted a way to get the incidence and prevalence by time period. The incidence is the proportion of new cases that occur at a particular time divided by the number of people who did not get the disease

n_id <- 5 # five individuals
n_time <- 4 # four time pints
id <- rep(1:n_id, each = n_time)
time <- rep(1:n_time,times = n_id)
MI <- c(0,0,1,1,
        0,1,1,1,
        0,0,0,1,
        0,0,0,0,
        0,0,0,0)
dsn <- data.frame(id, time, MI)
MI2 <- c(0,0,1,NA,
         0,1,NA,NA,
         0,0,0,1,
         0,0,0,0,
         0,0,0,0)
dsn2 <- data.frame(id, time, MI, MI2)
library(dplyr)
arrange(dsn2, time)
dsn2

#>    id time MI MI2
#> 1   1    1  0   0
#> 2   2    1  0   0
#> 3   3    1  0   0
#> 4   4    1  0   0
#> 5   5    1  0   0
#> 6   1    2  0   0
#> 7   2    2  1   1
#> 8   3    2  0   0
#> 9   4    2  0   0
#> 10  5    2  0   0
#> 11  1    3  1   1
#> 12  2    3  1  NA
#> 13  3    3  0   0
#> 14  4    3  0   0
#> 15  5    3  0   0
#> 16  1    4  1  NA
#> 17  2    4  1  NA
#> 18  3    4  1   1
#> 19  4    4  0   0
#> 20  5    4  0   0

#in the example above, it can be calculated as below
#For the incidence at each time point (proportion of new cases that occur at a particular time divided by the number of people who did not get the disease)
#time 1 = 0/5 =0
#time 2 = 1/5 =0.2
#time 3 = 1/4 =0.25
#time 4 = 1/3 =0.33

##For the prevalence at each time point (the proportion of new and old cases divided by total population)
#time 1 = 0/5 =0
#time 2 = 1/5 =0.2
#time 3 = 2/5 =0.4
#time 4 = 3/5 =0.6

time <- 1:4
incidence <- c(0/5, 1/5, 1/4, 1/3)
prevalence <- c(0/5, 1/5, 2/5, 3/5)

results <- cbind(time, incidence, prevalence)
results
#>      time incidence prevalence
#> [1,]    1 0.0000000        0.0
#> [2,]    2 0.2000000        0.2
#> [3,]    3 0.2500000        0.4
#> [4,]    4 0.3333333        0.6

I'd' like to be able to do this for each time point and accounting for what happened at the previous time point. Would a for loop be the way to go? Thank you so much

I'm missing the concept here. what do NA's mean for MI2 ? that its not know whether they did or didnt have a second heart attack in the period of time being measured ?
I'm not sure what you consider the categories of New and Old to represent, is that per id across the time variable (oh yes id 2 has MI in period 3 but they had MI in period 2 so in period 3 its 'old' ) ? or across the MI and MI2 distinction ? or both ?

in time period for whete you have 1 over 3 representing new cases(1) dividied people who did not get the disease (3) who are the 3 ids who you consider to not have gotten the ids?
I see that in time period 4 , the ids 4 and 5 did not get any MI nor MI2.
alternatively only a single id in period 4 (id 3) has an MI2 so its not 5-1=4. So im confused how the 3 gets counted. Can you clarify the concepts at play ?

Apologies if I'm being dense

It sounds like incidence at time t depends only on people who were disease-free at time t -1, and that MI2 was @simRock's way of trying to solve the problem, but that all the necessary information is probably contained in dsn -- is that right @simRock?

I've tried to find a solution, given such assumptions:

n_id <- 5 # five individuals
n_time <- 4 # four time pints
id <- rep(1:n_id, each = n_time)
time <- rep(1:n_time,times = n_id)
MI <- c(0,0,1,1,
        0,1,1,1,
        0,0,0,1,
        0,0,0,0,
        0,0,0,0)
dsn <- data.frame(id, time, MI) %>% arrange(time,id)

library(tidyverse)


dsn2 <- group_by(dsn,id) %>%
  mutate(lag_MI=lag(MI),
         event = case_when( MI==1 & !(lag_MI==1) ~ "New",
                            MI==1 & (lag_MI==1)  ~ "Old",
                            TRUE ~ "0"))

dsn2
# A tibble: 20 x 5
# Groups:   id [5]
# id  time    MI lag_MI event
# <int> <int> <dbl>  <dbl> <chr>
# 1      1     1     0     NA 0    
# 2      2     1     0     NA 0    
# 3      3     1     0     NA 0    
# 4      4     1     0     NA 0    
# 5      5     1     0     NA 0    
# 6      1     2     0      0 0    
# 7      2     2     1      0 New  
# 8      3     2     0      0 0    
# 9      4     2     0      0 0    
# 10     5     2     0      0 0    
# 11     1     3     1      0 New  
# 12     2     3     1      1 Old  
# 13     3     3     0      0 0    
# 14     4     3     0      0 0    
# 15     5     3     0      0 0    
# 16     1     4     1      1 Old  
# 17     2     4     1      1 Old  
# 18     3     4     1      0 New  
# 19     4     4     0      0 0    
# 20     5     4     0      0 0    

#For the incidence at each time point
#(proportion of new cases that occur at a particular time 
#divided by the number of people who did not get the disease)
#time 1 = 0/5 =0
#time 2 = 1/5 =0.2   <--- 1 person got the disease therefore not more than 4 people did not get the disease so rather than divide by 5 should divide by 4 and get 0.25 ?
#time 3 = 1/4 =0.25
#time 4 = 1/3 =0.33

incidence_report <- dsn2 %>% group_by(time) %>%
  summarise(indicidence_never = sum(event=="New")/sum(event=="0"),
            indicidence_inc_old = sum(event=="New")/sum(event!="New"))
# A tibble: 4 x 3
# time indicidence_never indicidence_inc_old
# <int>             <dbl>               <dbl>
# 1     1             0                    0   
# 2     2             0.25                 0.25
# 3     3             0.333                0.25
# 4     4             0.5                  0.25



##For the prevalence at each time point 
#(the proportion of new and old cases divided by total population)
#time 1 = 0/5 =0
#time 2 = 1/5 =0.2
#time 3 = 2/5 =0.4
#time 4 = 3/5 =0.6

prevalence_report <- dsn2 %>% group_by(time) %>%
  summarise(prevalence = sum(event %in% c("New","Old")/n()))

> prevalence_report
# A tibble: 4 x 2
# time prevalence
# <int>      <dbl>
#   1        0  
#   2        0.2
#   3        0.4
#   4        0.6

combineded_report <- dsn2 %>% group_by(time) %>%
  summarise(indicidence_never = sum(event=="New")/sum(event=="0"),
            indicidence_inc_old = sum(event=="New")/sum(event!="New"),
            prevalence = sum(event %in% c("New","Old")/n()))

Thank you both @ nirgrahamuk and dromano
Actually, I created the MI2 with NA's as another way to get what I wanted. But I don't think it is the best way. You can ignore MI2.
Basically I wanted to calculate the incidence as the number of new cases at a certain point divided by the number of people free of disease. So that means that in time 1 (5 -0 = 5 people are free of disease), in time 2 (5- 0 = 5 people are free of disease), in time 3 (5-1 = 4 people are free of disease) and in time 4 (4-1 = 3 are free of disease). and so to get the incidence
Hope that helps

The below is an attempt to this. Someone else offered an answer that I modified. Is this the most efficient way to go about it? Thank you. I'll use it if I can't find anything else

library(tidyverse)
dsn2 %>%
  #Group by time
  group_by(time) %>%
  #Get the sum of positives and negatives, as well as total ID number
  summarise(pos = sum(MI ==1),
            neg = sum(MI ==0),
            totalID = n_distinct(id)) %>%
  #add lagged entry of positives
  mutate(poslag = lag(pos), 
         neglag = lag(neg)) %>%
  #Replace NA with zero in poslag and 1 in neglag (because of the division)
  mutate(poslag = case_when(is.na(poslag) ~ 0, TRUE ~ as.double(poslag)),
         neglag = case_when(is.na(neglag) ~ 1, TRUE ~ as.double(neglag))) %>%
  #Get the number of new cases using pos and poslag
  mutate(news = pos - poslag) %>%
  #Get incidence and prevalence
  mutate(incidence = news/neglag,
         prevalence = pos/totalID) %>%
  #Stay only with the time, incidence and prevalence columns
  select(time, incidence, prevalence)
#> # A tibble: 4 x 3
#>    time incidence prevalence
#>   <int>     <dbl>      <dbl>
#> 1     1     0            0  
#> 2     2     0.2          0.2
#> 3     3     0.25         0.4
#> 4     4     0.333        0.6

I'm still unsure about one thing: Is the MI column intended to be always 0 before an occurrence, and always 1 after? Or could a subject have MI = 0 at time 1, 1 at time 2, 0 at time 3, and 1 at time 4, indicating occurrences only at times 2 and 4?

Assuming the answer is 'yes' to the second question, here's an alternative (which also works if the answer is 'no'):

dsn %>% 
  # first calculate the number of MI to date per subject
  group_by(id) %>% 
  arrange(time) %>% 
  mutate(MI_to_date = cumsum(MI)) %>%
  ungroup() %>% 
  # identify subjects with multiple occurrences to date
  mutate(m.o. = if_else(MI_to_date > 1, 1, 0)) %>% 
  # identify dates of first occurence
  mutate(f.o. = if_else(MI_to_date == 1, 1, 0)) %>% 
  # calculate incidence at date using subjects previously without MI
  # (assigns incidence to these subjects, 0 to rest)
  group_by(time, m.o.) %>% 
  mutate(incid. = sum(f.o.)/n()) %>% 
  # extract incidence and prevalence
  group_by(time) %>% 
  summarise(incid. = max(incid.), prev. = sum(f.o. + m.o.)/n())

@dromano
Actually, in this set up, I assumed that they will remain sick (not true for an acute disease but reasonable for chronic diseases such as diabetes) so if someone gets diagnosed at a time point, they carry that status until the end of follow-up. Also, MI is intented to be 0 if no MI and 1 if MI. The 1 carries in the future time points because I assuming it is possible in the disease that I am studying. Does that make sense?

Thanks @simRock, and it looks like we were posting simultaneously: I think the solution I gave should work whether the disease is chronic or not.

Sim and Dromano,
please humour me, if you can explain what puts 3 people in the denominator for time period 4?
I just don't get the principle at play :sweat_smile:
I understand thatyou might exclude 2 people as having been diseased in period 3 and that would leave 3 people for period 4, but one of the 3 people is MI in period 4 (is that not relevent?. so shouldnt the denomitor be 2?
I suppose it hinges on whether you are dividing by people that 'could have got the attack' in the period in question. or people that could but didnt

Here's the output from the code I posted (before summarizing) for period 4:

# A tibble: 20 x 7
# Groups:   time, m.o. [6]
      id  time    MI MI_to_date  m.o.  f.o. incid.
   <int> <int> <dbl>      <dbl> <dbl> <dbl>  <dbl>
...
16     1     4     1          2     1     0  0    
17     2     4     1          3     1     0  0    
18     3     4     1          1     0     1  0.333
19     4     4     0          0     0     0  0.333
20     5     4     0          0     0     0  0.333

For incidence, the people with id's 1 and 2 have to be excluded since they had the disease before period 4. Does that make sense?

Right, so its that the divisor is the number of people that 'could have gotten the disease for the first time in that period'
Thanks for confirming :slight_smile:

1 Like

Just for completeness, heres my version of it from the dsn part forwards:

dsn2 <- group_by(dsn,id) %>%
  mutate(lag_MI=lag(MI),
         event = case_when( MI==1 & !(lag_MI==1) ~ "New",
                            MI==1 & (lag_MI==1)  ~ "Old",
                            TRUE ~ "0")) %>% group_by(time) %>%
summarise(indicidence = sum(event=="New")/sum(event %in% c("New","0")),
          prevalence = sum(event %in% c("New","Old")/n()))

Agreed with @ dromano.
Just to confirm again that the denominator for the incidence is the number of people that we call susceptibles (that is the number of people who could have gotten the disease)
Thank you all nirgrahamuk and dromano.

Quick question for nirgrahamuk.
What does group_by(dsn, id) do. I know what group_by(id) does. Thank you

the first parameter into group by is the dataframe to process. in this case dsn.
using the %>% has the effect of substitute whats on the left as the first argument to the function coming on the right
so

dsn2 <- dsn %>% group_by(id)
is equivalent to
dsn2 <- group_by(dsn,id)

its a matter only of style

Wonderful. That makes sense. Thanks so much