Aggregate different observations group_by ID

Dear R-community,
I am quiet new to R.
My dataframe with more than 800.000 patients looks like this:
df <- data.frame("ID" = c("1","1","1","1","1","2","2","2","3","3"), "therapies" = c("A51", "B32", "A67","A99","L37","A64","A51","L45","B32","A55"), "number of therapies" = c(8,2,6,1,7,15,3,2,9,10), "duration of therapie in min." = c(240, 120, 189, 30, 210, 450, 60,60, 180,400), "Total duration of inpatient treatment" = c(21,21,21,21,21,24,24,24,18,18))

  1. First I want to aggregate all A- therapies, all B-therapies and all L-therapies by ID to perform a descriptive stats on the three different therapies (and their numbers and durations)
  2. After that I want to sum up all therapies(A+B+L) (and their numbers and durations) by ID.
  3. To calculate the average duration and number of each ID I would use dplyr:: mutate (average_duration = aggregated duration/ total duration of inpatient treatment)Right?
  4. After aggregating everything (to get one ID per row) I want to merge this dataframe with another by ID.
    I would be very grateful if you could help me especially with the 1. and 2. Thanks in advance.

Hi, welcome to the forum Here is a very quick and dirty first cut at some of what I think you want. However if you have more than 800.000 patients I sespect than someof the coding especially for the therapies variable will be more complicated than your example.

In any case am I even on the right path for what you want?

To cut down on typing I have renamed your data.frame and variables

dat1 -> df # df is an actual namee of a function

id -> ID
therapy -> therapies
number -> number of therapies
duration -> duration of therapie in min.
total.dur -> Total duration of inpatient treatment


dat1 <- data.frame(id = c("1","1","1","1","1","2","2","2","3","3"), 
therapy = c("A51", "B32", "A67","A99","L37","A64","A51","L45","B32","A55"), 
number = c(8,2,6,1,7,15,3,2,9,10), duration = c(240, 120, 189, 30, 210, 450, 60,60, 180,400), 
total.dur = c(21,21,21,21,21,24,24,24,18,18))

library(tidyverse)
library("stringr")
dat1$alph <-   str_sub(dat1$therapy, 1, 1) # extrart alpha part of "therapy"
dat1$num <-   str_sub(dat1$therapy, - 2, - 1)  # extrart alpha part of "therapy". Not needed at the moment


dat2 <-    dat1  %>% group_by(alph, id)
     dat2 %>% summarise(mean.duration = mean(duration), sd.duration  = sd(duration), 
                        mean.total.dur = mean(total.dur), sd.total.dur, n = n())
     

Thank you for your quick reply!
Yes, it is a first step in the right direction.!
(Besides: In the last function: dat2 %>%(mean.duration....) I got an error)
After extracting the A,B and L (from therapy) it would be great to learn how to sum up the A's, B's and L's by ID. And in case this is too complicated, it would also help me to sum up the different variables (therapy, number, duration, total.dur) to get one ID per row. Is this possible? Thank you!

As a beginner to R you may benefit from studying this useful book.
https://r4ds.had.co.nz/

1 Like

My apologies on that stupid error. It was late here and I goofed in a cut & paste. I think this will work. Note the large numbers of NAs are due to the single data entry for that group.

dat2 <-    dat1  %>% group_by(alph, id)
dat2 %>% summarise(mean.duration = mean(duration), sd.duration  = sd(duration), 
                   mean.total.dur = mean(total.dur), sd.total.dur = sd(total.dur), n = n())

Can you explain a bit about the difference be tween your points 1 & 2? I am reading them as being the same so I definitely am bissing something.

Sure, in the first step I want to do descriptive statistics of therapy A, therapy B and therapy L independent of each other. (like mean and SD of therapy A, mean and SD of therapy B, mean and SD of therapy C, distribution gender/ age-categories in relation to the average duration/average number of the therapies...) Just to get an overview, to see the distributions and to compare the different therapies. In the second step I want to sum up all therapies group_by ID, all durations group_by ID and all numbers group_by ID to get one ID per row. If this is possible I would combine it with another dataframe (with sociodemographic variables) group_by ID to do a regression. Does it make it clearer what I want to do?

Day before national holiday here so decent response may be late. I think I may understand but I have to get final shopping done.

Dear jrkrideau,
I finally found a solution. You helped me with part of the solution. Thanks for that.
If you are interested:
dat1$alph <- str_sub(dat1$therapy, 1, 1)
dat2 <- dat1[, c('id', 'total.dur')]
dat2 <- dat2%>%
distinct(id, total.dur)
dat1 <- subset( dat1, select = -therapy )
dat1 <- subset(dat1, select = -total.dur)
dat1 <- dat1 %>%
group_by(id, alph)%>%
summarise(number = sum(number), duration = sum(duration))
dat1 <- subset(dat1, select = -alph)
dat1 <- dat1 %>%
group_by(id)%>%
summarise(total.number = sum(number), total.duration = sum(duration))
dat <- merge(dat1, dat2, by = "id")

I know it is a bit complicated and for sure there are easier ways but the main thing is that I have a solution.
Thank you! Sil

Great news. Good luck

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.