Aggregating rows to create a new data frame

Hi everyone,
I just started to learn R and I'm having an issue. I have a data frame, a very simple one, of three columns of factor and integers.

library(tidyverse)
head(df,5)
#>                                               
#> 1 function (x, df1, df2, ncp, log = FALSE)    
#> 2 {                                           
#> 3     if (missing(ncp))                       
#> 4         .Call(C_df, x, df1, df2, log)       
#> 5     else .Call(C_dnf, x, df1, df2, ncp, log)

data.frame(
         age = c(10L, 11L, 12L, 13L, 14L),
    suicides = c(0L, 0L, 2L, 2L, 1L),
        year = as.factor(c("1981", "1981", "1981", "1981", "1981"))
)
#>   age suicides year
#> 1  10        0 1981
#> 2  11        0 1981
#> 3  12        2 1981
#> 4  13        2 1981
#> 5  14        1 1981

Created on 2020-05-02 by the reprex package (v0.3.0)

I would like to create a new dataframe where the columns "year" still remain the same, the columns "age" would show an interval like "10-19", "20-29" and so on, and the columns "suicides" would show the cumulative sum of values ranging from age 10 to age 19 then from age 20 to age 29 and so on.
Keep in mind that year span from 1981 to 2017 and age from 10 to 89 for all the year variable.

I think maybe it's an easy task but I didn' t find anything for now on internet.

I hope I've been clear enough explaining my issue and thanks in advance for your help!

Hi @Dany and welcome! Thanks for your question and for including some sample data! There are a few ways to do this, but I like to use the cut() function, which essentially creates bins from a continuous variable.

I generated some more fake data that sounds like it matches your data structure so you can see how it works. The first step in the pipeline is to create a new variable which is the age categories using cut. This will assign one age bin to each of the rows in your original data. Then you can group by that new age category variable and sum the suicides.

library(tidyverse)

df <- tibble(
  age = runif(100) %/% 0.01,
  suicides = rpois(100, 5),
  year = rep(1981:2000, each = 5)
)

df %>% 
  mutate(age_cat = cut(age, c(-Inf, seq(10, 90, 10), Inf))) %>%
  group_by(age_cat) %>% 
  summarize(suicides = sum(suicides))
#> # A tibble: 10 x 2
#>    age_cat   suicides
#>  * <fct>        <int>
#>  1 (-Inf,10]       79
#>  2 (10,20]         49
#>  3 (20,30]         53
#>  4 (30,40]         71
#>  5 (40,50]         13
#>  6 (50,60]         47
#>  7 (60,70]         84
#>  8 (70,80]         64
#>  9 (80,90]         19
#> 10 (90, Inf]       28

Created on 2020-05-02 by the reprex package (v0.3.0)

Thank you very much for your fast and kind answer @mfherman.
You're the one who got close enough to what I'm looking for.
What I really want is something like this:

year age suicides
1981 10-19 197
1981 20-29 752
1981 30-39 884
1981 40-49 893
1981 50-59 1094
1981 60-69 943
1981 70-79 720
1981 80-89 226
1982 10-19 158
1982 20-29 755
1982 30-39 873
... ... ... and so on

Is this possible? I thought of doing it manually building up vectors for each age gap but that would be extrermely long and tiresome :slight_smile:
I also thought of a for loop but I'm not expert enough to do so.

The really nice thing about using the group_by() %>% summarize() paradigm is that you can easily add more grouping levels. So all you need to modify in my code above is to add year as a grouping variable.

library(tidyverse)

df <- tibble(
  age = runif(10000) %/% 0.01,
  suicides = rpois(10000, 5),
  year = rep(1981:2000, each = 500)
)

df %>% 
  mutate(age_cat = cut(age, c(-Inf, seq(10, 90, 10), Inf))) %>%
  group_by(year, age_cat) %>% 
  summarize(
    n = n(),
    suicides = sum(suicides)
    )
#> # A tibble: 200 x 4
#> # Groups:   year [20]
#>     year age_cat       n suicides
#>    <int> <fct>     <int>    <int>
#>  1  1981 (-Inf,10]    56      286
#>  2  1981 (10,20]      47      222
#>  3  1981 (20,30]      39      202
#>  4  1981 (30,40]      53      271
#>  5  1981 (40,50]      62      291
#>  6  1981 (50,60]      48      257
#>  7  1981 (60,70]      52      262
#>  8  1981 (70,80]      55      275
#>  9  1981 (80,90]      52      262
#> 10  1981 (90, Inf]    36      186
#> # … with 190 more rows

Created on 2020-05-02 by the reprex package (v0.3.0)

1 Like

@mfherman thank you very very much!! It worked! I didn't even think of using the cut function to split the age variable into intervals.
This issue annoyed me for too much, so again really thank you!!
It's nice having a community to reley on when problems arise :slight_smile:

So glad it helped! Just for fun, there is a package called santoku that has a lot of the same functionality as cut(), but with some nice extra features. Check it out if you need to do a lot of this kind of data manipulation.

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

Hi @mfherman I marked the answer as Solution and I will sure take a look at the santoku package.
Cheers and thanks again for everything!

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.