Aggregating rows to create a new data frame

Dany · May 2, 2020, 4:53pm

Hi everyone,
I just started to learn R and I'm having an issue. I have a data frame, a very simple one, of three columns of factor and integers.

library(tidyverse)
head(df,5)
#>                                               
#> 1 function (x, df1, df2, ncp, log = FALSE)    
#> 2 {                                           
#> 3     if (missing(ncp))                       
#> 4         .Call(C_df, x, df1, df2, log)       
#> 5     else .Call(C_dnf, x, df1, df2, ncp, log)

data.frame(
         age = c(10L, 11L, 12L, 13L, 14L),
    suicides = c(0L, 0L, 2L, 2L, 1L),
        year = as.factor(c("1981", "1981", "1981", "1981", "1981"))
)
#>   age suicides year
#> 1  10        0 1981
#> 2  11        0 1981
#> 3  12        2 1981
#> 4  13        2 1981
#> 5  14        1 1981

^{Created on 2020-05-02 by the reprex package (v0.3.0)}

I would like to create a new dataframe where the columns "year" still remain the same, the columns "age" would show an interval like "10-19", "20-29" and so on, and the columns "suicides" would show the cumulative sum of values ranging from age 10 to age 19 then from age 20 to age 29 and so on.
Keep in mind that year span from 1981 to 2017 and age from 10 to 89 for all the year variable.

I think maybe it's an easy task but I didn' t find anything for now on internet.

I hope I've been clear enough explaining my issue and thanks in advance for your help!

mfherman · May 2, 2020, 5:35pm

Hi @Dany and welcome! Thanks for your question and for including some sample data! There are a few ways to do this, but I like to use the cut() function, which essentially creates bins from a continuous variable.

I generated some more fake data that sounds like it matches your data structure so you can see how it works. The first step in the pipeline is to create a new variable which is the age categories using cut. This will assign one age bin to each of the rows in your original data. Then you can group by that new age category variable and sum the suicides.

library(tidyverse)

df <- tibble(
  age = runif(100) %/% 0.01,
  suicides = rpois(100, 5),
  year = rep(1981:2000, each = 5)
)

df %>% 
  mutate(age_cat = cut(age, c(-Inf, seq(10, 90, 10), Inf))) %>%
  group_by(age_cat) %>% 
  summarize(suicides = sum(suicides))
#> # A tibble: 10 x 2
#>    age_cat   suicides
#>  * <fct>        <int>
#>  1 (-Inf,10]       79
#>  2 (10,20]         49
#>  3 (20,30]         53
#>  4 (30,40]         71
#>  5 (40,50]         13
#>  6 (50,60]         47
#>  7 (60,70]         84
#>  8 (70,80]         64
#>  9 (80,90]         19
#> 10 (90, Inf]       28

^{Created on 2020-05-02 by the reprex package (v0.3.0)}

Dany · May 2, 2020, 6:03pm

Thank you very much for your fast and kind answer @mfherman.
You're the one who got close enough to what I'm looking for.
What I really want is something like this:

year	age	suicides
1981	10-19	197
1981	20-29	752
1981	30-39	884
1981	40-49	893
1981	50-59	1094
1981	60-69	943
1981	70-79	720
1981	80-89	226
1982	10-19	158
1982	20-29	755
1982	30-39	873
... ... ... and so on

Is this possible? I thought of doing it manually building up vectors for each age gap but that would be extrermely long and tiresome
I also thought of a for loop but I'm not expert enough to do so.

mfherman · May 2, 2020, 6:16pm

The really nice thing about using the group_by() %>% summarize() paradigm is that you can easily add more grouping levels. So all you need to modify in my code above is to add year as a grouping variable.

library(tidyverse)

df <- tibble(
  age = runif(10000) %/% 0.01,
  suicides = rpois(10000, 5),
  year = rep(1981:2000, each = 500)
)

df %>% 
  mutate(age_cat = cut(age, c(-Inf, seq(10, 90, 10), Inf))) %>%
  group_by(year, age_cat) %>% 
  summarize(
    n = n(),
    suicides = sum(suicides)
    )
#> # A tibble: 200 x 4
#> # Groups:   year [20]
#>     year age_cat       n suicides
#>    <int> <fct>     <int>    <int>
#>  1  1981 (-Inf,10]    56      286
#>  2  1981 (10,20]      47      222
#>  3  1981 (20,30]      39      202
#>  4  1981 (30,40]      53      271
#>  5  1981 (40,50]      62      291
#>  6  1981 (50,60]      48      257
#>  7  1981 (60,70]      52      262
#>  8  1981 (70,80]      55      275
#>  9  1981 (80,90]      52      262
#> 10  1981 (90, Inf]    36      186
#> # … with 190 more rows

^{Created on 2020-05-02 by the reprex package (v0.3.0)}

Dany · May 2, 2020, 6:42pm

@mfherman thank you very very much!! It worked! I didn't even think of using the cut function to split the age variable into intervals.
This issue annoyed me for too much, so again really thank you!!
It's nice having a community to reley on when problems arise

mfherman · May 2, 2020, 6:45pm

So glad it helped! Just for fun, there is a package called santoku that has a lot of the same functionality as cut(), but with some nice extra features. Check it out if you need to do a lot of this kind of data manipulation.

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

Dany · May 3, 2020, 9:56am

Hi @mfherman I marked the answer as Solution and I will sure take a look at the santoku package.
Cheers and thanks again for everything!

system · May 10, 2020, 9:56am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.