Create a new "sumif" variable

Hi everyone,

I would like to create a new variable in a dataframe, essentially using the equivalent of a "sumif function" in excel.

I have a dataframe which looks like this (but is much larger):

Country   Year       Number of IMF programs during the year
  AFG     2000                       1
  ARG     2000                       1
  AFG     2001                       0
  ARG     2001                       1

I would like to create a new variable "number of IMF programs over the period": the sum of the number of programs in the country, over the period.

In the example, we can see from the column "IMF programs during the year" that Afghanistan had 1 program over the period (1 in 2000, 0 in 2001), and Argentina had 2.
so the table would look like this:

Country   Year  IMF programs during the year    N of programs over the period
  BRA     2000                 1                               1
  ARG     2000                 1                               2
  BRA     2001                 0                               1
  ARG     2001                 1                               2

I have seen tutorials, but they only show how to create subsets of observations matching certain criteria, and then calculate sums within the subsets. I would be able to do this, but it is not what I need. I would need another variable...

Would you be able to provide me with some guidance ?

Many thanks for your help !

I do not understand where the values in the new column come from. Are they just a running seq within each year independent of the value in "IMF programs during the year"?

Sorry - it is the sum of the number of programs in the country, over the period.
Afghanistan had 1 program over the period, Argentina had 2.

Basically a sum of the number of programs per year, in the corresponding country.

I edited my initial question so everyone could understand better.

Instead of using group_by and summarize I used group_by and mutate.

DF <- data.frame(Country = c("B", "A", "B", "A"),
                 Year = c(2000, 2000, 2001, 2001),
                 IMF_Prg = c(1,1,0,1))
DF
#>   Country Year IMF_Prg
#> 1       B 2000       1
#> 2       A 2000       1
#> 3       B 2001       0
#> 4       A 2001       1
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF %>% group_by(Country) %>% mutate(PrgInPeriod = sum(IMF_Prg))
#> # A tibble: 4 x 4
#> # Groups:   Country [2]
#>   Country  Year IMF_Prg PrgInPeriod
#>   <fct>   <dbl>   <dbl>       <dbl>
#> 1 B        2000       1           1
#> 2 A        2000       1           2
#> 3 B        2001       0           1
#> 4 A        2001       1           2

Created on 2020-04-29 by the reprex package (v0.3.0)

Thank you so much for the effort you've put in, it's incredible and I'm very grateful.
It nearly works, as it does create a new column with the sum of programs... but it sums programs for all countries !

hence the last column is:

PrgInPeriod
4
4
4
4

Do not worry about it though, I have found a (suboptimal, but good enough) solution:
I use

rowsum(dataset$IMF_Prg, dataset$country, na.rm = TRUE)

to see the number of programs per country, and then I will manually enter them into a vector (which I will then use as a column) when creating my dataframe.

Are you sure you included group_by(Country) %>% in your code? You'll get a sum over all countries only if the data aren't grouped by Country.

Hi Joels !
yes I did - strangely, R does not seem to notice it though... Oh well :man_shrugging:t3:

library(dplyr)
datasetyolo<- datasetyolo %>% group_by(country) %>% mutate(programperiod = sum(program, na.rm = TRUE))

In base R:

datayolo <- data.frame(Country=c("AFG", "ARG", "AFG", "ARG"), 
                       Year=c(2000,2000,2001,2001), 
                      `IMF Programs during the year`=c(1,1,0,1), 
                      check.names=FALSE)
datayolo[["N of programs over the period"]] <- tapply(datayolo[["IMF Programs during the year"]], 
                                                      datayolo$Country,
                                                      sum)[datayolo$Country]
datayolo
#>   Country Year IMF Programs during the year N of programs over the period
#> 1     AFG 2000                            1                             1
#> 2     ARG 2000                            1                             2
#> 3     AFG 2001                            0                             1
#> 4     ARG 2001                            1                             2

Those are backticks around the first IMF Programs during the year so it works inside the data.frame() function,
check.names = FALSE keeps dots from replacing the spaces in the name.
The others are standard double-quotes.
Since the OQ is for "sumif", the third argment to tapply is "sum", but any other function could be used, e.g. mean, median, sd.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.