Create a new "sumif" variable

remilsg · April 29, 2020, 6:42pm

Hi everyone,

I would like to create a new variable in a dataframe, essentially using the equivalent of a "sumif function" in excel.

I have a dataframe which looks like this (but is much larger):

Country   Year       Number of IMF programs during the year
  AFG     2000                       1
  ARG     2000                       1
  AFG     2001                       0
  ARG     2001                       1

I would like to create a new variable "number of IMF programs over the period": the sum of the number of programs in the country, over the period.

In the example, we can see from the column "IMF programs during the year" that Afghanistan had 1 program over the period (1 in 2000, 0 in 2001), and Argentina had 2.
so the table would look like this:

Country   Year  IMF programs during the year    N of programs over the period
  BRA     2000                 1                               1
  ARG     2000                 1                               2
  BRA     2001                 0                               1
  ARG     2001                 1                               2

I have seen tutorials, but they only show how to create subsets of observations matching certain criteria, and then calculate sums within the subsets. I would be able to do this, but it is not what I need. I would need another variable...

Would you be able to provide me with some guidance ?

Many thanks for your help !

FJCC · April 29, 2020, 6:50pm

I do not understand where the values in the new column come from. Are they just a running seq within each year independent of the value in "IMF programs during the year"?

remilsg · April 29, 2020, 7:20pm

Sorry - it is the sum of the number of programs in the country, over the period.
Afghanistan had 1 program over the period, Argentina had 2.

Basically a sum of the number of programs per year, in the corresponding country.

I edited my initial question so everyone could understand better.

FJCC · April 29, 2020, 8:20pm

Instead of using group_by and summarize I used group_by and mutate.

DF <- data.frame(Country = c("B", "A", "B", "A"),
                 Year = c(2000, 2000, 2001, 2001),
                 IMF_Prg = c(1,1,0,1))
DF
#>   Country Year IMF_Prg
#> 1       B 2000       1
#> 2       A 2000       1
#> 3       B 2001       0
#> 4       A 2001       1
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF %>% group_by(Country) %>% mutate(PrgInPeriod = sum(IMF_Prg))
#> # A tibble: 4 x 4
#> # Groups:   Country [2]
#>   Country  Year IMF_Prg PrgInPeriod
#>   <fct>   <dbl>   <dbl>       <dbl>
#> 1 B        2000       1           1
#> 2 A        2000       1           2
#> 3 B        2001       0           1
#> 4 A        2001       1           2

^{Created on 2020-04-29 by the reprex package (v0.3.0)}

remilsg · April 29, 2020, 8:54pm

Thank you so much for the effort you've put in, it's incredible and I'm very grateful.
It nearly works, as it does create a new column with the sum of programs... but it sums programs for all countries !

hence the last column is:

PrgInPeriod
4
4
4
4

Do not worry about it though, I have found a (suboptimal, but good enough) solution:
I use

rowsum(dataset$IMF_Prg, dataset$country, na.rm = TRUE)

to see the number of programs per country, and then I will manually enter them into a vector (which I will then use as a column) when creating my dataframe.

joels · April 29, 2020, 9:04pm

Are you sure you included group_by(Country) %>% in your code? You'll get a sum over all countries only if the data aren't grouped by Country.

remilsg · April 30, 2020, 8:17am

Hi Joels !
yes I did - strangely, R does not seem to notice it though... Oh well

library(dplyr)
datasetyolo<- datasetyolo %>% group_by(country) %>% mutate(programperiod = sum(program, na.rm = TRUE))

Scoco · April 30, 2020, 2:29pm

In base R:

datayolo <- data.frame(Country=c("AFG", "ARG", "AFG", "ARG"), 
                       Year=c(2000,2000,2001,2001), 
                      `IMF Programs during the year`=c(1,1,0,1), 
                      check.names=FALSE)
datayolo[["N of programs over the period"]] <- tapply(datayolo[["IMF Programs during the year"]], 
                                                      datayolo$Country,
                                                      sum)[datayolo$Country]
datayolo
#>   Country Year IMF Programs during the year N of programs over the period
#> 1     AFG 2000                            1                             1
#> 2     ARG 2000                            1                             2
#> 3     AFG 2001                            0                             1
#> 4     ARG 2001                            1                             2

Those are backticks around the first IMF Programs during the year so it works inside the data.frame() function,
check.names = FALSE keeps dots from replacing the spaces in the name.
The others are standard double-quotes.
Since the OQ is for "sumif", the third argment to tapply is "sum", but any other function could be used, e.g. mean, median, sd.

system · May 21, 2020, 2:30pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.