Mean according to specific year

Hello! I am trying to create a new variable, which is the mean of another variable and based on two other variables that are the criteria for the mean.
I explain. My aim to calculate the mean of GDP considering the years that a bill stayed in the parliament. So, I have one column which is the year of entrance in the Legislative and another column for the final year of approval (e.g. 2009-2010, 2008-2014, etc). And I have several other columns that are the GDP for each year (extracted from an online dataset).
On top of that, my data varies not only on time, but is clustered in 4 countries. So, I have the same bill for each country, needing to calculate the mean of GDP for each period, that is different in each observation.
I have tried to look for any code that could help me, but I haven't found any. So, any help is appreciated!

Not sure I understand how your data is organized. So I'm assuming you have 2 separate data.frames:

library(tidyverse)

# create dummy data
bills_df <- tibble(bill = LETTERS[1:3],
                   year_entrance = c(2009, 2008, 2015),
                   year_approval = c(2010, 2014, 2017))

gdp_df <- tibble(year = 2000:2019,
                 GDP = rnorm(20, 1000, 100))

Now, to make life easier we can make a function that computes the mean GDP over a period.

get_mean_gdp <- function(year_beg, year_end){
  stopifnot(year_beg < year_end)
  filter(gdp_df,
         year >= year_beg,
         year <= year_end) %>%
    pull(GDP) %>%
    mean()
}

And we just need to appy it for each row of the bills dataset.

bills_df %>%
  rowwise() %>%
  mutate(mean_gdp = get_mean_gdp(year_entrance, year_approval))

The countries should not be a problem as long as you have two separate rows for the same bill in two countries. Else you might need to do some pivot_longer() or the like.

I appreciate your response. I understand the function, but I can't apply it to my data, because it is a matrix, as in the following image
Capture

I get the following error:
Error: Problem with mutate() input mean_gdp.
x could not find function "get_mean_gdp"
i Input mean_gdp is get_mean_gdp(yearsubm, yearappr).
i The error occurred in row 1.
Run rlang::last_error() to see where the error occurred.

I guess that I'm missing a simple step, but I'm quite new to coding, so it is somehow challenging.

Thank you in advance

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

Thank you for the guide on reprex. I hope I have done it correctly

df <- data.frame(
stringsAsFactors = FALSE,
row.names = c("ASU_PRY","SAC_PRY","POP_PRY",
"VEN_PRY","IND_PRY","BOL2_PRY","EGY_PRY","ISR_PRY",
"POP_BRA","BOL1_URY","ASU_ARG","SAC_URY","ADH_PRY",
"PAR_PRY","EGY_URY"),
cntry = c("Paraguay","Paraguay",
"Paraguay","Paraguay","Paraguay","Paraguay","Paraguay",
"Paraguay","Brazil","Uruguay","Argentina","Uruguay",
"Paraguay","Paraguay","Uruguay"),
yearsubm = c(1991L,2010L,1995L,2007L,
2007L,2015L,2015L,2009L,1995L,2013L,1991L,2010L,
2006L,2006L,2013L),
yearappr = c(1991L,2012L,1995L,2013L,
2007L,2016L,2016L,2010L,1995L,2014L,1991L,2011L,
2006L,2006L,2015L)
)

get_mean_gdp <- data.frame(
stringsAsFactors = FALSE,
Year = c("1990","1991","1992","1993",
"1994","1995","1996","1997","1998","1999","2000",
"2001","2002","2003","2004"),
Argentina = c(204022.709576019,
225605.750803947,247277.638553345,261428.668260367,276686.170019566,
268813.868517326,283670.377245591,306679.014181442,
318486.704781525,307704.474172928,305276.689188273,
291817.529327736,260025.512893345,283004.073515091,
308558.133837269),
Brasil = c(1190328.7904512,
1202589.17699285,1196095.19543709,1254943.07905259,1328357.24917717,
1384413.92509244,1414185.73384553,1462195.16145589,
1467138.81261922,1474004.10627881,1538682.66125963,
1560068.7562445,1607704.86065684,1626046.02392186,
1719705.69987883),
Paraguay = c(15072.4009298585,
15444.4660701346,15706.4703187341,16481.7981450403,17358.286599311,
18542.6095659408,18834.4303983244,19633.4841408321,
19646.8423228185,19378.4507957219,18930.0062043638,
18772.11959474,18768.1015361058,19579.0234359506,
20373.4263282599),
Uruguay = c(22100.9277328455,
22883.0372408505,24698.0255693715,25354.3869873348,27200.5267832485,
26806.7722354053,28302.0426702529,29730.6489648803,
31074.1443152037,30471.5507314667,29883.4709399851,
28734.7114813469,26512.9415151065,26726.4459698004,
28063.8801838554)
)

mean_gdp <- function(year_beg, year_end){
stopifnot(year_beg < year_end)
filter(GDP,
year >= year_beg,
year <= year_end) %>%
pull(GDP) %>%
mean()
}

mydata %>%
rowwise() %>%
mutate(mean_gdp = get_mean_gdp(yearsubm, yearappr))

here is a solution you can try.
I use a library called fuzzyjoin to match between the years.

please not that I made the bill_name explicit rather than mere rownames on the bills_df

library(tidyverse)
library(fuzzyjoin)

bills_df <- data.frame(
  stringsAsFactors = FALSE,
  bill_name = c("ASU_PRY","SAC_PRY","POP_PRY",
                "VEN_PRY","IND_PRY","BOL2_PRY","EGY_PRY","ISR_PRY",
                "POP_BRA","BOL1_URY","ASU_ARG","SAC_URY","ADH_PRY",
                "PAR_PRY","EGY_URY"),
  cntry = c("Paraguay","Paraguay",
            "Paraguay","Paraguay","Paraguay","Paraguay","Paraguay",
            "Paraguay","Brazil","Uruguay","Argentina","Uruguay",
            "Paraguay","Paraguay","Uruguay"),
  yearsubm = c(1991L,2010L,1995L,2007L,
               2007L,2015L,2015L,2009L,1995L,2013L,1991L,2010L,
               2006L,2006L,2013L),
  yearappr = c(1991L,2012L,1995L,2013L,
               2007L,2016L,2016L,2010L,1995L,2014L,1991L,2011L,
               2006L,2006L,2015L)
) 

gdpwide <- data.frame(
  stringsAsFactors = FALSE,
  Year = c("1990","1991","1992","1993",
           "1994","1995","1996","1997","1998","1999","2000",
           "2001","2002","2003","2004"),
  Argentina = c(204022.709576019,
                225605.750803947,247277.638553345,261428.668260367,276686.170019566,
                268813.868517326,283670.377245591,306679.014181442,
                318486.704781525,307704.474172928,305276.689188273,
                291817.529327736,260025.512893345,283004.073515091,
                308558.133837269),
  Brasil = c(1190328.7904512,
             1202589.17699285,1196095.19543709,1254943.07905259,1328357.24917717,
             1384413.92509244,1414185.73384553,1462195.16145589,
             1467138.81261922,1474004.10627881,1538682.66125963,
             1560068.7562445,1607704.86065684,1626046.02392186,
             1719705.69987883),
  Paraguay = c(15072.4009298585,
               15444.4660701346,15706.4703187341,16481.7981450403,17358.286599311,
               18542.6095659408,18834.4303983244,19633.4841408321,
               19646.8423228185,19378.4507957219,18930.0062043638,
               18772.11959474,18768.1015361058,19579.0234359506,
               20373.4263282599),
  Uruguay = c(22100.9277328455,
              22883.0372408505,24698.0255693715,25354.3869873348,27200.5267832485,
              26806.7722354053,28302.0426702529,29730.6489648803,
              31074.1443152037,30471.5507314667,29883.4709399851,
              28734.7114813469,26512.9415151065,26726.4459698004,
              28063.8801838554)
)

gdpwide %>% pivot_longer(cols=-Year,
                              names_to="country",
                              values_to="gdp") -> gdp_long

master_dataset <- fuzzy_left_join(bills_df ,
                gdp_long,
                by=c("yearsubm"="Year",
                     "yearappr"="Year",
                     "cntry"="country"),
                match_fun = list(`>=`,`<=`,`==`)
)

master_dataset %>% group_by(bill_name,cntry) %>% summarise(mean_gdp=mean(gdp,na.rm=TRUE))

Thank you! When running the code you wrote with the smaller dataframe, it works. However, when I apply to all my data (I have included the bills as a column and not only rownames), I got this message:

Error: All columns in a tibble must be vectors.
x Column `col` is NULL.

I have tried to convert into vectors, but I still got the same message.

I don't know how to help you if I can't reproduce your issue...

Sorry for that. Here it goes:

mydata <- data.frame(
  stringsAsFactors = FALSE,
         row.names = c("ASU_PRY","SAC_PRY","POP_PRY",
                       "VEN_PRY","IND_PRY","BOL2_PRY","EGY_PRY","ISR_PRY",
                       "POP_BRA","BOL1_URY","ASU_ARG","SAC_URY","ADH_PRY",
                       "PAR_PRY","EGY_URY","SAC_ARG","BOL1_ARG","EGY_BRA",
                       "ISR_BRA","BOL2_URY","VEN_BRA","ISR_ARG","ADH_BRA",
                       "IND_URY","ISR_URY","IND_ARG","IND_BRA","PAR_BRA",
                       "EGY_ARG","SAC2_BRA","OLI_BRA","ADH_URY","USH_BRA",
                       "ADH_ARG","PAR_ARG","VEN_ARG","PAR_URY","VEN_URY",
                       "OLI_PRY","USH_ARG","POP_ARG","OLI_ARG","SAC_BRA","USH_URY",
                       "USH_PRY","OLI_URY","POP_URY","ASU_URY","ASU_BRA"),
              bill = c("ASU_PRY","SAC_PRY","POP_PRY",
                       "VEN_PRY","IND_PRY","BOL2_PRY","EGY_PRY","ISR_PRY",
                       "POP_BRA","BOL1_URY","ASU_ARG","SAC_URY","ADH_PRY",
                       "PAR_PRY","EGY_URY","SAC_ARG","BOL1_ARG","EGY_BRA",
                       "ISR_BRA","BOL2_URY","VEN_BRA","ISR_ARG","ADH_BRA",
                       "IND_URY","ISR_URY","IND_ARG","IND_BRA","PAR_BRA",
                       "EGY_ARG","SAC2_BRA","OLI_BRA","ADH_URY","USH_BRA",
                       "ADH_ARG","PAR_ARG","VEN_ARG","PAR_URY","VEN_URY",
                       "OLI_PRY","USH_ARG","POP_ARG","OLI_ARG","SAC_BRA","USH_URY",
                       "USH_PRY","OLI_URY","POP_URY","ASU_URY","ASU_BRA"),
             cntry = c("Paraguay","Paraguay",
                       "Paraguay","Paraguay","Paraguay","Paraguay","Paraguay",
                       "Paraguay","Brazil","Uruguay","Argentina","Uruguay",
                       "Paraguay","Paraguay","Uruguay","Argentina","Argentina",
                       "Brazil","Brazil","Uruguay","Brazil","Argentina",
                       "Brazil","Uruguay","Uruguay","Argentina","Brazil",
                       "Brazil","Argentina","Brazil","Brazil","Uruguay","Brazil",
                       "Argentina","Argentina","Argentina","Uruguay",
                       "Uruguay","Paraguay","Argentina","Argentina","Argentina",
                       "Brazil","Uruguay","Paraguay","Uruguay","Uruguay",
                       "Uruguay","Brazil"),
          yearsubm = c(1991L,2010L,1995L,2007L,
                       2007L,2015L,2015L,2009L,1995L,2013L,1991L,2010L,
                       2006L,2006L,2013L,2011L,2013L,2013L,2008L,2016L,2007L,
                       2009L,2006L,2008L,2008L,2007L,2005L,2006L,2017L,
                       2015L,2002L,2006L,1999L,2006L,2006L,2006L,2006L,
                       2006L,2002L,1998L,1995L,2002L,2010L,1998L,1998L,
                       2002L,1995L,1991L,1991L),
          yearappr = c(1991L,2012L,1995L,2013L,
                       2007L,2016L,2016L,2010L,1995L,2014L,1991L,2011L,
                       2006L,2006L,2015L,2012L,2014L,2015L,2009L,2016L,2009L,
                       2011L,2009L,2008L,2008L,2008L,2008L,2006L,2017L,
                       2015L,2004L,2008L,2001L,2006L,2006L,2006L,2006L,
                       2006L,2002L,1999L,1995L,2002L,2010L,1999L,1998L,
                       2003L,1995L,1991L,1991L)
)

gdp_long <- data.frame(
  stringsAsFactors = FALSE,
              year = c("1990","1990","1990","1990",
                       "1991","1991","1991","1991","1992","1992","1992",
                       "1992","1993","1993","1993","1993","1994","1994",
                       "1994","1994","1995","1995","1995","1995","1996",
                       "1996","1996","1996","1997","1997","1997","1997",
                       "1998","1998","1998","1998","1999","1999","1999",
                       "1999","2000","2000","2000","2000","2001","2001",
                       "2001","2001","2002","2002","2002","2002","2003","2003",
                       "2003","2003","2004","2004","2004","2004","2005",
                       "2005","2005","2005","2006","2006","2006","2006",
                       "2007","2007","2007","2007","2008","2008","2008",
                       "2008","2009","2009","2009","2009","2010","2010",
                       "2010","2010","2011","2011","2011","2011","2012",
                       "2012","2012","2012","2013","2013","2013","2013",
                       "2014","2014","2014","2014","2015","2015","2015",
                       "2015","2016","2016","2016","2016","2017","2017","2017",
                       "2017","2018","2018","2018","2018"),
           country = c("Argentina","Brazil",
                       "Paraguay","Uruguay","Argentina","Brazil","Paraguay",
                       "Uruguay","Argentina","Brazil","Paraguay","Uruguay",
                       "Argentina","Brazil","Paraguay","Uruguay","Argentina",
                       "Brazil","Paraguay","Uruguay","Argentina","Brazil",
                       "Paraguay","Uruguay","Argentina","Brazil","Paraguay",
                       "Uruguay","Argentina","Brazil","Paraguay","Uruguay",
                       "Argentina","Brazil","Paraguay","Uruguay","Argentina",
                       "Brazil","Paraguay","Uruguay","Argentina","Brazil",
                       "Paraguay","Uruguay","Argentina","Brazil","Paraguay",
                       "Uruguay","Argentina","Brazil","Paraguay","Uruguay",
                       "Argentina","Brazil","Paraguay","Uruguay","Argentina",
                       "Brazil","Paraguay","Uruguay","Argentina","Brazil",
                       "Paraguay","Uruguay","Argentina","Brazil","Paraguay",
                       "Uruguay","Argentina","Brazil","Paraguay","Uruguay",
                       "Argentina","Brazil","Paraguay","Uruguay","Argentina",
                       "Brazil","Paraguay","Uruguay","Argentina","Brazil",
                       "Paraguay","Uruguay","Argentina","Brazil","Paraguay",
                       "Uruguay","Argentina","Brazil","Paraguay","Uruguay",
                       "Argentina","Brazil","Paraguay","Uruguay",
                       "Argentina","Brazil","Paraguay","Uruguay","Argentina","Brazil",
                       "Paraguay","Uruguay","Argentina","Brazil",
                       "Paraguay","Uruguay","Argentina","Brazil","Paraguay",
                       "Uruguay","Argentina","Brazil","Paraguay","Uruguay"),
               gdp = c(204022.709576019,
                       1190328.7904512,15072.4009298585,22100.9277328455,225605.750803947,
                       1202589.17699285,15444.4660701346,22883.0372408505,
                       247277.638553345,1196095.19543709,15706.4703187341,
                       24698.0255693715,261428.668260367,1254943.07905259,
                       16481.7981450403,25354.3869873348,276686.170019566,
                       1328357.24917717,17358.286599311,27200.5267832485,
                       268813.868517326,1384413.92509244,18542.6095659408,26806.7722354053,
                       283670.377245591,1414185.73384553,18834.4303983244,
                       28302.0426702529,306679.014181442,1462195.16145589,
                       19633.4841408321,29730.6489648803,318486.704781525,
                       1467138.81261922,19646.8423228185,31074.1443152037,
                       307704.474172928,1474004.10627881,19378.4507957219,
                       30471.5507314667,305276.689188273,1538682.66125963,18930.0062043638,
                       29883.4709399851,291817.529327736,1560068.7562445,
                       18772.11959474,28734.7114813469,260025.512893345,
                       1607704.86065684,18768.1015361058,26512.9415151065,
                       283004.073515091,1626046.02392186,19579.0234359506,
                       26726.4459698004,308558.133837269,1719705.69987883,20373.4263282599,
                       28063.8801838554,335870.650498235,1774772.93573377,
                       20808.0914770334,30157.4827229858,362898.670600149,
                       1845089.2393129,21808.3608199,31393.5104782487,
                       395587.315892942,1957083.76872605,22990.7278981641,33447.120372214,
                       411637.215422162,2056781.44097077,24452.736056455,
                       35847.3341150742,387274.363587343,2054193.76311241,
                       24389.6549826671,37368.513657464,426487.434874632,
                       2208838.10859319,27107.573694974,40284.5318602359,
                       452093.534425313,2296626.68016874,28259.404602775,42364.0729820816,
                       447453.153875727,2340748.92041541,28107.2240515083,
                       43862.9895982757,458215.851022343,2411084.2746306,
                       30473.148310721,45897.1526850981,446702.64932211,
                       2423235.07223542,31954.4146425446,47383.6656404033,
                       458902.812680434,2337312.89010631,32938.7301255886,47559.3364376997,
                       449356.129667934,2260744.4619857,34359.3462542358,
                       48362.9932304934,462021.24554028,2290651.15086124,
                       36061.4300076708,49616.242186666,450168.774848251,
                       2320824.15750549,37271.7135902413,50420.0668036221)
)

master_dataset <- fuzzy_left_join(mydata,
                                  gdp_long,
                                  by=c("yearsubm"="Year",
                                       "yearappr"="Year",
                                       "cntry"="country"),
                                  match_fun = list(`>=`,`<=`,`==`)
)

master_dataset %>% group_by(bill,cntry) %>% summarise(mean_gdp=mean(gdp,na.rm=TRUE))

1 Like

Thanks, it seems the gdp_long dataframe has column year instead of Year, so either it should be renamed in the frame before the fuzzy left join, or else the fuzzy left join should refer you year rather than Year. but the function call should match the dataframe contents :slight_smile:

Indeed, this worked for running the code. However, it can't calculate the mean over time (which is what I'm looking for). It gets a result only for those bills that stayed in the parliament less than one year (yearsubm and yearappr are the same), indicating the GDP in that year. For those that stayed longer (more than one value for gdp to be calculated), it appears 'Not a Number'.
I have been trying variations in the code, but every time either I get the same result or the code doesn't run.

In the console (for the same code that I sent last time), I got:

`summarise()` regrouping output by 'bill' (override with `.groups` argument)
# A tibble: 49 x 3
# Groups:   bill [49]
   bill     cntry     mean_gdp
   <chr>    <chr>        <dbl>
 1 ADH_ARG  Argentina  362899.
 2 ADH_BRA  Brazil        NaN 
 3 ADH_PRY  Paraguay    21808.
 4 ADH_URY  Uruguay       NaN 
 5 ASU_ARG  Argentina  225606.
 6 ASU_BRA  Brazil    1202589.
 7 ASU_PRY  Paraguay    15444.
 8 ASU_URY  Uruguay     22883.
 9 BOL1_ARG Argentina     NaN 
10 BOL1_URY Uruguay       NaN 
# ... with 39 more rows

It seems my intuition for the order in which the match_fun's are applied was wrong. so you could change them.



numframe <- tibble(
  from=c(2,4,6),
  to =c(3,4,9))

numlong <- enframe(1:10,name=NULL)

#not right
fuzzy_left_join(numframe,
                numlong,
                by=c("from"="value",
                     "to"="value"),
                match_fun = list(`>=`,`<=`))

#right
fuzzy_left_join(numframe,
                numlong,
                by=c("from"="value",
                     "to"="value"),
                match_fun = list(`<=`,`>=`))

It runs, but it doesn't calculate the mean of the period. It extracts one of the values. This is also true in your last example.
My guess is that when merging, only one of the years is incorporated into the main data frame, leaving the others that are necessary to calculate.

mydata <- fuzzy_left_join(mydata ,
                          gdp_long,
                          by=c("yearsubm"="year",
                               "yearappr"="year",
                               "cntry"="country"),
                          match_fun = list(`<=`,`>=`,`==`)
)

The first answer in this post suggested a function that seems to be logically corrected:

get_mean_gdp <- function(year_beg, year_end){
  stopifnot(year_beg < year_end)
  filter(gdp_df,
         year >= year_beg,
         year <= year_end) %>%
    pull(gdp) %>%
    mean()
}

mydata %>%
  rowwise() %>%
  mutate(mean_gdp = get_mean_gdp(yearsubm, yearappr))

Yet, it doesn't work, generating the following error:

Error: Problem with `mutate()` input `mean_gdp`.
x year_beg < year_end is not TRUE
i Input `mean_gdp` is `get_mean_gdp(yearsubm, yearappr)`.
i The error occurred in row 1.

really ?
but ...

numframe <- tibble(
  from=c(2,4,6),
  to =c(3,4,9))

numlong <- enframe(1:10,name=NULL)


#right -- with mean 
fuzzy_left_join(numframe,
                numlong,
                by=c("from"="value",
                     "to"="value"),
                match_fun = list(`<=`,`>=`)) %>% 
  group_by(from,to) %>% 
  summarise(mean_val=mean(value,na.rm=TRUE))

this clearly works, what am I missing ?

Okay, now it's working perfectly fine.

I very much appreciated your help! And I leave this part of the code to those who face the same issue:

gdp_df <- read_excel("Economic indicators/historicseries/CEPAL/gdpNEW.xlsx")

gdp_df %>% pivot_longer(cols=-year,
                         names_to="country",
                         values_to="gdp") -> gdp_long

gdp_mean <- fuzzy_left_join(mydata,
                          gdp_long,
                          by=c("yearsubm"="year",
                               "yearappr"="year",
                               "cntry"="country"),
                          match_fun = list(`<=`,`>=`,`==`))%>% 
  group_by(bill,cntry) %>% 
  summarise(mean_gdp=mean(gdp,na.rm=TRUE))

mydata <- left_join(mydata, gdp_mean, by = c("cntry", "bill"))

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.