group by function not working | calculating year on year growth per company

gvkey <- c(1, 1, 1, 1, 2,2,2, 4, 4 )
Fyear <- c(2005,2006,2007,2008, 2007,2008,2009 , 2011,2012)
cashflow <- c(100, 110, 120, 130, 500, 550, 600, 50, 60)
lagAT <- c(1000,1500,1300,1200, 300,500, 800, 70, 40)

set1 <- data.frame(gvkey, Fyear, cashflow, lagAT)

set1 <- set1%>%
group_by(gvkey) %>%
arrange(Fyear) %>%
mutate(difcash = (cashflow - lag(cashflow)))

set1 <- set1%>%
group_by(gvkey) %>%
arrange(Fyear) %>%
mutate(yoy2= difcash/lagAT)

Dear everyone,

In my datasheet I am trying to calculate the year on year growth in cashflow for a number of companies throughout the years. I have been trying to calculate this difference via the code above.
But this method does not introduce a condition that he shouldn't use the 'cashflow' data if the 'gvkey' is different.

Does anyone know how I can add this condition into my code so that i can calculate the true growth?

Thank you in advance!

This returns just what I would expect. All I added is an arrange() at the end to make it easier to evaluate the calculations. Do you get something different?

library(dplyr)

gvkey <- c(1, 1, 1, 1, 2,2,2, 4, 4 )
Fyear <- c(2005,2006,2007,2008, 2007,2008,2009 , 2011,2012)
cashflow <- c(100, 110, 120, 130, 500, 550, 600, 50, 60)
lagAT <- c(1000,1500,1300,1200, 300,500, 800, 70, 40)

set1 <- data.frame(gvkey, Fyear, cashflow, lagAT)

set1 <- set1%>%
  group_by(gvkey) %>%
  arrange(Fyear) %>%
  mutate(difcash = (cashflow - lag(cashflow)))

set1 <- set1%>%
  group_by(gvkey) %>%
  arrange(Fyear) %>%
  mutate(yoy2= difcash/lagAT) %>% 
  arrange(gvkey, Fyear)

set1
#> # A tibble: 9 x 6
#> # Groups:   gvkey [3]
#>   gvkey Fyear cashflow lagAT difcash     yoy2
#>   <dbl> <dbl>    <dbl> <dbl>   <dbl>    <dbl>
#> 1     1  2005      100  1000      NA NA      
#> 2     1  2006      110  1500      10  0.00667
#> 3     1  2007      120  1300      10  0.00769
#> 4     1  2008      130  1200      10  0.00833
#> 5     2  2007      500   300      NA NA      
#> 6     2  2008      550   500      50  0.1    
#> 7     2  2009      600   800      50  0.0625 
#> 8     4  2011       50    70      NA NA      
#> 9     4  2012       60    40      10  0.25

Created on 2020-05-16 by the reprex package (v0.3.0)

I found out why it was not working at my Rstudio.
Apparently I had to install the package of dplyr and then use the library key to activate it.
If I only use the library function your method is not working.

Do you know what may be the root cause of this problem?

Sorry, I don't know what that means. What commands did you run?

tools -> install packages -> dplyr
after restarting I use command library(dplyr) and the code works as it should

It seems something was wrong with your dpylr package. It was installed, otherwise library(dplyr) would have thrown an error, but group_by was not working correctly. Unless something similar happens again, I would put it down as a weird computer mystery and not worry about it.

well it happens all the time actually
so everytime I run the code I now do first tools -> install dplyr -> library (dplyr)

OK, that is not how installing dplyr should work, so something is wrong. I do not know enough about how to debug such a problem to be helpful. I suggest you start a new thread with a a title like "Need to reinstall dplyr repeatedly", include a reprex, either the one you used in this thread or a simpler one taking the average of a few groups, and explain that the code only works after a fresh installation of dplyr. Mention whether group_by works until you terminate R, or if it fails after the first use, or whatever describes the situation. Sorry I can't help you more.

No problem,
Thanks a lot for all your help!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

Thank you for your reply!

I don't get any changes to my subset if I copy paste your code (meaning he still doesn't take the condition into account).

library(dplyr)

gvkey <- c(1, 1, 1, 1, 2,2,2, 4, 4 )
Fyear <- c(2005,2006,2007,2008, 2007,2008,2009 , 2011,2012)
cashflow <- c(100, 110, 120, 130, 500, 550, 600, 50, 60)
lagAT <- c(1000,1500,1300,1200, 300,500, 800, 70, 40)

set1 <- data.frame(gvkey, Fyear, cashflow, lagAT)

set1 <- set1%>%
group_by(gvkey) %>%
arrange(Fyear) %>%
mutate(difcash = (cashflow - lag(cashflow)))

set2 <- set1%>%
group_by(gvkey) %>%
arrange(Fyear) %>%
mutate(yoy2= difcash/lagAT) %>%
arrange(gvkey, Fyear)

set2
A tibble: 9 x 6
Groups: gvkey [3]
gvkey Fyear cashflow lagAT difcash yoy2

1 1 2005 100 1000 NA NA
2 1 2006 110 1500 10 0.00667
3 1 2007 120 1300 10 0.00769
4 1 2008 130 1200 -370 -0.308
5 2 2007 500 300 380 1.27
6 2 2008 550 500 420 0.84
7 2 2009 600 800 50 0.0625
8 4 2011 50 70 -550 -7.86
9 4 2012 60 40 10 0.25

It is not clear to me what do you expect when you mention this condition, how should look like the desired output? Meanwhile I tried this way:

set1 <- set1%>%
group_by(gvkey) %>%
mutate(difcash = (cashflow - lag(cashflow)),
difcash = replace_na(difcash, 0))
set1 <- set1%>%
group_by(gvkey) %>%
mutate(yoy2= difcash/lagAT)

I do not have any good ideas about why the code would work for me and not for you. You can try the standard computer weirdness cure by exiting and restarting RStudio. Other than that, I am stumped.

Thank you for your reply.
In my desired output the year on year growth is represented in variable yoy2 per company per year. My calculation is calculating this growth per year, but doesn't look at the gvkey (unique company key).

For example,

while calculating the yoy growth for gvkey 2 in 2007, i want the result to be NA, since 2006 for this gvkey is unknown. This worked in the calculation of FJCC, but when I copy his code, I don't get the same result.

Thank you in advance!

No idea why is not working previous answer, but this worked to me, :

set1 <- set1%>%
group_by(gvkey) %>%
mutate(difcash = (cashflow - lag(cashflow)),
yoy2= difcash/lagAT)

set1

A tibble: 9 x 6

Groups: gvkey [3]

gvkey Fyear cashflow lagAT difcash yoy2

1 1 2005 100 1000 NA NA
2 1 2006 110 1500 10 0.00667
3 1 2007 120 1300 10 0.00769
4 1 2008 130 1200 10 0.00833
5 2 2007 500 300 NA NA
6 2 2008 550 500 50 0.1
7 2 2009 600 800 50 0.0625
8 4 2011 50 70 NA NA
9 4 2012 60 40 10 0.25