Summation of row based on different id and same date

I have big data set and i am trying to do some calculations. I want to produce some calculations based on the row sum based on the same date with different IDs and store their value.
ex for data set:-

gvkey(Id) date a b
0001 1/1/2000 15624000 10011531920
0001 2/1/2000 18480000 10592724720
0001 3/1/2000 16968000 10160611743
0002 1/1/2000 9375333997 10011531920
0002 2/1/2000 9943536057 10592724720
0002 3/1/2000 9470034340 10160611743
0003 1/1/2000 441665123.3 10011531920
0003 2/1/2000 412041243.1 10592724720
0003 3/1/2000 417427403.1 10160611743

I am trying to do this code but values are coming constantly same for me.

library(readxl)
library(tidyverse)
library(plyr)
df<-NewData
head(NewData)

a<-NewData %>%
  mutate(firmMarket=NewData$cshoc*NewData$prccd)
  
head(a)
c<-ddply(b,.(gvkey,datadate),summarize,TotalMarket=sum(firmMarket))

Kindly help will be appreciated. Thanks in advance.

Can you make your data reproducible next time?

Your data:

library(tidyverse)
NewData <- tibble::tribble(
  ~gvkey,        ~date,          ~a,          ~b,
            1L, "01/01/2000",    15624000, 10011531920,
            1L, "02/01/2000",    18480000, 10592724720,
            1L, "03/01/2000",    16968000, 10160611743,
            2L, "01/01/2000",  9375333997, 10011531920,
            2L, "02/01/2000",  9943536057, 10592724720,
            2L, "03/01/2000",  9470034340, 10160611743,
            3L, "01/01/2000", 441665123.3, 10011531920,
            3L, "02/01/2000", 412041243.1, 10592724720,
            3L, "03/01/2000", 417427403.1, 10160611743
  )

a<-NewData %>%
mutate(firmMarket=NewData$cshoc*NewData$prccd)

Which column is meant to be cschoc and which one is meant to be prccd? Are they from somewhere else?

yes, that is the calculation for frimMarket which is "a" in my mentioned data set. You can just ignore that code. Problem is in next code.

Thanks

So what exactly are you after? Just the sum of each day for a and b?

yes, sum of a for all dates and id's

What, like this? You should specify the example better in your question so that people can answer it.

library(tidyverse)
library(lubridate)
NewData <- tibble::tribble(
  ~gvkey,        ~date,          ~a,          ~b,
  1L, "01/01/2000",    15624000, 10011531920,
  1L, "02/01/2000",    18480000, 10592724720,
  1L, "03/01/2000",    16968000, 10160611743,
  2L, "01/01/2000",  9375333997, 10011531920,
  2L, "02/01/2000",  9943536057, 10592724720,
  2L, "03/01/2000",  9470034340, 10160611743,
  3L, "01/01/2000", 441665123.3, 10011531920,
  3L, "02/01/2000", 412041243.1, 10592724720,
  3L, "03/01/2000", 417427403.1, 10160611743
) %>% 
  mutate(date = dmy(date))

NewData %>% 
  group_by(date) %>% 
  summarise(a = sum(a), b = sum(b))
# A tibble: 3 x 3
  date                  a           b
  <chr>             <dbl>       <dbl>
1 01/01/2000  9832623120. 30034595760
2 02/01/2000 10374057300. 31778174160
3 03/01/2000  9904429743. 30481835229

NO, ok let me try to make it more clear, sorry for messy writing.
Now i want to have sum of "a" based on date and id and "b" should be output as shown in my data set. So in "a" i am trying to do sum of the first id
i.e,

~gvkey,        ~date,          ~a,          
            1L, "01/01/2000",    15624000, 

and 2nd id of same date

~gvkey,        ~date,          ~a,        
                       2L, "01/01/2000",  9375333997,

and out will be stored in b as answer

~gvkey,        ~date,          ~a,          ~b,
            1L, "01/01/2000",    15624000, 10011531920,
             2L, "01/01/2000",  9375333997, 10011531920,

and so on for all dates and id's.

It would something like this, but you don't really have enough data for it to look useful because it isn't summarising anything. I have changed the format of the date field in the code above as well.

NewData %>% 
  group_by(gvkey, date) %>% 
  summarise(a = sum(a), b = sum(b)) %>% 
  arrange(date, gvkey)

# A tibble: 9 x 4
# Groups:   gvkey [3]
  gvkey date                 a           b
  <int> <date>           <dbl>       <dbl>
1     1 2000-01-01   15624000  10011531920
2     2 2000-01-01 9375333997  10011531920
3     3 2000-01-01  441665123. 10011531920
4     1 2000-01-02   18480000  10592724720
5     2 2000-01-02 9943536057  10592724720
6     3 2000-01-02  412041243. 10592724720
7     1 2000-01-03   16968000  10160611743
8     2 2000-01-03 9470034340  10160611743
9     3 2000-01-03  417427403. 10160611743

ok let me give me your real data:

gvkey datadate    cshoc prccd
  <dbl>    <dbl>    <dbl> <dbl>
1  7041 20000103 42000000  3.86
2  7041 20000104 42000000  4.28
3  7041 20000105 42000000  4   
4  7041 20000106 42000000  4.04
5  7041 20000111 42000000  4   
6  7041 20000112 42000000  3.96

now I want to add a new column with calculation as

b<-NewData%>% 
  mutate(firmMarket=NewData$cshoc*NewData$prccd)%>%
  group_by(gvkey,datadate)%>%
  mutate(TotalMarket=sum(firmMarket),na.rm=TRUE)%>%
  mutate(weightage=firmMarket/TotalMarket)%>%
  mutate(PriceWeightage=weightage*prccd)%>%
  group_by(gvkey,datadate)%>%
  mutate(MarketIndex=sum(PriceWeightage,na.rm = TRUE))

This is my output

 gvkey datadate    cshoc prccd firmMarket TotalMarket na.rm weightage PriceWeightage MarketIndex
  <dbl>    <dbl>    <dbl> <dbl>      <dbl>       <dbl> <lgl>     <dbl>          <dbl>       <dbl>
1  7041 20000103 42000000  3.86  162120000          NA TRUE         NA             NA           0
2  7041 20000104 42000000  4.28  179760000          NA TRUE         NA             NA           0
3  7041 20000105 42000000  4     168000000          NA TRUE         NA             NA           0
4  7041 20000106 42000000  4.04  169680000          NA TRUE         NA             NA           0
5  7041 20000111 42000000  4     168000000          NA TRUE         NA             NA           0
6  7041 20000112 42000000  3.96  166320000          NA TRUE         NA             NA           0

Please help me in getting values. calculation are as follows:

TotalMarket at t = Sum of firmMarket at t,I
Weightage at t,i = firmMarket /TotalMarket
PriceWightage = pprccd*Weightage

Again, please provide a reproducible example:

Use dput() or reprex()

1 Like

na.rm is a param for sum not for mutate.

Thanks for your input.
Here is a new output

gvkey datadate    cshoc prccd firmMarket TotalMarket    weightage PriceWeightage MarketIndex
  <dbl>    <dbl>    <dbl> <dbl>      <dbl>       <dbl>        <dbl>          <dbl>       <dbl>
1  7041 20000103 42000000  3.86  162120000     4.33e15 0.0000000375    0.000000145        17.7
2  7041 20000104 42000000  4.28  179760000     4.33e15 0.0000000415    0.000000178        17.7
3  7041 20000105 42000000  4     168000000     4.33e15 0.0000000388    0.000000155        17.7
4  7041 20000106 42000000  4.04  169680000     4.33e15 0.0000000392    0.000000158        17.7
5  7041 20000111 42000000  4     168000000     4.33e15 0.0000000388    0.000000155        17.7
6  7041 20000112 42000000  3.96  166320000     4.33e15 0.0000000384    0.000000152        17.7

but now issue is, value for TotalMarket is same for all . That should not be same. Kindly help.

Did you omit grouping by ?

No is used grouped by

b<-NewData%>% 
  mutate(firmMarket=NewData$cshoc*NewData$prccd)%>%
  group_by(gvkey,datadate)%>%
  mutate(TotalMarket=sum(firmMarket,na.rm=TRUE))%>%
  mutate(weightage=firmMarket/TotalMarket)%>%
  mutate(PriceWeightage=weightage*prccd)%>%
  group_by(gvkey,datadate)%>%
  mutate(MarketIndex=sum(PriceWeightage,na.rm = TRUE))

And does this work?

mpg %>%
select(model,hwy) %>%
 group_by(model) %>%
 mutate(sum_hwy_by_model = sum(hwy,na.rm=TRUE))
# A tibble: 234 x 3
# Groups:   model [38]
   model        hwy sum_hwy_by_model
   <chr>      <int>           <int>
 1 a4            29             198
 2 a4            29             198
 3 a4            31             198
 4 a4            30             198
 5 a4            26             198
 6 a4            26             198
 7 a4            27             198
 8 a4 quattro    26             206
 9 a4 quattro    25             206
10 a4 quattro    28             206
...

its giving same output.

How about you give me some of NewData, and we see if grouped sums can be calculated.?

Ok, NewData look like this with many more gvkeys, these are in the monthly data set, but have a daily data set. Along with that i have a very big data set I tried to do dout() it took a long process to compile and too much information to be displayed here.

gvkey datadate cshoc prccd
7041 31/1/2000 42000000 3.72
7041 29/2/2000 42000000 4.4
7041 31/3/2000 42000000 4.04
7041 28/4/2000 42000000 3.44
7041 31/5/2000 42000000 3.24
15632 31/1/2000 2367508585 3.96
15632 29/2/2000 2367508585 4.2
15632 31/3/2000 2367508585 4
15632 28/4/2000 2367508585 3.96
15632 31/5/2000 2367508585 3.82
203244 31/1/2000 269308002 1.64
203244 29/2/2000 269308002 1.53
203244 31/3/2000 269308002 1.55
203244 28/4/2000 269308002 1.45
203244 29/5/2000 269308002 1.42
212452 31/1/2000 17095000 2.24
212452 29/2/2000 17107000 3.06
212452 31/3/2000 17245000 6
212452 28/4/2000 17356000 7.5
212452 31/5/2000 17515000 10.9
now based on this i need to create new column named "firmMarket", "TotalMarket", "weightage", "PriceWeigtage". So the major problem is in calculating "TotalMarket". Kindly refer to the image i am uploading in this sum is required based on time and id.

Kindly help in solving this problem. Thanks in advance.

Hello,
I'm sure you shared this image with the best intentions, but perhaps you didnt realise what it implies.
If someone wished to use example data to test code against, they would type it out from your screenshot...

This is very unlikely to happen, and so it reduces the likelihood you will receive the help you desire.
Therefore please see this guide on how to reprex data. Key to this is use of either datapasta, or dput() to share your data as code

gvkey datadate cshoc prccd
7041 31/1/2000 42000000 3.72
7041 29/2/2000 42000000 4.4
7041 31/3/2000 42000000 4.04
7041 28/4/2000 42000000 3.44
7041 31/5/2000 42000000 3.24
15632 31/1/2000 2367508585 3.96
15632 29/2/2000 2367508585 4.2
15632 31/3/2000 2367508585 4
15632 28/4/2000 2367508585 3.96
15632 31/5/2000 2367508585 3.82
203244 31/1/2000 269308002 1.64
203244 29/2/2000 269308002 1.53
203244 31/3/2000 269308002 1.55
203244 28/4/2000 269308002 1.45
203244 29/5/2000 269308002 1.42
212452 31/1/2000 17095000 2.24
212452 29/2/2000 17107000 3.06
212452 31/3/2000 17245000 6
212452 28/4/2000 17356000 7.5
212452 31/5/2000 17515000 10.9

Sir, this is the data set. Image i just used to show how the sum is done in TotalMarket. You can use this data it is not in image formate.
Thanks