Summation of row based on different id and same date

Did you read the article? It has been posted three times in the thread, yet you still haven't done it.

`datapasta::df_paste(head(NewData, 10)[, c('gvkey', 'datadate','cshoc','prccd')])`

 data.frame(
        gvkey = c(7041, 7041, 7041, 7041, 7041, 7041, 7041, 7041, 7041, 7041),
     datadate = c(20000103,20000104,20000105,20000106,
                  20000111,20000112,20000113,20000114,20000117,20000118),
        cshoc = c(4.2e+07,4.2e+07,4.2e+07,4.2e+07,4.2e+07,
                  4.2e+07,4.2e+07,4.2e+07,4.2e+07,4.2e+07),
        prccd = c(3.86, 4.28, 4, 4.04, 4, 3.96, 3.92, 3.96, 4.06, 4.14)
 )

Is this a good example?.. they are all the same gvkey, and all entries have unique datadates ?

NewData<- data.frame(
  gvkey = c(7041, 7041, 7041, 7041, 7041, 7041, 7041, 7041, 7041, 7041),
  datadate = c(20000103,20000104,20000105,20000106,
               20000111,20000112,20000113,20000114,20000117,20000118),
  cshoc = c(4.2e+07,4.2e+07,4.2e+07,4.2e+07,4.2e+07,
            4.2e+07,4.2e+07,4.2e+07,4.2e+07,4.2e+07),
  prccd = c(3.86, 4.28, 4, 4.04, 4, 3.96, 3.92, 3.96, 4.06, 4.14)
)

mutate(NewData,
       firmMarket=cshoc*prccd)%>%
  group_by(gvkey,datadate)%>%
  mutate(TotalMarket=sum(firmMarket,na.rm=TRUE))%>%
  mutate(weightage=firmMarket/TotalMarket)
# A tibble: 10 x 7
# Groups:   gvkey, datadate [10]
   gvkey datadate    cshoc prccd firmMarket TotalMarket weightage
   <dbl>    <dbl>    <dbl> <dbl>      <dbl>       <dbl>     <dbl>
 1  7041 20000103 42000000  3.86 162120000   162120000          1
 2  7041 20000104 42000000  4.28 179760000   179760000          1
 3  7041 20000105 42000000  4    168000000   168000000          1
 4  7041 20000106 42000000  4.04 169680000   169680000          1
 5  7041 20000111 42000000  4    168000000   168000000          1
 6  7041 20000112 42000000  3.96 166320000   166320000          1
 7  7041 20000113 42000000  3.92 164640000   164640000          1
 8  7041 20000114 42000000  3.96 166320000   166320000          1
 9  7041 20000117 42000000  4.06 170520000.  170520000.         1
10  7041 20000118 42000000  4.14 173880000   173880000          1

There are more than 1000 gvkeys with different entries in the data set and have unique datadate. But in this example, if you see the value for TotlaMaket is the same as firmMarket. And when I run your code

mutate(NewData,
       firmMarket=cshoc*prccd)%>%
  group_by(gvkey,datadate)%>%
  mutate(TotalMarket=sum(firmMarket,na.rm=TRUE))%>%
  mutate(weightage=firmMarket/TotalMarket)%>%
  mutate(PriceWeightage=weightage*prccd)%>%
  group_by(gvkey,datadate)%>%
  mutate(MarketIndex=sum(PriceWeightage,na.rm = TRUE))

I get this output

# A tibble: 1,048,575 x 9
# Groups:   gvkey, datadate [1,031,402]
   gvkey datadate    cshoc prccd firmMarket TotalMarket    weightage PriceWeightage MarketIndex
   <dbl>    <dbl>    <dbl> <dbl>      <dbl>       <dbl>        <dbl>          <dbl>       <dbl>
 1  7041 20000103 42000000  3.86 162120000      4.33e15 0.0000000375    0.000000145        17.7
 2  7041 20000104 42000000  4.28 179760000      4.33e15 0.0000000415    0.000000178        17.7
 3  7041 20000105 42000000  4    168000000      4.33e15 0.0000000388    0.000000155        17.7
 4  7041 20000106 42000000  4.04 169680000      4.33e15 0.0000000392    0.000000158        17.7
 5  7041 20000111 42000000  4    168000000      4.33e15 0.0000000388    0.000000155        17.7
 6  7041 20000112 42000000  3.96 166320000      4.33e15 0.0000000384    0.000000152        17.7
 7  7041 20000113 42000000  3.92 164640000      4.33e15 0.0000000380    0.000000149        17.7
 8  7041 20000114 42000000  3.96 166320000      4.33e15 0.0000000384    0.000000152        17.7
 9  7041 20000117 42000000  4.06 170520000.     4.33e15 0.0000000394    0.000000160        17.7
10  7041 20000118 42000000  4.14 173880000      4.33e15 0.0000000402    0.000000166        17.7
# ... with 1,048,565 more rows

I don't know what is missing to calculate TotalMarket values.

totalmarket is the same as full markey because its a summation by the groups gvkey AND datadate.
to not use datadate to split the groups simply remove it

mutate(NewData,
       firmMarket=cshoc*prccd)%>%
  group_by(gvkey)%>%
  mutate(TotalMarket=sum(firmMarket,na.rm=TRUE))%>%
  mutate(weightage=firmMarket/TotalMarket)
# A tibble: 10 x 7
# Groups:   gvkey [1]
   gvkey datadate    cshoc prccd firmMarket TotalMarket weightage
   <dbl>    <dbl>    <dbl> <dbl>      <dbl>       <dbl>     <dbl>
 1  7041 20000103 42000000  3.86 162120000   1689240000    0.0960
 2  7041 20000104 42000000  4.28 179760000   1689240000    0.106 
 3  7041 20000105 42000000  4    168000000   1689240000    0.0995
 4  7041 20000106 42000000  4.04 169680000   1689240000    0.100 
 5  7041 20000111 42000000  4    168000000   1689240000    0.0995
 6  7041 20000112 42000000  3.96 166320000   1689240000    0.0985
 7  7041 20000113 42000000  3.92 164640000   1689240000    0.0975
 8  7041 20000114 42000000  3.96 166320000   1689240000    0.0985
 9  7041 20000117 42000000  4.06 170520000.  1689240000    0.101 
10  7041 20000118 42000000  4.14 173880000   1689240000    0.103

To make TotalMarket values more clear, please see the image. In this, we have different gvkeys with datadates and all data will have sum based on datadate. Also if you see a different gvkey with the same date have the same values. So I want TotlaMarket value should be the same for the same date for different gvkeys.

I hope I am able to make my question clear.

Thanks

you are saying you want the total to be the sum of firm values by date (not by gvkey)

mutate(NewData,
       firmMarket=cshoc*prccd)%>%
  group_by(datadate)%>%
  mutate(TotalMarket=sum(firmMarket,na.rm=TRUE))%>%
  mutate(weightage=firmMarket/TotalMarket)

No no, let see this way ,
if i do sum of firmMarket values based on gvkey and datadate and store them in TotlaMarket. So for all gvkey and datadate all values will be stored in TotalMarket and value will be same for same date with different gvkey.

This contradicts different gvkeys having the same TotalMarkey values on matching days...

Yes, want to keep same values for different gvkeys for matching days. Is it possible to do.!!!

Thanks

yes, I already showed you, but it seems you didnt understand

if is use your mentioned code,
this is my output

# A tibble: 1,048,575 x 7
# Groups:   gvkey [288]
   gvkey datadate    cshoc prccd firmMarket TotalMarket    weightage
   <dbl>    <dbl>    <dbl> <dbl>      <dbl>       <dbl>        <dbl>
 1  7041 20000103 42000000  3.86 162120000      4.33e15 0.0000000375
 2  7041 20000104 42000000  4.28 179760000      4.33e15 0.0000000415
 3  7041 20000105 42000000  4    168000000      4.33e15 0.0000000388
 4  7041 20000106 42000000  4.04 169680000      4.33e15 0.0000000392
 5  7041 20000111 42000000  4    168000000      4.33e15 0.0000000388
 6  7041 20000112 42000000  3.96 166320000      4.33e15 0.0000000384
 7  7041 20000113 42000000  3.92 164640000      4.33e15 0.0000000380
 8  7041 20000114 42000000  3.96 166320000      4.33e15 0.0000000384
 9  7041 20000117 42000000  4.06 170520000.     4.33e15 0.0000000394
10  7041 20000118 42000000  4.14 173880000      4.33e15 0.0000000402
# ... with 1,048,565 more rows

i am not understanding why TotlaMarket values are coming in "4.33e15" for all. But as per your output, it should be "168924000".

You refer now to code I posted before your statements about total market being across gvkey by grouping datadate only, I provided different code after those statements. It is those codes which would show the same TotalMarket value for every matching datadate across varying gvkeys... (yet you complained about them also, in a way that eludes my understanding)

Can you please copy paste new code here.. a lot of confusion. Sorry for that and my output is not coming same as you old code too.

NewData<- data.frame(
  gvkey = c(7041, 7041, 7041, 7041, 7041, 7041, 7041, 7041, 7041, 7041),
  datadate = c(20000103,20000104,20000105,20000106,
               20000111,20000112,20000113,20000114,20000117,20000118),
  cshoc = c(4.2e+07,4.2e+07,4.2e+07,4.2e+07,4.2e+07,
            4.2e+07,4.2e+07,4.2e+07,4.2e+07,4.2e+07),
  prccd = c(3.86, 4.28, 4, 4.04, 4, 3.96, 3.92, 3.96, 4.06, 4.14)
)

mutate(NewData,
       firmMarket=cshoc*prccd)%>%
  group_by(datadate)%>%
  mutate(TotalMarket=sum(firmMarket,na.rm=TRUE))%>%
  mutate(weightage=firmMarket/TotalMarket)
# A tibble: 10 x 7
# Groups:   datadate [10]
   gvkey datadate    cshoc prccd firmMarket TotalMarket weightage
   <dbl>    <dbl>    <dbl> <dbl>      <dbl>       <dbl>     <dbl>
 1  7041 20000103 42000000  3.86 162120000   162120000          1
 2  7041 20000104 42000000  4.28 179760000   179760000          1
 3  7041 20000105 42000000  4    168000000   168000000          1
 4  7041 20000106 42000000  4.04 169680000   169680000          1
 5  7041 20000111 42000000  4    168000000   168000000          1
 6  7041 20000112 42000000  3.96 166320000   166320000          1
 7  7041 20000113 42000000  3.92 164640000   164640000          1
 8  7041 20000114 42000000  3.96 166320000   166320000          1
 9  7041 20000117 42000000  4.06 170520000.  170520000.         1
10  7041 20000118 42000000  4.14 173880000   173880000          1
1 Like

Ok great, thanks i understand, but now when i use your code, i am getting some different output.

# Groups:   gvkey, datadate [1,031,402]
   gvkey datadate    cshoc prccd firmMarket TotalMarket    weightage PriceWeightage MarketIndex
   <dbl>    <dbl>    <dbl> <dbl>      <dbl>       <dbl>        <dbl>          <dbl>       <dbl>
 1  7041 20000103 42000000  3.86 162120000      4.33e15 0.0000000375    0.000000145        17.7
 2  7041 20000104 42000000  4.28 179760000      4.33e15 0.0000000415    0.000000178        17.7
 3  7041 20000105 42000000  4    168000000      4.33e15 0.0000000388    0.000000155        17.7
 4  7041 20000106 42000000  4.04 169680000      4.33e15 0.0000000392    0.000000158        17.7
 5  7041 20000111 42000000  4    168000000      4.33e15 0.0000000388    0.000000155        17.7
 6  7041 20000112 42000000  3.96 166320000      4.33e15 0.0000000384    0.000000152        17.7
 7  7041 20000113 42000000  3.92 164640000      4.33e15 0.0000000380    0.000000149        17.7
 8  7041 20000114 42000000  3.96 166320000      4.33e15 0.0000000384    0.000000152        17.7
 9  7041 20000117 42000000  4.06 170520000.     4.33e15 0.0000000394    0.000000160        17.7
10  7041 20000118 42000000  4.14 173880000      4.33e15 0.0000000402    0.000000166        17.7
# ... with 1,048,565 more rows

How I can solve this issue. My TotalMarket value is coming different. I hope this will be my last question :wink:

my totalmarket is based on 10 rows, yours on 1million, therefore you should not expect it to be literally the same number...

I cant comment on whether you have done things correctly or not, because there are no other gvkeys shown that would indicate whether totalmarket matches from them across datadates....

However, it does seem like you have simply not implemented my suggestion, my code fragment, involves

# Groups:   datadate [10]

yours

gvkey, datadate [1,031,402]

perhaps your wrote more code after my fragment to set different group statements, perhaps you didnt, I can't know. I can only see that it differs.

1 Like

Thanks for your input.