Demeaning / Mean-Centering of certain values only

rstudio
statistics

#1

Hi everyone,

I looked for an answer but wasn't able to find one, so I hope this is a topic, which hasn't been touched upon yet.

I have a xlsx panel data file from CRSP for monthly stock returns for all companies in the NYSE from 1963 until 2017. Those are the columns:

Month (in DD/MM/YYYY) | Company Code | Company Name | Industry Code | Return

e.g. 31/03/1981 | AAPL | Apple Inc. | 12345 | -0.07512

Now I want to calculate the demeaning returns in an extra column, i.e. Return(Apple,March1981) MINUS Mean-Return(Apple,whole sample period).
I was trying with the scale function, but obviously, it demeans the return for the mean of all returns, not only apple specific. How do I code that it should only take the apple returns into the mean calucations?

Thank you in advance and have a good day!

Edit: deleted the screenshot, better example provided further below.


#2
require(tidyverse)
your_data %>% 
  tbl_df() %>% 
  group_by(ticker_symbol) %>%
  mutate(avg_return = mean(return)) %>%
  mutate(demeaned_return = return - avg_return)

Like that?


#3

It doesn't work I think because there are also non-numeric values in the column, sometimes there are blanks or letters like "A" or "B" in the column
I tried it with:
dplyr::select_if(Returns, is.numeric)

and

m <- mean(nums, na.rm = T)
nums[is.na(nums)] <- m
Returns_new.x <- nums```

But it did not work.

#4

Since I see that you've attached the screenshot, I assume that your data is not sensitive. For that reason I would ask you to put your question into reprex:

But to perhaps answer your question. You can always try to convert columns that you need to numeric with as.numeric command. It will replace all letters and such to NA's. You can then try the approach proposed by @Stephen.

And just to clarify a bit -- dplyr::select_if works on columns, not on rows. You can work on rows with, e.g., combination of dplyr::mutate and dplyr::case_when


#5

Hi,

I just tried to do it, but reprex does not work for me? It gives back the error:

No input provided and clipboard is not available.
Unable to put result on the clipboard. How to get it:

  • Capture what reprex() returns.
  • Use outfile = "foo" to request output in specific file.

So if I try @Stephen 's solution, I need to transform the column into numbers and NAs first and then try his code?


#6

Can you try installing reprex with

devtools::install_github("tidyverse/reprex")

and trying again? There is a problem with it at the moment. Also, don't forget to actually copy the input that you want to reprex with ctrl+c or cmd+c on mac.

To your question - what happens if you try to convert columns that you need to numeric with the approach I've suggested? Does it produce the output you need? If yes, great :slight_smile:. If not, let's see what goes wrong.


#7

You'll need to replace your non-numeric values, then run the as.numeric() function on the column, as in your_data %>% mutate(returns_clean = as.numeric(Returns)). That will produce errors, depending on what values you have in the column that aren't able to be converted. So...

  1. Replace values that can't be converted to numeric (letters, and the blanks if the blanks are supposed to be 0 and not NA).
  2. Convert the column to numeric with mutate.
  3. Group by symbol and use mutate create the means (as in my earlier example).
  4. Use mutate to create the demeaned variable.

Without seeing the error message, we can't tell you how to do step 1.

A note on your earlier comment:

select_if() chooses columns, not rows, so it will completely skip over the column you care about if it's not already numeric.


#8
  1. devtools::install_github("tidyverse/reprex")
# load data manually or
CRSP_data63 <- read_xlsx("datapath/(MOD) CRSP - Standard63.xlsx")
#> Error in read_xlsx("datapath/(MOD) CRSP - Standard63.xlsx"): konnte Funktion "read_xlsx" nicht finden
testdata1 <- CRSP_data63
#> Error in eval(expr, envir, enclos): Objekt 'CRSP_data63' nicht gefunden
attach(testdata1)
#> Error in attach(testdata1): Objekt 'testdata1' nicht gefunden

# Demeaning returns,

require(tidyverse)
#> Lade nötiges Paket: tidyverse
#> Warning: Paket 'tidyverse' wurde unter R Version 3.4.4 erstellt
#> Warning: Paket 'ggplot2' wurde unter R Version 3.4.4 erstellt
#> Warning: Paket 'tibble' wurde unter R Version 3.4.4 erstellt
#> Warning: Paket 'tidyr' wurde unter R Version 3.4.4 erstellt
#> Warning: Paket 'readr' wurde unter R Version 3.4.4 erstellt
#> Warning: Paket 'purrr' wurde unter R Version 3.4.4 erstellt
#> Warning: Paket 'stringr' wurde unter R Version 3.4.4 erstellt
#> Warning: Paket 'forcats' wurde unter R Version 3.4.4 erstellt

testdata1 %>% 
  tbl_df() %>% 
  group_by("Ticker Symbol") %>%
  mutate(avg_return = mean(Returns_new.x)) %>%
  mutate(demeaned_return = Returns - avg_return)
#> Error in eval(lhs, parent, parent): Objekt 'testdata1' nicht gefunden

Created on 2018-05-18 by the reprex package (v0.2.0).

Not sure why it says, that it cannot load the data (it's German somehow), but I have the data in my global environment.

  1. testdata1 %>% mutate(returns_clean = as.numeric(Returns))
> testdata1 %>% mutate(returns_clean = as.numeric(Returns))
# A tibble: 1,048,575 x 23
   PERMNO `Names Date` `Share Code` `Exchange Code` `Standard Indust~ CUSIP `Ticker Symbol` `Company Name` `Share Class`
    <dbl> <chr>               <dbl>           <dbl>             <dbl> <chr> <chr>           <chr>          <chr>        
 1  10000 31/12/1985             NA              NA                NA NA    NA              NA             NA           
 2  10000 31/01/1986             10               3              3990 6839~ OMFGA           OPTIMUM MANUF~ A            
 3  10000 28/02/1986             10               3              3990 6839~ OMFGA           OPTIMUM MANUF~ A            
 4  10000 31/03/1986             10               3              3990 6839~ OMFGA           OPTIMUM MANUF~ A            
 5  10000 30/04/1986             10               3              3990 6839~ OMFGA           OPTIMUM MANUF~ A            
 6  10000 30/05/1986             10               3              3990 6839~ OMFGA           OPTIMUM MANUF~ A            
 7  10000 30/06/1986             10               3              3990 6839~ OMFGA           OPTIMUM MANUF~ A            
 8  10000 31/07/1986             10               3              3990 6839~ OMFGA           OPTIMUM MANUF~ A            
 9  10000 29/08/1986             10               3              3990 6839~ OMFGA           OPTIMUM MANUF~ A            
10  10000 30/09/1986             10               3              3990 6839~ OMFGA           OPTIMUM MANUF~ A            
# ... with 1,048,565 more rows, and 14 more variables: `Primary Exchange` <chr>, `Trading Status` <chr>, `Security
#   Status` <chr>, PERMCO <dbl>, `CUSIP Header` <chr>, `Date of Delisting Payment` <chr>, `Delisting Code` <dbl>,
#   `Delisting Return Without Dividends` <dbl>, `Delisting Return` <dbl>, `Price or Bid/Ask Average` <dbl>,
#   Returns <chr>, `Returns without Dividends` <chr>, `Return adj.` <dbl>, returns_clean <dbl>
Warning message:
In evalq(as.numeric(Returns), <environment>) : NAs introduced by coercion

#9

As you can see from the errors, data that you are working with is only available for you. The point of reprex is to provide people that want to help you with all the data that you are using. If it is only available on your computer then I can't really answer your question since I don't have the data you are trying to work with.

For example, take a look at this question:

As you can see author actually creates a fake dataset using a pre-defined function and then gives the output that they need (note the use of structure to simulate output). This way I can just copy-paste the input and output, find a solution and be sure that this solution is what you need.

It does seem excessive at first since you need to jump through so many seemingly unnecessary hoops, but in the long run it is for sure a skill that will pay off, so I can highly encourage you to try it here.

There is more help available about how to create fake datasets or even using your dataset as an input here:

Quick edit: there is also a possibly useful post explaining the idea about reprex using multiple analogies:


#10

I hope this format is lso okay, since I am not so proficient with R and I would love to understand how that methodology works, but I'm a bit under pressure timewise.

test1 <- matrix(c("AAPL", "31/03/1998", "C",
                 "AAPL", "30/04/1998", 0.2,
                 "AAPL", "31/05/1998", 0.15,
                 "INTL", "31/03/1998", 0.05,
                 "INTL", "30/04/1998", "",
                 "INTL", "31/05/1998", -0.3,
                 "GOOG", "31/03/1998", 0.1,
                 "AAPL", "30/04/1998", "B",
                 "AAPL", "31/05/1998", 0.4
                 ),
ncol=3,byrow=TRUE)
colnames(test1) <- c("Ticker Symbol","Date","Returns")
smoke <- as.table(test1)
test1
#>       Ticker Symbol Date         Returns
#>  [1,] "AAPL"        "31/03/1998" "C"    
#>  [2,] "AAPL"        "30/04/1998" "0.2"  
#>  [3,] "AAPL"        "31/05/1998" "0.15" 
#>  [4,] "INTL"        "31/03/1998" "0.05" 
#>  [5,] "INTL"        "30/04/1998" ""     
#>  [6,] "INTL"        "31/05/1998" "-0.3" 
#>  [7,] "GOOG"        "31/03/1998" "0.1"  
#>  [8,] "AAPL"        "30/04/1998" "B"    
#>  [9,] "AAPL"        "31/05/1998" "0.4"

Created on 2018-05-18 by the reprex package (v0.2.0).

Edited for the letters in returns


#11

Thanks for the effort to make it work!

I've taken the code from @Stephen and only added one more modification where I convert returns to numeric. Does it do what you want in the end?

library(tidyverse)
test1 <- matrix(c("AAPL", "31/03/1998", "C",
                  "AAPL", "30/04/1998", 0.2,
                  "AAPL", "31/05/1998", 0.15,
                  "INTL", "31/03/1998", 0.05,
                  "INTL", "30/04/1998", "",
                  "INTL", "31/05/1998", -0.3,
                  "GOOG", "31/03/1998", 0.1,
                  "AAPL", "30/04/1998", "B",
                  "AAPL", "31/05/1998", 0.4
),
ncol=3,byrow=TRUE)
colnames(test1) <- c("ticker_symbol","Date","return")
test1_tbl <- tibble::as_tibble(test1)


test1_tbl %>% 
  group_by(ticker_symbol) %>%
  mutate(return = as.numeric(return)) %>%
  mutate(avg_return = mean(return, na.rm = TRUE)) %>%
  mutate(demeaned_return = return - avg_return)
#> Warning in evalq(as.numeric(return), <environment>): NAs introduced by
#> coercion
#> # A tibble: 9 x 5
#> # Groups: ticker_symbol [3]
#>   ticker_symbol Date         return avg_return demeaned_return
#>   <chr>         <chr>         <dbl>      <dbl>           <dbl>
#> 1 AAPL          31/03/1998  NA           0.250         NA     
#> 2 AAPL          30/04/1998   0.200       0.250        - 0.0500
#> 3 AAPL          31/05/1998   0.150       0.250        - 0.100 
#> 4 INTL          31/03/1998   0.0500     -0.125          0.175 
#> 5 INTL          30/04/1998  NA          -0.125         NA     
#> 6 INTL          31/05/1998 - 0.300      -0.125        - 0.175 
#> 7 GOOG          31/03/1998   0.100       0.100          0     
#> 8 AAPL          30/04/1998  NA           0.250         NA     
#> 9 AAPL          31/05/1998   0.400       0.250          0.150

Created on 2018-05-18 by the reprex package (v0.2.0).


#12

Sorry for the late reply. It works now, I just had to assign the test1_tbl back to an object, that did the trick for me!
Thank you for helping me out here.