Deleting Specific columns with NA values in financial time series data

polymath548 · January 10, 2020, 6:23am

Hi,
I have a time series data on 10 different countries and each country has 10 different variables. I need a way to find out to pick up the data for each country only if they have values for all 10 variables in each period. If any country has NA value for any variable then I should ignore data related to it and pick up the values as soon as it has all the data for each variable.

Following is the example:
Date us_gdp Uk_gdp jp_gdp us_inf uk_inf jp_inf us_bond uk_bond jp_bond
Jan-78 1 NA NA 2 3 NA 3 4 1
Feb-78 2 1 NA 2 1 2 3 1 1
Mar-78 3 2 1 2 2 4 2 1 1

Here, for Jan-78 I should be able to collect only US data and in Feb-78 I should have data for US and UK while in Mar-78, All 3 countries data should be collected.

I hope I am making sense.

I am struggling to figure out a way to even get started. If someone can give a hint on how to proceed, it would be of great help.

Solarion · January 10, 2020, 8:51am

it is usually better if you provide a data table in a way that lets u quickly generate it ourselves, in your case that would be:

df <- data.frame(
  us_gdp = c(1,2,3),
  uk_gdp = c(NA,1,2),
  jp_gdp = c(NA,NA,2),
  us_inf = c(2,2,2),
  uk_inf = c(3,1,2),
  jp_inf = c(NA,2,4),
  us_bond = c(3,4,1),
  uk_bond = c(3,1,1),
  jp_bond = c(2,1,1),
  row.names = c("Jan-78","Feb-78","Mar-78")
)

not sure exactly what output you want, but I came up with this:

make a vector for each country, just stating the column-numbers for each of its entries (e.g. for us that would be: c(1,4,7). It is easiest to use the seq() function. Make sure to set the by-value to the number of countries you have in total, the to-value to the total number of columns you have, and the from-value is individual for each country.

us <- seq(from=1, to=9, by=3)
uk <- seq(from=2, to=9, by=3)
jp <- seq(from=3, to=9, by=3)

isolate all columns for a certain country from the data.frame, then use the na.omit() function to throw out any rows that contain an NA in any column

df_us <- na.omit(df[,us])
df_uk <- na.omit(df[,uk])
df_jp <- na.omit(df[,jp])

this will give you a data.frame for each country, containing all columns for that country, but only the rows where there is no NA.

polymath548 · January 10, 2020, 9:29am

Thanks Solarion. I am new to the platform and R programming. My apologies for not posting the data table.

"not sure exactly what output you want, but I came up with this:"

The output you posted is really helpful, I am actually trying to backtest a momentum strategy based on macroeconomic factors. I just want to test the strategy for countries which have all the required data. Since I do not want to delete entire rows with NA values, I did not know how to proceed. But thanks to your post, now I can proceed with the work.

"isolate all columns for a certain country from the data.frame, then use the na.omit() function to throw out any rows that contain an NA in any column"

Since one of the region is euro area, it creates difficulty as European countries do not have separate currency column (they have common currency), one of the option I have is to manually provide column numbers for each country.

navnit_ranjan · January 10, 2020, 9:34am

Something like this should help

 df %>% na.omit()
 
 ´´´

Solarion · January 10, 2020, 10:25am

manually providing the column numbers would work
instead of providing the vector manually c(1,4,7) or seq(1,9,3) you can also use the grep() function, which searches for text in a vector and outputs the indices of the vector that have the text. Take the data.frame df from my previous post. In order to create the indices-vectors (us, uk and jp) you can do this:

us <- grep( "us" , colnames(df) )

if you have a european country and a common european column (e.g. it for italy and eu for europe), you can do this:

it <- grep( "it|eu" , colnames(df) )

The | between it and eu stands as an "or", so any column-name with an "it" or an "eu" in it will be found. If you want to make certain that you do not pick a column by accident because of the "gdp", "inf", "bond", etc, where the "it" or "eu" could also be found, you can add the ^ to the searching text, which indicates the beginning, so:

it <- grep( "^it|^eu" , colnames(df) )

this will be saver, as it only looks for "it" and "eu" at the very beginning of the string

polymath548 · January 10, 2020, 10:45am

Thank you so much Solarion.

Yeah, doing it manually is not a good idea so I figured a way (using stackoverflow) to do it using stringr package function str_detect()

#creating country wise data frames
df_us <- alldata[,str_detect(names(alldata), 'us_')]
df_cn <- alldata[,str_detect(names(alldata), 'cn_|ca_')]
df_uk <- alldata[,str_detect(names(alldata), 'uk')]
df_bd <- alldata[,str_detect(names(alldata), 'bd_')]
df_fr <- alldata[,str_detect(names(alldata), 'fr_')]
df_it <- alldata[,str_detect(names(alldata), 'it_')]
df_es <- alldata[,str_detect(names(alldata), 'es')]
df_sw <- alldata[,str_detect(names(alldata), 'sw_')]
df_sd <- alldata[,str_detect(names(alldata), 'sd_')]
df_au <- alldata[,str_detect(names(alldata), 'au_')]
df_nz <- alldata[,str_detect(names(alldata), 'nz_')]
df_jp <- alldata[,str_detect(names(alldata), 'jp_')]
df_hk <- alldata[,str_detect(names(alldata), 'hk_')]

#Deleting NA values from each country list
df_us <- na.omit(df_us)
df_cn <- na.omit(df_cn)
df_uk <- na.omit(df_uk)
df_bd <- na.omit(df_bd)
df_fr <- na.omit(df_fr)
df_it <- na.omit(df_it)
df_es <- na.omit(df_es)
df_sw <- na.omit(df_sw)
df_sd <- na.omit(df_sd)
df_au <- na.omit(df_au)
df_nz <- na.omit(df_nz)
df_jp <- na.omit(df_jp)
df_hk <- na.omit(df_hk)

I extracted data similarly for other countries. I still somehow feel that the code is still not efficient. By that I mean there's lot of repetition in it.

thanks for your help

Solarion · January 10, 2020, 11:10am

if you want to improve the code, make a function which will do everything and only needs a string as an input:

isolate <- function(countries){
  na.omit( alldata[,stringr::str_detect(names(alldata), countries)] )
}

df_us <- isolate("us_")
df_sw <- isolate("sw_")
#etc.

polymath548 · January 10, 2020, 11:30am

Exactly what I was looking for. Thank you so much

system · January 31, 2020, 11:30am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.