Interpolation using varying dates

cj_risk · April 28, 2019, 10:51pm

I am looking for a way do linear interpolation between one variable (inv) based the days between another date variable (mth) with the output being a daily time series with interpolated "inv" values. The data series appears to be "quasi" weekly, so the days vary between each observation of "mth", which is proving to be very difficult. I have limited experience wtih R and date calculations, so that's a huge part of my problem. The dataframe is large, but so I created a simple dataframe example below (hopefully a reprex) to illustrate the problem.

library(tidyverse)
library(lubridate)

mth <- c("2019-01-02", "2019-01-09", "2019-01-14", "2019-01-21", "2019-02-03")
mth <- mth %>% as_date(mth)
inv <- c("2450", "2385", "2207", "2115", "1975")
inv <- inv %>% as.numeric(inv)
df <- data.frame(mth, inv)
df
mth inv
1 2019-01-02 2450
2 2019-01-09 2385
3 2019-01-14 2207
4 2019-01-21 2115
5 2019-02-03 1975

I would like the "inv" variable to be interpolated for each day betwwen the "mth" variable creating a daily time series as follows:

mth inv
2019/01/01 2450
2019/01/02 2441
2019/01/03 2450
2019/01/04 2459
2019/01/05 2469
2019/01/06 2478
2019/01/07 2487
2019/01/08 2496
2019/01/09 2385
.
.
.
etc.

Any help you can provide would be greatly appreciated.

CJ

FJCC · April 29, 2019, 12:03am

The code below produces a vector of linearly interpolated values called LinearFit and also a cubic spline interpolation that provides a smoother curve.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

mth <- c("2019-01-02", "2019-01-09", "2019-01-14", "2019-01-21", "2019-02-03")
mth <- as_date(mth)
inv <- c(2450, 2385, 2207, 2115, 1975)
df <- data.frame(mth, inv)
df
#>          mth  inv
#> 1 2019-01-02 2450
#> 2 2019-01-09 2385
#> 3 2019-01-14 2207
#> 4 2019-01-21 2115
#> 5 2019-02-03 1975

ApproxFun <- approxfun(x = df$mth, y = df$inv)
Dates <- seq.Date(ymd("2019-01-02"), ymd("2019-02-03"), by = 1)
LinearFit <- ApproxFun(Dates)
head(LinearFit)
#> [1] 2450.000 2440.714 2431.429 2422.143 2412.857 2403.571

###cubic spline for comparison
SplineFun <- splinefun(x = df$mth, y = df$inv)
SplineFit <- SplineFun(Dates)
#####

plot(df$mth, df$inv, ylim = c(1950, 2500))
lines(Dates, SplineFit, col = "red")
lines(Dates, LinearFit, col = "blue")

^{Created on 2019-04-28 by the reprex package (v0.2.1)}

cj_risk · May 8, 2019, 7:42pm

Thanks for your help. Sorry I'm on steep learning curve here, so excuse my ignorance. I am not familiar with approxfun. It appears to be a matrix with x and y values. I need to do further number crunching on the interpolated values. How do I get the dates and interpolated values into a dataframe object so I can perform further calculations.

FJCC · May 8, 2019, 7:59pm

You can use the data.frame() function to load the vector of dates and the vector of interpolated values into a data frame.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
mth <- c("2019-01-02", "2019-01-09", "2019-01-14", "2019-01-21", "2019-02-03")
mth <- as_date(mth)
inv <- c(2450, 2385, 2207, 2115, 1975)
df <- data.frame(mth, inv)
df
#>          mth  inv
#> 1 2019-01-02 2450
#> 2 2019-01-09 2385
#> 3 2019-01-14 2207
#> 4 2019-01-21 2115
#> 5 2019-02-03 1975

ApproxFun <- approxfun(x = df$mth, y = df$inv)
Dates <- seq.Date(ymd("2019-01-02"), ymd("2019-02-03"), by = 1)
LinearFit <- ApproxFun(Dates)
head(LinearFit)
#> [1] 2450.000 2440.714 2431.429 2422.143 2412.857 2403.571

newDF <- data.frame(Dates = Dates, FitData = LinearFit)
head(newDF)
#>        Dates  FitData
#> 1 2019-01-02 2450.000
#> 2 2019-01-03 2440.714
#> 3 2019-01-04 2431.429
#> 4 2019-01-05 2422.143
#> 5 2019-01-06 2412.857
#> 6 2019-01-07 2403.571

^{Created on 2019-05-08 by the reprex package (v0.2.1)}

cj_risk · May 12, 2019, 6:41pm

Very helpful. I learned a lot. Thanks for sharing.

I now have another related question. I have interpolated the values going back 10 years. I now need to determine the min and max values for each day for the prior 3 years through the dataframe. I can do this with mutate, select days for last 3 years, calculate min and max, etc., but this seems like it's very inefficient. Is there a more efficient approach I should consider? I'm trying to be as efficient as possible because I will be applying this code to much larger dataframes.

Thanks

FJCC · May 12, 2019, 10:21pm

Do I understand correctly that you want to look in a window of time three years wide and calculate the max and min values, then slide the window by one day and repeat?

cj_risk · May 16, 2019, 9:44pm

Sorry for the delay, and for not being clear. No, I wanted to look at the min, max and average for the interpolated values for the same day in the last 5 years. For example, max(2019-01-14, 2018-01-14, 2017-01-14, 2016-01-14, 2015-01-14).

Tks

CJ

FJCC · May 17, 2019, 1:33am

I would start with code like the following. The columns for MinYear and MaxYear are not necessary, but if you will have several data frames like this, it might help you keep track of what you are looking at.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:lubridate':
#> 
#>     intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#make fake data
df <- data.frame(Date = seq.Date(from = ymd("2015-01-01"), to = ymd("2019-12-31"), by = 1),
                 Value = runif(1826, 0, 5))

df <- df %>% mutate(Year = year(Date), Mnth = month(Date), Day = day(Date))
Stats <- df %>% group_by(Mnth, Day) %>% 
  summarize(Avg = mean(Value), Max = max(Value), Min = min(Value), 
            MinYear = min(Year), MaxYear = max(Year))
head(Stats)
#> # A tibble: 6 x 7
#> # Groups:   Mnth [1]
#>    Mnth   Day   Avg   Max   Min MinYear MaxYear
#>   <dbl> <int> <dbl> <dbl> <dbl>   <dbl>   <dbl>
#> 1     1     1  2.72  4.26 1.38     2015    2019
#> 2     1     2  2.46  3.60 1.16     2015    2019
#> 3     1     3  1.82  3.81 0.492    2015    2019
#> 4     1     4  3.38  4.64 0.938    2015    2019
#> 5     1     5  2.98  4.50 1.24     2015    2019
#> 6     1     6  2.32  4.25 0.148    2015    2019

^{Created on 2019-05-16 by the reprex package (v0.2.1)}
I am not sure how you want to handle your data set of ten years; either stepping through 2010-2014, 2011-2015 ... 2015-2019 or just do two calculations 2010-2014 and 2015-2019. In either case, you can construct a for loop and use the filter() function to grab the subset that you need. Each output data frame can be stored in a list. I can provide more detail if you can explain how you want to split up the years.

cj_risk · May 21, 2019, 5:40pm

Very helpful. Thanks.

I broke it into 2 datasets. Anything before 5 years was not really comparable due to changes in the underlying market structure.

Thanks again for your feedback.

CJ

system · June 11, 2019, 5:40pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.