Linear multiple Regression with autoregressive term

Hello everybody,

I try to do electricity price forecasting. For that I want to use following (simplyfied) regression equation:

Y_t = c1 * A_t + c2 * B_t + c3 * C_t + c4 * Y_(t-1)

As you see the first three summands are like a normal multiple linear regression, which I could easly determine with the lm-function. But the problem is, that the last summand is a autoregressive term. I already found some explanations for a autoregressive model, but I have no idea how I can merge them or if this is even possible.

So it would be very nice if anybody can help me here. Thanks in advance!

Best regards


You can write Y_(t-1) as lag(Y). Does that help?

Unfortunately not. Then R just makes all other coefficient very low (almost zero) and so they don´t have any effect on the price. Actually this doens´t makes any sense.

Further I compared these two functions:

The first one, is one for a onedimensional autoregressive model and the second for a linear regression model. So if lag(y) would work, the results would be the same. They are not...

Further acutally it´s not Y_(t-1) I´m looking for, I´m looking for Y_(t-24) (previous day).

But thanks for the proposal :slight_smile:

(1) You can use lag(Y,24) if that helps.

(2) You would not expect the two models to look exactly the same. They should have (almost) the same coefficient on the lag term, but the intercept will look different.

(3) If you are interested in just forecasting (as opposed to finding out the effects of A, B, and C) there's nothing necessarily wrong with the lag coefficient being all that matters. Not uncommon, especially if the coefficient on the lag is close to 1.0.

(1) Thanks, good to know (Sry just a R and programming beginner)

(2) The coefficient for ar.ols is 0.9252 and for the lm-function 1.0. And 1.0 can´t be a valid solution if I take a look at the data set. That woul mean the price is alawys the price of the privious hour.

(3) Actually I´m mostly interested in forecasting. My main Problem is, that my forecastprogram without the impacts of A, B, C and actually a few more, just have an R² of 0.45 (compared to past years) and this mostly depends on the inputdata I have to work with. My actual objective is, that the forecast becomse more reliable while using the coefficents. My hope is, that these coeffiecents can help me to repair the mistakes I have because of my data source.

A difference that big in the lag coefficients is actually quite surprising, so I can see why it bothers you. Having said that, 1.0 as a coefficient doesn't mean that the price is the same as in the previous hour, it means that the change in the price equals the intercept plus the random error in the regression. (This is called a "random walk.")

Two things to check:
(A) The extra variables may help the R^2 even if the individual coefficients aren't significant.
(B) You might try a higher order set of lags. For example, an AR(2) model might include both Y_(t-1) and Y_(t-2).

Ok, thanks for your advices! But I´m still not sure if lag(Y) is the solution I´m looking for.

The problem is, that in MultipleRegression.xlsx the real EPEX-Prices (Y) are implemented, because they are needed for the linear regression. If I use lag(EPEX), also the real price of the privious hour is taken, but this isn´t what I want to achieve for autoregression, isn´t it? In my opinion autoregression is a kind of solver. The autoregressive term contains my modeld price of the privious hour, with the calculated (solved) coefficients. But maybe this isn´t even possible...

If I´m wrong and for the autoregression I also have to take the real prices. I just implemented another column in my database with the prices from the previous hour. But even then I get other results as with the lag function.

I hope it´s understandable... it´s a little complex and even in my native language hard to explain :smiley:

That's a helpful picture. There is something wrong with your equation. Note that the R^2 equals 1, which means you are getting a perfect fit. You are also getting an error message to this effect. There is no way from the posting to tell what the problem is--you might want to plot your dependent variable and see if there is anything funny about it. For example, are most consecutive observations unchanged?

PS: Your English is really quite good.

This problem also appears if I try the lag function in the lm function for a simpler case with less variables.

With "lm (EPEX~lag(EPEX,1), data = Koeffizienten)" I try to represent a autoregressive term in a linear regression, as you recommended above. To make it simpler I omitted all other terms, at the point we are at the moment we don´t need them.

I can´t find anything funny in the plot.

I repeat my objective again, because I´m not sure if it´s clear right now. I want to have a regression model, which calculates the powerprice by considering linear and autoregressive terms. Autoregressive means (for me), that the calculated price depends on the calculated price (from the regression model) from a privious time step.

It sounds logical to use the lag-function to achieve this objective, but it isn´t working as you see. Maybe it´s a kind of boundary value problem, because of the first hour where no privious hour is defined. But even than, I don´t have an idea how to solve it.

PS: Sry for the late response. I really appreciate your effort!

And here the plot:

The problem is R^2==1.

You might want to print out the first dozen or so values or EPEX and show them to us.

That data looks okay. (I am surprised.) On the surface, it looks like you are doing everything right.

You are showing us a picture of an Excel sheet. Try giving the command dput(head(EPEX,n = 100)) and copying and pasting the output here. That might give someone a better shot at figuring out the problem.

I integrated the excel as "Koeffizienten", probably you meant that.

That's a picture. If you copy and paste as text from your console window then people can grab the posted data and experiment with it.

Oh sry... :smiley: didn´t get that

structure(list(EPEX = c(25.02, 18.29, 16.04, 14.6, 14.95, 14.5,
10.76, 12.01, 12.39, 14.04, 14.68, 16.08, 16.08, 16.05, 16.04,
16.1, 23.93, 26.9, 26.36, 23.98, 16.09, 14.08, 12.44, 0.04, -12.11,
-19.63, -19.37, -31.41, -46.97, -13.48, 0.08, 10.06, 14.25, 14.56,
14.64, 15.08, 15.41, 14.52, 14.7, 14.66, 16.09, 25.1, 25.06,
19.08, 14.26, 12.22, 10.4, -0.33, -0.08, -0.07, -5.03, 0.01,
-0.04, 0.03, -8.67, 4.46, 9.9, 14.29, 16.73, 22.31, 23.03, 22.4,
27.07, 29.33, 33.85, 45.96, 49.82, 47.62, 35.98, 27.26, 27.7,
14.27, 9.32, 0.02, -0.04, 0.01, 0, -0.01, -3.18, -0.05, 0.05,
9.62, 11.54, 14.68, 14.31, 11.18, 12.01, 14.36, 17.55, 28.94,
29.59, 29.42, 28.92, 28.66, 29.74, 24.9, 22.34, 17.93, 15.17,
16.38)), row.names = c(NA, -100L), class = c("tbl_df", "tbl",

The problem is the lag function. This may be because your data is not an R time series, but I'm not sure about that.


lm(EPEX[2:length(EPEX)]~ EPEX[1:(length(EPEX)-1)])

For the simple case that works!

If I want to use more variables than I also have to adjust the length of them, isn´t it?

testmodel = lm(EPEX[2:length(EPEX)]~DAP_Fundamental[2:length(EPEX)]+Ramp_Up[2:length(EPEX)]+Ramp_Down[2:length(EPEX)]+LSR_Low[2:length(EPEX)]+LSR_high[2:length(EPEX)]+EPEX[1:(length(EPEX)-1)], data = Strompreis_Mulitple)

And if I want to have not the privious hour, but the privious day than it looks like this?
testmodel = lm(EPEX[25:length(EPEX)]~DAP_Fundamental[25:length(EPEX)]+Ramp_Up[25:length(EPEX)]+Ramp_Down[25:length(EPEX)]+LSR_Low[25:length(EPEX)]+LSR_high[25:length(EPEX)]+EPEX[1:(length(EPEX)-24)], data = Strompreis_Mulitple)

Thank´s a lot!

I will not close this topic/case yet, because it´s allready late maybe I overlook somethin and I might find some other problems in the next days.

That looks right to me. Check to see that the results (except for the intercept) come close to explicit use of AR.

I just can validate it with the simple case (one variable), but here event the intercept is almost equal.