How to interpret linear regression coefficients?

We are trying to understand the impact of the number of workdays on sales.
Please find a reprex below:

library(tidyverse)

# Work days for January from 2010 - 2018
data = data.frame(work_days = c(20,21,22,20,20,22,21,21),
           sale = c(1205,2111,2452,2054,2440,1212,1211,2111))

# Apply linear regression
model = lm(sale ~ work_days, data)

summary(model)
Call:
lm(formula = sale ~ work_days, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-677.8 -604.5  218.7  339.0  645.3 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2643.82    5614.16   0.471    0.654
work_days     -38.05     268.75  -0.142    0.892

Residual standard error: 593.4 on 6 degrees of freedom
Multiple R-squared:  0.00333,	Adjusted R-squared:  -0.1628 
F-statistic: 0.02005 on 1 and 6 DF,  p-value: 0.892

Could you please help me understand the coefficients?
Does every work day decrease the sale by 38.05?

##############################################


data = data.frame(work_days = c(20,21,22,20,20,22,21,21),
           sale = c(1212,1211,2111,1205,2111,2452,2054,2440))

model = lm(sale ~ work_days, data)

summary(model)
Call:
lm(formula = sale ~ work_days, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-686.8 -301.0   -8.6  261.3  599.7 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -6220.0     4555.9  -1.365    0.221
work_days      386.6      218.1   1.772    0.127

Residual standard error: 481.5 on 6 degrees of freedom
Multiple R-squared:  0.3437,	Adjusted R-squared:  0.2343 
F-statistic: 3.142 on 1 and 6 DF,  p-value: 0.1267

Does this mean every workday increases the sales by 387?
And how about the negative intercept?

Hi @AbhishekHP,

Looking only at the coefficients is a bit risky. Your first regression has an R^2 of practically zero, so you should not really interpret anything. The problem is that your x variable (work_days) has very little variation. Theoretically, the smaller the variation in the explanatory (x) variable, the larger the standard error of the OLS slope estimator. Intuitively, if your x is almost constant, it barely gets a chance to "explain" any variation in a given y.
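
A minimal sketch of that point in R, reusing the first data set (the x/y names and the by-hand calculation are just mine for illustration): the textbook formula for the slope's standard error divides the residual standard error by the square root of the spread of x around its mean, so an almost constant x makes that denominator tiny and the standard error huge.

# First data set from the original post
x <- c(20, 21, 22, 20, 20, 22, 21, 21)
y <- c(1205, 2111, 2452, 2054, 2440, 1212, 1211, 2111)
fit <- lm(y ~ x)

s   <- summary(fit)$sigma      # residual standard error (593.4)
sxx <- sum((x - mean(x))^2)    # spread of x around its mean -- only 4.875 here
s / sqrt(sxx)                  # ~268.75, the Std. Error reported for work_days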

Further, in both regressions the p-values of both coefficients are above 0.1 (which again comes back to the issue of too little variation in your x variable). So I wouldn't conclude anything from these regressions, other than that there is insufficient evidence (data) to show any relationship between work_days and sale.
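
A quick way to see the same thing (just a sketch; run it on either fitted model from the reprex): the 95% confidence interval for the work_days coefficient contains zero in both regressions, which is the flip side of those large p-values.

# 95% confidence intervals for the fitted coefficients; the work_days
# interval spans zero, i.e. the data cannot rule out "no effect at all"
confint(model)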


I agree completely with valeri. Plotting the data can help you understand what you are dealing with. Neither data set shows a convincing trend, especially the first one.

library(ggplot2)
data = data.frame(work_days = c(20,21,22,20,20,22,21,21),
                  sale = c(1205,2111,2452,2054,2440,1212,1211,2111))
ggplot(data, aes(work_days, sale)) + geom_point() + geom_smooth(method = "lm")

summary(lm(sale ~ work_days, data))
#> 
#> Call:
#> lm(formula = sale ~ work_days, data = data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -677.8 -604.5  218.7  339.0  645.3 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  2643.82    5614.16   0.471    0.654
#> work_days     -38.05     268.75  -0.142    0.892
#> 
#> Residual standard error: 593.4 on 6 degrees of freedom
#> Multiple R-squared:  0.00333,    Adjusted R-squared:  -0.1628 
#> F-statistic: 0.02005 on 1 and 6 DF,  p-value: 0.892


data = data.frame(work_days = c(20,21,22,20,20,22,21,21),
                  sale = c(1212,1211,2111,1205,2111,2452,2054,2440))
ggplot(data, aes(work_days, sale)) + geom_point() + geom_smooth(method = "lm")

summary(lm(sale ~ work_days, data))
#> 
#> Call:
#> lm(formula = sale ~ work_days, data = data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -686.8 -301.0   -8.6  261.3  599.7 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -6220.0     4555.9  -1.365    0.221
#> work_days      386.6      218.1   1.772    0.127
#> 
#> Residual standard error: 481.5 on 6 degrees of freedom
#> Multiple R-squared:  0.3437, Adjusted R-squared:  0.2343 
#> F-statistic: 3.142 on 1 and 6 DF,  p-value: 0.1267

Created on 2019-09-16 by the reprex package (v0.2.1)


Thanks for the detailed solution.
Could you please help me understand what the F-statistic says (interpretation)? "0.02005 on 1 and 6 DF"
And what does adjusted R-squared even mean?

Try these links for explanations of the standard summary.lm output:


On the F-statistic, I'm referring to Wikipedia here (quoted below, with a short R illustration after the quote). The 1 and 6 degrees of freedom are the so-called numerator and denominator degrees of freedom. In this particular F-test we are testing whether the regression model (here the single x variable, so 2 parameters - intercept and slope - call this p_2 as in the Wikipedia excerpt below) explains the variation in y better than a simple model with only an intercept (which just predicts the mean) - that is the nested model with 1 parameter - so the numerator df = 2 - 1 = 1. The denominator df is n - p_2 = 8 - 2 = 6, where n is the number of observations in your sample.

Consider two models, 1 and 2, where model 1 is 'nested' within model 2. Model 1 is the restricted model, and model 2 is the unrestricted one. That is, model 1 has p_1 parameters, and model 2 has p_2 parameters, where p_1 < p_2, and for any choice of parameters in model 1, the same regression curve can be achieved by some choice of the parameters of model 2.

One common context in this regard is that of deciding whether a model fits the data significantly better than does a naive model, in which the only explanatory term is the intercept term, so that all predicted values for the dependent variable are set equal to that variable's sample mean. The naive model is the restricted model, since the coefficients of all potential explanatory variables are restricted to equal zero.

Another common context is deciding whether there is a structural break in the data: here the restricted model uses all data in one regression, while the unrestricted model uses separate regressions for two different subsets of the data. This use of the F-test is known as the Chow test.

The model with more parameters will always be able to fit the data at least as well as the model with fewer parameters. Thus typically model 2 will give a better (i.e. lower error) fit to the data than model 1. But one often wants to determine whether model 2 gives a significantly better fit to the data. One approach to this problem is to use an F-test.

If there are n data points to estimate parameters of both models from, then one can calculate the F statistic, given by

F = \frac{(\text{RSS}_1 - \text{RSS}_2) / (p_2 - p_1)}{\text{RSS}_2 / (n - p_2)}

where RSS_i is the residual sum of squares of model i. If the regression model has been calculated with weights, then replace RSS_i with χ², the weighted sum of squared residuals. Under the null hypothesis that model 2 does not provide a significantly better fit than model 1, F will have an F distribution with (p_2 − p_1, n − p_2) degrees of freedom. The null hypothesis is rejected if the F calculated from the data is greater than the critical value of the F-distribution for some desired false-rejection probability (e.g. 0.05). The F-test is a Wald test.
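
To connect that back to your first regression, here is a small sketch (the model names null_model/full_model are just mine): the F-statistic that summary() reports is exactly this nested comparison of the intercept-only model against the model with work_days, and anova() reproduces it.

# First data set from the original post
data <- data.frame(work_days = c(20, 21, 22, 20, 20, 22, 21, 21),
                   sale      = c(1205, 2111, 2452, 2054, 2440, 1212, 1211, 2111))

# Restricted model: intercept only, so every prediction is just mean(sale)
null_model <- lm(sale ~ 1, data)

# Unrestricted model with work_days, as in the original post
full_model <- lm(sale ~ work_days, data)

# Nested-model F-test: reproduces F = 0.02005 on 1 and 6 DF, p-value = 0.892
anova(null_model, full_model)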


Thank you for taking the time to educate us.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.