Specifying formulas to control the form of interactions using `lm`

torgo · March 17, 2023, 8:11pm

Here's a simple example that illustrates my question:

df <- data.frame(y = rnorm(10), x = rnorm(10), z = sample(c("a","b"), size = 10, replace = TRUE))

Using the * operator gives me a regression of y on 1, 1[z = b], x, 1[z=b]x.

> lm(data = df, y ~ as.factor(z)*x)

Call:
lm(formula = y ~ as.factor(z) * x, data = df)

Coefficients:
    (Intercept)    as.factor(z)b                x  as.factor(z)b:x  
        -0.2351           0.1524           0.2309          -0.2699

I would like to regress y on 1[z = a], 1[z=a]x, 1[z=b], 1[z=b]x (with no constant term). This regression will produce the same fitted values as the one above, but the interpretation of the coefficients is different, and preferable in some cases. How can I specify the formula to do this in a single regression?

Max · March 19, 2023, 2:41pm

In formulas, -1 or +0 specifies no intercept.

Maybe this does what you want:

set.seed(1)
df <-
  data.frame(
    y = rnorm(10),
    x = rnorm(10),
    z = factor(sample(c("a", "b"), size = 10, replace = TRUE))
  )

model_1 <- lm(y ~ z * x, data = df)
summary(model_1)
#> 
#> Call:
#> lm(formula = y ~ z * x, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.2091 -0.2834  0.0853  0.3167  0.7597 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -0.3996     0.4209  -0.950    0.379
#> zb            0.7696     0.4946   1.556    0.171
#> x             0.7982     0.6132   1.302    0.241
#> zb:x         -1.2128     0.6526  -1.858    0.112
#> 
#> Residual standard error: 0.6726 on 6 degrees of freedom
#> Multiple R-squared:  0.505,  Adjusted R-squared:  0.2575 
#> F-statistic: 2.041 on 3 and 6 DF,  p-value: 0.2098

model_2 <- lm(y ~ z * x - 1, data = df)
summary(model_2)
#> 
#> Call:
#> lm(formula = y ~ z * x - 1, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.2091 -0.2834  0.0853  0.3167  0.7597 
#> 
#> Coefficients:
#>      Estimate Std. Error t value Pr(>|t|)
#> za    -0.3996     0.4209  -0.950    0.379
#> zb     0.3700     0.2599   1.424    0.204
#> x      0.7982     0.6132   1.302    0.241
#> zb:x  -1.2128     0.6526  -1.858    0.112
#> 
#> Residual standard error: 0.6726 on 6 degrees of freedom
#> Multiple R-squared:  0.5203, Adjusted R-squared:  0.2005 
#> F-statistic: 1.627 on 4 and 6 DF,  p-value: 0.2827

Created on 2023-03-19 with reprex v2.0.2

If you want a fuller parameterization you might need to make the indicators yourself.
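As a rough sketch of what "making the indicators yourself" could look like for a factor with any number of levels, `model.matrix()` can build the 0/1 columns, which are then multiplied by `x`. The names `ind` and `slopes` are made up for this illustration, and the data are the same simulated `df` as above.

```r
set.seed(1)
df <- data.frame(
  y = rnorm(10),
  x = rnorm(10),
  z = factor(sample(c("a", "b"), size = 10, replace = TRUE))
)

# One 0/1 indicator column per level of z (no reference level is dropped
# because the intercept is removed with 0 +).
ind <- model.matrix(~ 0 + z, data = df)

# Multiply every indicator column by x: row i of each column is scaled by x[i],
# giving one slope regressor per level of z.
slopes <- ind * df$x
colnames(slopes) <- paste0("x_", levels(df$z))

# Group-specific intercepts (ind) and group-specific slopes (slopes),
# with no common intercept.
fit <- lm(y ~ 0 + ind + slopes, data = df)
coef(fit)
```

The matrix columns are picked up from the calling environment, so nothing has to be added to `df` itself.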

torgo · March 19, 2023, 5:27pm

Thanks. That's not quite what I want because I also want to "remove the intercept" on the x term. So that it's za, zb, za:x, and zb:x. I guess I may have to construct them manually, as you suggest.

williaml · March 20, 2023, 2:38am

Doesn't Max's second model do what you want?

torgo · March 20, 2023, 4:03pm

No, because it still has a "main effect" for x plus the incremental difference for group b.
Here's how you would do what I want by hand (creating new variables, which is what I was hoping to avoid).

set.seed(1)
df <- data.frame(y = rnorm(10), x = rnorm(10), z = factor(sample(c("a","b"))),
                 size = 10, replace = TRUE)

df$xa <- df$x * (df$z == "a")
df$xb <- df$x * (df$z == "b")

# Max's solution
> lm(data = df, y ~ 0 + x*z)

Call:
lm(formula = y ~ 0 + x * z, data = df)

Coefficients:
       x        za        zb      x:zb  
-0.44849   0.24849  -0.09732   0.59642  

> lm(data = df, y ~ 0 + z + xa + xb)

Call:
lm(formula = y ~ 0 + z + xa + xb, data = df)

Coefficients:
      za        zb        xa        xb  
 0.24849  -0.09732  -0.44849   0.14793

My desired parameterization is equivalent to running two separate regressions subset by the value of z. But often it is useful to have all of the coefficients in a single regression.
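A minimal check of that equivalence, assuming the simulated data from Max's post (with `size` and `replace` passed to `sample()`); the object names `fit_a`, `fit_b`, and `fit_joint` are only for illustration:

```r
set.seed(1)
df <- data.frame(
  y = rnorm(10),
  x = rnorm(10),
  z = factor(sample(c("a", "b"), size = 10, replace = TRUE))
)

# Two separate regressions, one per level of z.
fit_a <- lm(y ~ x, data = df, subset = z == "a")
fit_b <- lm(y ~ x, data = df, subset = z == "b")

# Single regression with hand-built group-specific slope columns.
df$xa <- df$x * (df$z == "a")
df$xb <- df$x * (df$z == "b")
fit_joint <- lm(y ~ 0 + z + xa + xb, data = df)

# Compare the point estimates: intercepts first, then slopes.
separate <- c(coef(fit_a)[1], coef(fit_b)[1], coef(fit_a)[2], coef(fit_b)[2])
all.equal(unname(coef(fit_joint)), unname(separate))
```

The point estimates should agree, but the standard errors generally will not, since the single regression pools the residual variance across both groups.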

nirgrahamuk · March 20, 2023, 4:20pm

But your handcrafted example is a single regression, and you seem to know how to specify it, so what are you asking for help with?

Are you hoping to automate this part in some way?

df$xa <- df$x * (df$z == "a")
df$xb <- df$x * (df$z == "b")

torgo · March 20, 2023, 8:19pm

Yes, this was just intended as a simple MWE. My question is whether there's functionality within formula specification syntax that can be used to automate this in more complicated examples.
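One formula that should give this parameterization without new variables is to include `z` and the `z:x` interaction but no main effect for `x`. As far as I understand R's formula handling, when `x` is absent the interaction is coded with a full set of indicators, so each level of `z` gets its own slope. A sketch on the same simulated data (the object names are illustrative):

```r
set.seed(1)
df <- data.frame(
  y = rnorm(10),
  x = rnorm(10),
  z = factor(sample(c("a", "b"), size = 10, replace = TRUE))
)

# 0 + z   : one intercept per level of z, no common intercept.
# z:x     : with no main effect for x, one slope per level of z.
fit_nested <- lm(y ~ 0 + z + z:x, data = df)
names(coef(fit_nested))

# The nesting operator is shorthand for the same terms: z/x expands to z + z:x.
fit_slash <- lm(y ~ 0 + z/x, data = df)

# Same fitted values as the z * x model; only the coefficients are arranged differently.
fit_star <- lm(y ~ z * x, data = df)
all.equal(fitted(fit_nested), fitted(fit_star))
```

This extends to factors with more levels and to additional covariates, for example `y ~ 0 + z + z:x + z:w` for a second numeric covariate `w` (hypothetical here).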
