Linear regression with 1 continuous predictor & 1 categorical predictor

Hi. Suppose I have one continuous predictor X1 and one categorical predictor X2. I fit a linear regression, and now I want a prediction for a particular value of X1, averaged over all values of X2. I am not sure how to handle X2.

df <- data.frame(salary=c(10,20,30,40,50,5,10,15,20,25),
years=c(1,2,3,4,5,1,2,3,4,5),
gender=c("M","M","M","M","M","F","F","F","F","F"))

df$gender <- ifelse(df$gender=="F",0,1)
df$gender <- factor(df$gender)
model <- lm(salary ~ years + gender, df)
summary(model)
newdata <- data.frame(years=1, gender=mean(as.numeric(df$gender)))
predict(model, newdata)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -7.5000     3.4069  -2.201 0.063600 .
years         7.5000     0.9449   7.937 9.58e-05 ***
gender1      15.0000     2.6726   5.612 0.000805 ***

I get the following error:

Error: variable 'gender' was fitted with type "factor" but type "numeric" was supplied
In addition: Warning message:
In model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
variable 'gender' is not a factor.

I realize I can't really average men and women ...

Using your code, the gender value in "newdata" is 1.5, which makes no sense: as.numeric() on a factor returns the internal level codes (1 and 2), not the labels 0 and 1. If I understand the question correctly, a prediction for a particular value of X1, averaged over all values of gender, effectively calls for a different model:

model2 <- lm(salary ~ years, df)
summary(model2)
newdata <- data.frame(years=1)
predict(model2, newdata)

...which gives the prediction of 7.5.

Stephen
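
If gender should stay in the model as a factor, one alternative is to predict once per gender level and then average those predictions. The following is a minimal sketch using the data from the original post; weighting by each level's share of the sample is an assumption about what "averaged" should mean, and the names newdata, w, avg_pred and pred_model2 are only illustrative.

```r
# Data from the original post, with gender kept as a factor
df <- data.frame(
  salary = c(10, 20, 30, 40, 50, 5, 10, 15, 20, 25),
  years  = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  gender = factor(c("M", "M", "M", "M", "M", "F", "F", "F", "F", "F"))
)

model <- lm(salary ~ years + gender, data = df)

# One row per gender level, all at years = 1, with the same factor levels
newdata <- data.frame(
  years  = 1,
  gender = factor(levels(df$gender), levels = levels(df$gender))
)
pred_by_gender <- predict(model, newdata)

# Weight each level's prediction by its share of the sample
w <- as.numeric(prop.table(table(df$gender)))
avg_pred <- sum(w * pred_by_gender)

# For comparison: the model that leaves gender out
model2 <- lm(salary ~ years, data = df)
pred_model2 <- unname(predict(model2, data.frame(years = 1)))

avg_pred     # 7.5
pred_model2  # 7.5
```

The two numbers agree here because the data are balanced: men and women have identical values of years, so years and gender are uncorrelated in the sample. When the predictors are correlated, the averaged prediction from the full model and the prediction from the reduced model will generally differ.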

The problem is that gender is a factor in the original model and numeric in predict(). The easiest thing might be to set gender to either 0 or 1 and not make it a factor. I think your code will work then.

Startz, setting gender to 1 and not making it a factor gives

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.000      6.875   0.000   1.0000
years          7.500      2.073   3.618   0.0068 **
gender            NA         NA      NA       NA

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.27 on 8 degrees of freedom
Multiple R-squared: 0.6207, Adjusted R-squared: 0.5733
F-statistic: 13.09 on 1 and 8 DF, p-value: 0.006801

1
7.5
Warning message:
In predict.lm(model, newdata) :
prediction from a rank-deficient fit may be misleading

Sorry, I meant set gender to 0 for women and 1 for men, as in your line

df$gender <- ifelse(df$gender=="F",0,1)

Actually, I think if you just delete the line

df$gender <- factor(df$gender)

everything will work.

Startz, yes

df$gender <- ifelse(df$gender=="F",0,1)
newdata <- data.frame(years=1, gender=mean(df$gender))

seems to work.

I sort of feel like there is no perfect solution here. I do want gender as a predictor variable in the regression. But I know it is a factor variable.

Would this technique start to get me in trouble if I had several categorical predictors, or categorical predictors with more levels?

If a categorical variable only has two values, then coding it as 0/1 gives us a mean which represents the fraction in the category coded as 1. That's a reasonable thing to do.

If there are more than two values, then this doesn't work. An alternative is to make a 0/1 variable (a dummy variable) for each value. Suppose you have Black/White/Asian. Then for each row one of those variables is a one and the others are zeros. You can then predict using the mean of each variable. (Whether that really gets at what you want is a different question.)
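
To make the "predict using the mean of each variable" idea concrete, here is a rough sketch for a three-level categorical predictor. The data frame toy, its values, and the level names A/B/C are made up purely for illustration. It lets R build the dummies from a factor (with an intercept, so one level is the baseline) and shows that plugging in the dummy means gives the same number as predicting for each level and weighting by level frequency.

```r
# Made-up illustrative data: one continuous and one three-level predictor
toy <- data.frame(
  y     = c(12, 15, 19, 22, 30, 33, 8, 11, 13, 17),
  years = c(1, 2, 3, 4, 2, 3, 1, 2, 3, 4),
  group = factor(c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C"))
)

fit <- lm(y ~ years + group, data = toy)

# Approach 1: plug the column means of the dummies into the linear predictor
X <- model.matrix(fit)      # columns: (Intercept), years, groupB, groupC
x_bar <- colMeans(X)        # dummy means are the sample fractions of B and C
x_bar["years"] <- 1         # fix years at the value of interest
pred_at_means <- sum(coef(fit) * x_bar)

# Approach 2: predict each level separately, then weight by level frequency
nd <- data.frame(
  years = 1,
  group = factor(levels(toy$group), levels = levels(toy$group))
)
w <- as.numeric(prop.table(table(toy$group)))
pred_weighted <- sum(w * predict(fit, nd))

all.equal(pred_at_means, pred_weighted)
```

The agreement follows from the model being linear in the dummies. The second approach has the practical advantage that the factor never has to be converted to numbers by hand.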

If you include an intercept in the regression then the number of dummy variables is the number of categories minus one. If you use a dummy variable for each of three categories, the three dummies sum to one in every row, so together they are perfectly collinear with the constant term (intercept).

Oops, of course. In the example @fcas80 ran there was one dummy. If you have three dummies it is probably best to omit the intercept and use all three. (If you have more than one categorical variable, this doesn't work. You have to either include the intercept and omit one level from each categorical variable, or omit the intercept and omit one level from every categorical variable except one.)
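
The collinearity issue is easy to see in a small experiment. The sketch below uses made-up data (the data frame toy and the dummy names dA, dB, dC are hypothetical) with a single three-level categorical predictor, and compares the rank-deficient fit with the two usual ways of avoiding it.

```r
# Made-up illustrative data with a three-level categorical predictor
toy <- data.frame(
  y     = c(12, 15, 19, 22, 30, 33, 8, 11, 13, 17),
  group = factor(c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C"))
)

# One 0/1 dummy per level
toy$dA <- as.numeric(toy$group == "A")
toy$dB <- as.numeric(toy$group == "B")
toy$dC <- as.numeric(toy$group == "C")

# Intercept plus all three dummies: dA + dB + dC equals the intercept column,
# so lm() cannot estimate every coefficient and reports NA for one of them
fit_trap <- lm(y ~ dA + dB + dC, data = toy)
coef(fit_trap)

# Option 1: keep the intercept and drop one dummy (A becomes the baseline)
fit_baseline <- lm(y ~ dB + dC, data = toy)

# Option 2: drop the intercept and keep all three dummies;
# with no other predictors, each coefficient is that group's mean of y
fit_no_int <- lm(y ~ 0 + dA + dB + dC, data = toy)
coef(fit_no_int)

# Both full-rank parameterizations describe the same fit
all.equal(fitted(fit_baseline), fitted(fit_no_int))
```

The two full-rank versions differ only in how the coefficients are read: relative to a baseline level in the first, and as per-level values in the second.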