Error: variable lengths differ

Hi, I am relatively new to R and am not a statistician.

I have been asked to fit a weighted least squares regression model to a dataset in which a few variables are not particularly normally distributed. There is a relatively small percentage of missing values and there are not many outliers, apart from one variable with 9%, according to diagnose_outlier from the dlookr package.

My dataset has 6489 cases with 38 variables, but 2 of these are identifiers, 2 will be the outcome variables (tested separately) and I will be fitting 12 (11 numeric and 1 factor) variables as predictors in my final model.

To practice and ensure I am using the correct code, I thought I would fit a limited model.
I named my data database10B.
The dependent variable is LE15to19F.
The two independent variables are IMDin2019 (numeric) and regioncode_ (factor with 8 levels).
The numbers of missing values are:
LE15to19F = 12; IMDin2019 = 24; regioncode_ = 0.

Across the whole dataset, the number of missing values is different for each variable. Every case has values for most of the variables.

This is what I have done so far:

lm_model01F <- lm(LE15to19F ~ IMDin2019+regioncode_, database10B)

summary(lm_model01F)

Call:
lm(formula = LE15to19F ~ IMDin2019 + regioncode_, data = database10B)

Residuals:
    Min      1Q  Median      3Q     Max
-4.8685 -0.9032 -0.1327  0.7103  9.8577

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept) 86.500124   0.037134 2329.386   <2e-16 ***
IMDin2019   -0.135691   0.001423  -95.326   <2e-16 ***
regioncode_  1.039877   0.941622    1.104    0.269

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.331 on 6466 degrees of freedom
(20 observations deleted due to missingness)
Multiple R-squared: 0.5843, Adjusted R-squared: 0.5841
F-statistic: 4544 on 2 and 6466 DF, p-value: < 2.2e-16

sd_variance <- sd(apply(X = database10B[,c('IMDin2019', 'regioncode_')],
                        MARGIN = 2,
                        FUN = var,
                        na.rm = TRUE))

wls_model01F <- lm(LE15to19F ~ IMDin2019 + regioncode_,
                   data = database10B,
                   weights = 1/sd_variance,
                   na.rm = TRUE)

Error in model.frame.default(formula = LE15to19F ~ IMDin2019 + regioncode_, :
variable lengths differ (found for '(weights)')

It looks like everything was working up to when I tried to fit a weighted model. I was struggling before with producing a weights object. Someone kindly suggested that I calculate sd_variance for the variables used in the model, which I think I have worked out how to do, as sd_variance appears as a value (=95.88) in my global environment.

I am not sure how to get around this. I am guessing that the error arises from the differing number of NAs in each column, but having looked this up, I am confused by the possible solutions. Is this the problem, or is there something else? Do I need any additional steps, or is there incorrect code somewhere?

Apologies if I have missed something obvious or basic.
Any advice or guidance would be gratefully received.

Many thanks,
Louis

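When you do a weighted regression, you need one weight for each observation, so the weights argument has to be a vector with as many entries as there are rows in the data. A single number such as sd_variance (95.88) is what triggers the "variable lengths differ" error. The weights should also reflect the standard deviation of the error term in the regression, not the standard deviation of the observed variables.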
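To make that concrete, here is a minimal sketch on simulated data. The data frame dat and its columns y, x and g are hypothetical stand-ins for database10B, and modelling the log of the squared OLS residuals is just one common way to estimate the error variance, an assumption on my part rather than the only option. It first reproduces the error with a single-number weight, then builds one weight per row.

```r
# Simulated stand-in for database10B (hypothetical data, not the poster's).
set.seed(1)
n <- 200
dat <- data.frame(
  x = runif(n, 1, 50),
  g = factor(sample(letters[1:3], n, replace = TRUE))
)
# The error spread grows with x, so the errors are heteroscedastic.
dat$y <- 85 - 0.13 * dat$x + rnorm(n, sd = 0.05 * dat$x)
# Add a few missing values, as in the original data.
dat$y[c(3, 17)] <- NA
dat$x[c(8, 40, 90)] <- NA

# A single number as weights reproduces the error from the question.
msg <- tryCatch(
  lm(y ~ x + g, data = dat, weights = 1 / 95.88),
  error = function(e) conditionMessage(e)
)

# Step 1: ordinary least squares. na.exclude pads residuals with NA
# so they stay aligned with the rows of dat.
ols <- lm(y ~ x + g, data = dat, na.action = na.exclude)

# Step 2: model the error variance (assumed here to depend on x) on the
# log scale, so the estimated variances are always positive.
dat$log_res2 <- log(residuals(ols)^2)
var_model <- lm(log_res2 ~ x, data = dat, na.action = na.exclude)

# Step 3: one weight per row, proportional to 1 / estimated error variance.
dat$w <- 1 / exp(fitted(var_model))

# Step 4: weighted fit. Rows with NA in y, x, g or w are dropped together.
wls <- lm(y ~ x + g, data = dat, weights = w)
```

Because the weights live in a column of the same data frame, lm drops the incomplete rows for the response, predictors and weights in one go, so no na.rm argument is needed (lm handles missing values through na.action instead). Which variables drive the error variance in the real data is something to check from residual plots first.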

Thanks. That makes things much clearer for me. I will follow your advice.

Best wishes,
Louis
