Problems in R
are best considered as application of school alegbra: f(x) = y, involving the three objects f, x and y, where object f is some function, which may be a composite function, g(f(x)), x, the object to hand (usually the data) and y, some desired object as the return value.
In this case x is a data frame containing one variable, Y and X_1 \dots X_n variables to be regressed against. lm
is available as f and y is to be derived thusly
x <- seq(1:6)
y <- c(17.2,11.0,15,4,9,12)
fit <- lm(x ~ y)
summary(fit)
#>
#> Call:
#> lm(formula = x ~ y)
#>
#> Residuals:
#> 1 2 3 4 5 6
#> -1.3348 -1.5732 0.2258 -0.9715 1.0273 2.6265
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.7705 2.1244 2.716 0.0532 .
#> y -0.1998 0.1751 -1.141 0.3177
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.817 on 4 degrees of freedom
#> Multiple R-squared: 0.2454, Adjusted R-squared: 0.05676
#> F-statistic: 1.301 on 1 and 4 DF, p-value: 0.3177
fit$residuals
#> 1 2 3 4 5 6
#> -1.3347786 -1.5732425 0.2257665 -0.9715082 1.0272530 2.6265098
Created on 2020-09-29 by the reprex package (v0.3.0.9001)
Inspection leads to the identification of fit$residuals
as the location of the desired content for each entry in y. It is a numeric vector.
There remains the construction of g() and h(). g will perform the stepwise regressions and h will capture the respective fit$residuals
The foregoing is in aid of emphasizing the functional orientation of R
. Although the language has procedural/imperative constructs available, they are usually not the preferred approach. Programmers experienced in language such as the C
languages and Python should consider implementing functions in those languages and making them available via {Rcpp} or {retriculate}, respectively.
Next, in selecting or constructing g, bear in mind that R
is a mature language developed and maintained by and for statisticians. Except at pre-publication stages of statistical theory advances, one or more packages already exist for most problems involving statistical analysis. What's more, the basics have had 30 years' of development and scrutiny.
To begin, rseek, an R
tuned front end to Google is a good source. The first hit on stepwise
is the {olsrr} package vignette.
library(olsrr)
#>
#> Attaching package: 'olsrr'
#> The following object is masked from 'package:datasets':
#>
#> rivers
model <- lm(y ~ ., data = surgical)
ols_step_forward_p(model, details = TRUE) -> a
#> Forward Selection Method
#> ---------------------------
#>
#> Candidate Terms:
#>
#> 1. bcs
#> 2. pindex
#> 3. enzyme_test
#> 4. liver_test
#> 5. age
#> 6. gender
#> 7. alc_mod
#> 8. alc_heavy
#>
#> We are selecting variables based on p value...
#>
#>
#> Forward Selection: Step 1
#>
#> - liver_test
#>
#> Model Summary
#> -----------------------------------------------------------------
#> R 0.674 RMSE 296.299
#> R-Squared 0.455 Coef. Var 42.202
#> Adj. R-Squared 0.444 MSE 87793.232
#> Pred R-Squared 0.386 MAE 212.857
#> -----------------------------------------------------------------
#> RMSE: Root Mean Square Error
#> MSE: Mean Square Error
#> MAE: Mean Absolute Error
#>
#> ANOVA
#> -----------------------------------------------------------------------
#> Sum of
#> Squares DF Mean Square F Sig.
#> -----------------------------------------------------------------------
#> Regression 3804272.477 1 3804272.477 43.332 0.0000
#> Residual 4565248.060 52 87793.232
#> Total 8369520.537 53
#> -----------------------------------------------------------------------
#>
#> Parameter Estimates
#> -------------------------------------------------------------------------------------------
#> model Beta Std. Error Std. Beta t Sig lower upper
#> -------------------------------------------------------------------------------------------
#> (Intercept) 15.191 111.869 0.136 0.893 -209.290 239.671
#> liver_test 250.305 38.025 0.674 6.583 0.000 174.003 326.607
#> -------------------------------------------------------------------------------------------
#>
#>
#>
#> Forward Selection: Step 2
#>
#> - alc_heavy
#>
#> Model Summary
#> -----------------------------------------------------------------
#> R 0.753 RMSE 266.648
#> R-Squared 0.567 Coef. Var 37.979
#> Adj. R-Squared 0.550 MSE 71101.387
#> Pred R-Squared 0.487 MAE 187.393
#> -----------------------------------------------------------------
#> RMSE: Root Mean Square Error
#> MSE: Mean Square Error
#> MAE: Mean Absolute Error
#>
#> ANOVA
#> -----------------------------------------------------------------------
#> Sum of
#> Squares DF Mean Square F Sig.
#> -----------------------------------------------------------------------
#> Regression 4743349.776 2 2371674.888 33.356 0.0000
#> Residual 3626170.761 51 71101.387
#> Total 8369520.537 53
#> -----------------------------------------------------------------------
#>
#> Parameter Estimates
#> --------------------------------------------------------------------------------------------
#> model Beta Std. Error Std. Beta t Sig lower upper
#> --------------------------------------------------------------------------------------------
#> (Intercept) -5.069 100.828 -0.050 0.960 -207.490 197.352
#> liver_test 234.597 34.491 0.632 6.802 0.000 165.353 303.841
#> alc_heavy 342.183 94.156 0.338 3.634 0.001 153.157 531.208
#> --------------------------------------------------------------------------------------------
#>
#>
#>
#> Forward Selection: Step 3
#>
#> - enzyme_test
#>
#> Model Summary
#> -----------------------------------------------------------------
#> R 0.812 RMSE 238.914
#> R-Squared 0.659 Coef. Var 34.029
#> Adj. R-Squared 0.639 MSE 57080.128
#> Pred R-Squared 0.567 MAE 170.603
#> -----------------------------------------------------------------
#> RMSE: Root Mean Square Error
#> MSE: Mean Square Error
#> MAE: Mean Absolute Error
#>
#> ANOVA
#> -----------------------------------------------------------------------
#> Sum of
#> Squares DF Mean Square F Sig.
#> -----------------------------------------------------------------------
#> Regression 5515514.136 3 1838504.712 32.209 0.0000
#> Residual 2854006.401 50 57080.128
#> Total 8369520.537 53
#> -----------------------------------------------------------------------
#>
#> Parameter Estimates
#> ---------------------------------------------------------------------------------------------
#> model Beta Std. Error Std. Beta t Sig lower upper
#> ---------------------------------------------------------------------------------------------
#> (Intercept) -344.559 129.156 -2.668 0.010 -603.976 -85.141
#> liver_test 183.844 33.845 0.495 5.432 0.000 115.865 251.823
#> alc_heavy 319.662 84.585 0.315 3.779 0.000 149.769 489.555
#> enzyme_test 6.263 1.703 0.335 3.678 0.001 2.843 9.683
#> ---------------------------------------------------------------------------------------------
#>
#>
#>
#> Forward Selection: Step 4
#>
#> - pindex
#>
#> Model Summary
#> -----------------------------------------------------------------
#> R 0.866 RMSE 206.584
#> R-Squared 0.750 Coef. Var 29.424
#> Adj. R-Squared 0.730 MSE 42676.744
#> Pred R-Squared 0.669 MAE 146.473
#> -----------------------------------------------------------------
#> RMSE: Root Mean Square Error
#> MSE: Mean Square Error
#> MAE: Mean Absolute Error
#>
#> ANOVA
#> -----------------------------------------------------------------------
#> Sum of
#> Squares DF Mean Square F Sig.
#> -----------------------------------------------------------------------
#> Regression 6278360.060 4 1569590.015 36.779 0.0000
#> Residual 2091160.477 49 42676.744
#> Total 8369520.537 53
#> -----------------------------------------------------------------------
#>
#> Parameter Estimates
#> -----------------------------------------------------------------------------------------------
#> model Beta Std. Error Std. Beta t Sig lower upper
#> -----------------------------------------------------------------------------------------------
#> (Intercept) -789.012 153.372 -5.144 0.000 -1097.226 -480.799
#> liver_test 125.474 32.358 0.338 3.878 0.000 60.448 190.499
#> alc_heavy 359.875 73.754 0.355 4.879 0.000 211.660 508.089
#> enzyme_test 7.548 1.503 0.404 5.020 0.000 4.527 10.569
#> pindex 7.876 1.863 0.335 4.228 0.000 4.133 11.620
#> -----------------------------------------------------------------------------------------------
#>
#>
#>
#> Forward Selection: Step 5
#>
#> - bcs
#>
#> Model Summary
#> -----------------------------------------------------------------
#> R 0.884 RMSE 195.454
#> R-Squared 0.781 Coef. Var 27.839
#> Adj. R-Squared 0.758 MSE 38202.426
#> Pred R-Squared 0.700 MAE 137.656
#> -----------------------------------------------------------------
#> RMSE: Root Mean Square Error
#> MSE: Mean Square Error
#> MAE: Mean Absolute Error
#>
#> ANOVA
#> -----------------------------------------------------------------------
#> Sum of
#> Squares DF Mean Square F Sig.
#> -----------------------------------------------------------------------
#> Regression 6535804.090 5 1307160.818 34.217 0.0000
#> Residual 1833716.447 48 38202.426
#> Total 8369520.537 53
#> -----------------------------------------------------------------------
#>
#> Parameter Estimates
#> ------------------------------------------------------------------------------------------------
#> model Beta Std. Error Std. Beta t Sig lower upper
#> ------------------------------------------------------------------------------------------------
#> (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
#> liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
#> alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
#> enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
#> pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
#> bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
#> ------------------------------------------------------------------------------------------------
#>
#>
#>
#> No more variables to be added.
#>
#> Variables Entered:
#>
#> + liver_test
#> + alc_heavy
#> + enzyme_test
#> + pindex
#> + bcs
#>
#>
#> Final Model Output
#> ------------------
#>
#> Model Summary
#> -----------------------------------------------------------------
#> R 0.884 RMSE 195.454
#> R-Squared 0.781 Coef. Var 27.839
#> Adj. R-Squared 0.758 MSE 38202.426
#> Pred R-Squared 0.700 MAE 137.656
#> -----------------------------------------------------------------
#> RMSE: Root Mean Square Error
#> MSE: Mean Square Error
#> MAE: Mean Absolute Error
#>
#> ANOVA
#> -----------------------------------------------------------------------
#> Sum of
#> Squares DF Mean Square F Sig.
#> -----------------------------------------------------------------------
#> Regression 6535804.090 5 1307160.818 34.217 0.0000
#> Residual 1833716.447 48 38202.426
#> Total 8369520.537 53
#> -----------------------------------------------------------------------
#>
#> Parameter Estimates
#> ------------------------------------------------------------------------------------------------
#> model Beta Std. Error Std. Beta t Sig lower upper
#> ------------------------------------------------------------------------------------------------
#> (Intercept) -1178.330 208.682 -5.647 0.000 -1597.914 -758.746
#> liver_test 58.064 40.144 0.156 1.446 0.155 -22.652 138.779
#> alc_heavy 317.848 71.634 0.314 4.437 0.000 173.818 461.878
#> enzyme_test 9.748 1.656 0.521 5.887 0.000 6.419 13.077
#> pindex 8.924 1.808 0.380 4.935 0.000 5.288 12.559
#> bcs 59.864 23.060 0.241 2.596 0.012 13.498 106.230
#> ------------------------------------------------------------------------------------------------
a
#>
#> Selection Summary
#> ------------------------------------------------------------------------------
#> Variable Adj.
#> Step Entered R-Square R-Square C(p) AIC RMSE
#> ------------------------------------------------------------------------------
#> 1 liver_test 0.4545 0.4440 62.5119 771.8753 296.2992
#> 2 alc_heavy 0.5667 0.5498 41.3681 761.4394 266.6484
#> 3 enzyme_test 0.6590 0.6385 24.3379 750.5089 238.9145
#> 4 pindex 0.7501 0.7297 7.5373 735.7146 206.5835
#> 5 bcs 0.7809 0.7581 3.1925 730.6204 195.4544
#> ------------------------------------------------------------------------------
str(a)
#> List of 11
#> $ predictors: chr [1:5] "liver_test" "alc_heavy" "enzyme_test" "pindex" ...
#> $ mallows_cp: num [1:5] 62.51 41.37 24.34 7.54 3.19
#> $ indvar : chr [1:8] "bcs" "pindex" "enzyme_test" "liver_test" ...
#> $ rsquare : num [1:5] 0.455 0.567 0.659 0.75 0.781
#> $ steps : num 5
#> $ sbic : num [1:5] 616 606 595 583 580
#> $ adjr : num [1:5] 0.444 0.55 0.639 0.73 0.758
#> $ rmse : num [1:5] 296 267 239 207 195
#> $ aic : num [1:5] 772 761 751 736 731
#> $ sbc : num [1:5] 778 769 760 748 745
#> $ model :List of 12
#> ..$ coefficients : Named num [1:6] -1178.33 58.06 317.85 9.75 8.92 ...
#> .. ..- attr(*, "names")= chr [1:6] "(Intercept)" "liver_test" "alc_heavy" "enzyme_test" ...
#> ..$ residuals : Named num [1:54] -20.99 7.46 2.2 -29.59 785.83 ...
#> .. ..- attr(*, "names")= chr [1:54] "1" "2" "3" "4" ...
#> ..$ effects : Named num [1:54] -5159 1950 969 879 -873 ...
#> .. ..- attr(*, "names")= chr [1:54] "(Intercept)" "liver_test" "alc_heavy" "enzyme_test" ...
#> ..$ rank : int 6
#> ..$ fitted.values: Named num [1:54] 716 396 708 379 1557 ...
#> .. ..- attr(*, "names")= chr [1:54] "1" "2" "3" "4" ...
#> ..$ assign : int [1:6] 0 1 2 3 4 5
#> ..$ qr :List of 5
#> .. ..$ qr : num [1:54, 1:6] -7.348 0.136 0.136 0.136 0.136 ...
#> .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. ..$ : chr [1:54] "1" "2" "3" "4" ...
#> .. .. .. ..$ : chr [1:6] "(Intercept)" "liver_test" "alc_heavy" "enzyme_test" ...
#> .. .. ..- attr(*, "assign")= int [1:6] 0 1 2 3 4 5
#> .. ..$ qraux: num [1:6] 1.14 1.13 1.05 1.22 1.01 ...
#> .. ..$ pivot: int [1:6] 1 2 3 4 5 6
#> .. ..$ tol : num 1e-07
#> .. ..$ rank : int 6
#> .. ..- attr(*, "class")= chr "qr"
#> ..$ df.residual : int 48
#> ..$ xlevels : Named list()
#> ..$ call : language lm(formula = paste(response, "~", paste(preds, collapse = " + ")), data = l)
#> ..$ terms :Classes 'terms', 'formula' language y ~ liver_test + alc_heavy + enzyme_test + pindex + bcs
#> .. .. ..- attr(*, "variables")= language list(y, liver_test, alc_heavy, enzyme_test, pindex, bcs)
#> .. .. ..- attr(*, "factors")= int [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
#> .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. ..$ : chr [1:6] "y" "liver_test" "alc_heavy" "enzyme_test" ...
#> .. .. .. .. ..$ : chr [1:5] "liver_test" "alc_heavy" "enzyme_test" "pindex" ...
#> .. .. ..- attr(*, "term.labels")= chr [1:5] "liver_test" "alc_heavy" "enzyme_test" "pindex" ...
#> .. .. ..- attr(*, "order")= int [1:5] 1 1 1 1 1
#> .. .. ..- attr(*, "intercept")= int 1
#> .. .. ..- attr(*, "response")= int 1
#> .. .. ..- attr(*, ".Environment")=<environment: 0x55bb376aeea0>
#> .. .. ..- attr(*, "predvars")= language list(y, liver_test, alc_heavy, enzyme_test, pindex, bcs)
#> .. .. ..- attr(*, "dataClasses")= Named chr [1:6] "numeric" "numeric" "numeric" "numeric" ...
#> .. .. .. ..- attr(*, "names")= chr [1:6] "y" "liver_test" "alc_heavy" "enzyme_test" ...
#> ..$ model :'data.frame': 54 obs. of 6 variables:
#> .. ..$ y : int [1:54] 695 403 710 349 2343 348 518 749 1056 968 ...
#> .. ..$ liver_test : num [1:54] 2.59 1.7 2.16 2.01 4.3 1.42 1.91 2.57 2.5 2.4 ...
#> .. ..$ alc_heavy : int [1:54] 0 0 0 0 1 0 1 0 0 0 ...
#> .. ..$ enzyme_test: int [1:54] 81 66 83 41 115 72 63 81 93 94 ...
#> .. ..$ pindex : int [1:54] 62 59 57 73 65 38 46 68 67 76 ...
#> .. ..$ bcs : num [1:54] 6.7 5.1 7.4 6.5 7.8 5.8 5.7 3.7 6 3.7 ...
#> .. ..- attr(*, "terms")=Classes 'terms', 'formula' language y ~ liver_test + alc_heavy + enzyme_test + pindex + bcs
#> .. .. .. ..- attr(*, "variables")= language list(y, liver_test, alc_heavy, enzyme_test, pindex, bcs)
#> .. .. .. ..- attr(*, "factors")= int [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
#> .. .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. .. ..$ : chr [1:6] "y" "liver_test" "alc_heavy" "enzyme_test" ...
#> .. .. .. .. .. ..$ : chr [1:5] "liver_test" "alc_heavy" "enzyme_test" "pindex" ...
#> .. .. .. ..- attr(*, "term.labels")= chr [1:5] "liver_test" "alc_heavy" "enzyme_test" "pindex" ...
#> .. .. .. ..- attr(*, "order")= int [1:5] 1 1 1 1 1
#> .. .. .. ..- attr(*, "intercept")= int 1
#> .. .. .. ..- attr(*, "response")= int 1
#> .. .. .. ..- attr(*, ".Environment")=<environment: 0x55bb376aeea0>
#> .. .. .. ..- attr(*, "predvars")= language list(y, liver_test, alc_heavy, enzyme_test, pindex, bcs)
#> .. .. .. ..- attr(*, "dataClasses")= Named chr [1:6] "numeric" "numeric" "numeric" "numeric" ...
#> .. .. .. .. ..- attr(*, "names")= chr [1:6] "y" "liver_test" "alc_heavy" "enzyme_test" ...
#> ..- attr(*, "class")= chr "lm"
#> - attr(*, "class")= chr "ols_step_forward_p"
Created on 2020-09-29 by the reprex package (v0.3.0.9001)
Inspection of the output structure will identify locations of the residuals from which h may be composed from ols_step_forward_p()
as g.