Select variables with significant P values

I have my linear model made and now need to refine it by first identifying variables that are significant to sales.

model1 <- lm(Final_Sales_df$'Actual Net Sales' ~ ., data = Final_Sales_df[,c(4,5,6:91)])

I have exported the model summary into sheet 1 of excel, however it does not have the significant codes associated with each variable. Other than manually going down each row to identify what variables are significant in excel, is there an easy way to add another column in my excel file displaying the significant *'s?

Lastly, how would I rerun the model using just the significant variables that have been identified?

Thanks in advance for your help.

You could use the broom package to make a neat data frame of the fit summary and add a column with the significance symbols.

set.seed(123)
DF <- data.frame(X1 = rnorm(25),
                 X2 = runif(25,min = 5, max = 15))
DF$Y <- DF$X1 * 2.5 - DF$X2 * 0.95 + 6 + rnorm(25, mean = 0, sd = 4)
FIT <- lm(Y ~ X1 + X2, data = DF)
summary(FIT)
#> 
#> Call:
#> lm(formula = Y ~ X1 + X2, data = DF)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.5982 -2.8848 -0.6088  1.7808  6.9164 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)   5.9890     2.9948   2.000  0.05802 . 
#> X1            2.9134     0.8674   3.359  0.00284 **
#> X2           -0.9436     0.2901  -3.253  0.00365 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.959 on 22 degrees of freedom
#> Multiple R-squared:  0.5471, Adjusted R-squared:  0.5059 
#> F-statistic: 13.29 on 2 and 22 DF,  p-value: 0.0001644

FIT_tidy <- broom::tidy(FIT)
FIT_tidy
#> # A tibble: 3 × 5
#>   term        estimate std.error statistic p.value
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)    5.99      2.99       2.00 0.0580 
#> 2 X1             2.91      0.867      3.36 0.00284
#> 3 X2            -0.944     0.290     -3.25 0.00365
library(dplyr)
FIT_tidy <- FIT_tidy |> 
  mutate(Sig. = case_when(
    p.value <= 0.001 ~ "***",
    p.value > 0.001 & p.value <= 0.01 ~ "**",
    p.value > 0.01 & p.value <= 0.05 ~ "*",
    p.value > 0.05 & p.value <= 0.10 ~ ".",
    TRUE ~""
  ))
FIT_tidy
#> # A tibble: 3 × 6
#>   term        estimate std.error statistic p.value Sig. 
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl> <chr>
#> 1 (Intercept)    5.99      2.99       2.00 0.0580  .    
#> 2 X1             2.91      0.867      3.36 0.00284 **   
#> 3 X2            -0.944     0.290     -3.25 0.00365 **

Created on 2022-11-15 with reprex v2.0.2

Awesome, that worked and now I have the significance codes in my excel file. Thank you! Now, how do I rerun the regression model with just those significant variables?

I added a third variable to the fit and wrote some code to construct a formula from only the "significant" variables.

set.seed(123)
DF <- data.frame(X1 = rnorm(25),
                 X2 = runif(25,min = 5, max = 15),
                 X3 = rnorm(25))
DF$Y <- DF$X1 * 2.5 - DF$X2 * 0.95 + 6 + rnorm(25, mean = 0, sd = 4)
FIT <- lm(Y ~ X1 + X2 + X3, data = DF)
summary(FIT)
#> 
#> Call:
#> lm(formula = Y ~ X1 + X2 + X3, data = DF)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -7.0346 -1.6025 -0.4424  2.4654  7.4917 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.0421     2.5775   3.508  0.00209 ** 
#> X1            2.7930     0.7504   3.722  0.00126 ** 
#> X2           -1.3054     0.2497  -5.229  3.5e-05 ***
#> X3           -0.2201     0.7340  -0.300  0.76725    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.407 on 21 degrees of freedom
#> Multiple R-squared:  0.7028, Adjusted R-squared:  0.6604 
#> F-statistic: 16.55 on 3 and 21 DF,  p-value: 9.465e-06
FIT_tidy <- broom::tidy(FIT)
FIT_tidy
#> # A tibble: 4 × 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    9.04      2.58      3.51  0.00209  
#> 2 X1             2.79      0.750     3.72  0.00126  
#> 3 X2            -1.31      0.250    -5.23  0.0000350
#> 4 X3            -0.220     0.734    -0.300 0.767
library(dplyr)

FIT_tidy <- FIT_tidy |> 
  mutate(Sig. = case_when(
    p.value <= 0.001 ~ "***",
    p.value > 0.001 & p.value <= 0.01 ~ "**",
    p.value > 0.01 & p.value <= 0.05 ~ "*",
    p.value > 0.05 & p.value <= 0.10 ~ ".",
    TRUE ~""
  ))
FIT_tidy
#> # A tibble: 4 × 6
#>   term        estimate std.error statistic   p.value Sig. 
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl> <chr>
#> 1 (Intercept)    9.04      2.58      3.51  0.00209   "**" 
#> 2 X1             2.79      0.750     3.72  0.00126   "**" 
#> 3 X2            -1.31      0.250    -5.23  0.0000350 "***"
#> 4 X3            -0.220     0.734    -0.300 0.767     ""

VARS <- FIT_tidy |> filter(p.value < 0.01, term != "(Intercept)") |> 
  select(term)
VARS
#> # A tibble: 2 × 1
#>   term 
#>   <chr>
#> 1 X1   
#> 2 X2
RHS <- paste(unlist(VARS), collapse = "+")
FormulaText <- paste("Y", "~", RHS, collapse = "")
FormulaText
#> [1] "Y ~ X1+X2"
Formula <- as.formula(FormulaText)
NewFIT <- lm(Formula, data = DF)
summary(NewFIT)
#> 
#> Call:
#> lm(formula = Formula, data = DF)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -6.7347 -1.6479 -0.3427  2.2394  7.6502 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.0427     2.5236   3.583  0.00166 ** 
#> X1            2.7703     0.7309   3.790  0.00100 ** 
#> X2           -1.3057     0.2444  -5.342 2.31e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.336 on 22 degrees of freedom
#> Multiple R-squared:  0.7015, Adjusted R-squared:  0.6744 
#> F-statistic: 25.86 on 2 and 22 DF,  p-value: 1.674e-06

Created on 2022-11-15 with reprex v2.0.2

I am tracking with you so far and it makes sense what you proposed. However, when I got to the Formula <- as.formula(FormulaText) line, it produced the error message below.

Error in str2lang(x) : :1:368: unexpected numeric constant

Is this due to how FormulaText was constructed?

What is the value of FormulaText in your code?

Got it figured out, had some dates that were displaying as numbers so relabeled those and it worked! Thank you for your help!

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.