Problems using regression coefficients

Hello everyone,

For a research project, I am supposed to build a regression model from samples that were calculated by a CFD simulation and then carry out an optimisation with it. I fitted the model with lm(...). According to R, it has an R^2 of 0.9 and an adjusted R^2 of 0.87, and with 10-fold cross-validation I get an R^2 of 0.8 and an RMSE of 0.02, which corresponds to about 2%.

For the optimisation I used the package NSGA2R. For this I had to copy the coefficients of the regression function by hand and rename the independent variables to x[1], x[2], and so on.
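As a minimal sketch of an alternative to copying coefficients by hand, the fitted model can be wrapped in a function of x that calls predict(). The toy data, the column names u and v, and the function name objective are assumptions made up for illustration; they are not the variables from this project.

```r
# Toy data; u and v are hypothetical stand-ins for the real design parameters.
set.seed(1)
toy <- data.frame(u = runif(30), v = runif(30))
toy$y <- 1 + 2 * toy$u - 3 * toy$v + 0.5 * toy$u * toy$v + rnorm(30, sd = 0.01)

fit <- lm(y ~ u * v + I(u^2), data = toy)

# Objective in the x[1], x[2] form an optimiser expects. predict() uses the
# full-precision coefficients and rebuilds the interaction and squared terms
# itself, so nothing has to be retyped.
objective <- function(x) {
  newdata <- data.frame(u = x[1], v = x[2])
  unname(predict(fit, newdata = newdata))
}

objective(c(0.3, 0.7))
```

Evaluating objective() at one of the training points and comparing it with fitted(fit) is a quick consistency check before handing the function to the optimiser.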

My problem is that the results of the optimisation differ quite a lot from the validation values. When I checked the regression function in Excel, I found that its values deviate far too much from the simulation results for the reported R^2 to be correct. In the attached image, the simulation results are shown in blue and the values calculated with the regression are shown in yellow. These are the same data points that were used to fit the regression.

My approach to recreate the regression function was to multiply the coefficients by the independent variables, so with the example from the attached code I would get the following function:
DPM = 1.1934 - 0.3384 * d_Slength + 0.3137 * S_Cpos + (...) + 0.39896 * d_Slength * S_Cpos + 1.026....

Is this correct at all or do I have to divide the parameters for the mixed terms or adjust them in some other way?

Many thanks for your help and best regards!

Call:
lm(formula = DPM ~ d_Slength + S_Cpos + S_Thickn + d_Salpha + 
    d_Scpos + S_Length + d_Rthickn + d_Rlenght + d_Slength * 
    S_Cpos + d_Slength * S_Thickn + d_Slength * d_Salpha + d_Slength * 
    d_Scpos + d_Slength * S_Length + d_Slength * d_Rthickn + 
    d_Slength * d_Rlenght + S_Cpos * S_Thickn + S_Cpos * d_Salpha + 
    S_Cpos * d_Scpos + S_Cpos * S_Length + S_Cpos * d_Rthickn + 
    S_Cpos * d_Rlenght + S_Thickn * d_Salpha + S_Thickn * d_Scpos + 
    S_Thickn * S_Length + S_Thickn * d_Rthickn + S_Thickn * d_Rlenght + 
    d_Salpha * d_Scpos + d_Salpha * S_Length + d_Salpha * d_Rthickn + 
    d_Salpha * d_Rlenght + d_Scpos * S_Length + d_Scpos * d_Rthickn + 
    d_Scpos * d_Rlenght + S_Length * d_Rthickn + S_Length * d_Rlenght + 
    d_Rthickn * d_Rlenght + I(S_Cpos^2) + I(d_Slength^2) + I(S_Thickn^2), 
    data = DPM_Datenkomplett)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.055933 -0.011381  0.000643  0.009256  0.058509 

Coefficients:
                       Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)          1.19343608  0.04592284  25.988 < 0.0000000000000002 ***
d_Slength           -0.33835784  0.03673592  -9.211  0.00000000000000179 ***
S_Cpos               0.31373074  0.09257402   3.389             0.000962 ***
S_Thickn             1.00152539  0.34618128   2.893             0.004564 ** 
d_Salpha             0.01583747  0.00697137   2.272             0.024959 *  
d_Scpos              0.09200863  0.03938820   2.336             0.021227 *  
S_Length            -0.00118119  0.00023732  -4.977  0.00000228444905966 ***
d_Rthickn            0.05064585  0.04087589   1.239             0.217863    
d_Rlenght           -0.09785938  0.04161753  -2.351             0.020405 *  
I(S_Cpos^2)         -0.22322571  0.07364307  -3.031             0.003010 ** 
I(d_Slength^2)      -0.00374295  0.02470567  -0.152             0.879845    
I(S_Thickn^2)       -0.77449438  1.06629556  -0.726             0.469105    
d_Slength:S_Cpos     0.39895926  0.03916982  10.185 < 0.0000000000000002 ***
d_Slength:S_Thickn   1.02664204  0.15632897   6.567  0.00000000155059495 ***
d_Slength:d_Salpha   0.02368125  0.00377089   6.280  0.00000000621571566 ***
d_Slength:d_Scpos    0.07567882  0.02313591   3.271             0.001414 ** 
d_Slength:S_Length  -0.00034069  0.00019462  -1.751             0.082697 .  
d_Slength:d_Rthickn -0.05231748  0.02185388  -2.394             0.018284 *  
d_Slength:d_Rlenght -0.07596702  0.02423901  -3.134             0.002188 ** 
S_Cpos:S_Thickn     -1.53697942  0.23259719  -6.608  0.00000000127085466 ***
S_Cpos:d_Salpha     -0.03905314  0.00619623  -6.303  0.00000000557503180 ***
S_Cpos:d_Scpos      -0.08924233  0.03563258  -2.505             0.013663 *  
S_Cpos:S_Length      0.00153957  0.00029551   5.210  0.00000083947182373 ***
S_Cpos:d_Rthickn     0.00417650  0.03733823   0.112             0.911133    
S_Cpos:d_Rlenght     0.14838299  0.03797772   3.907             0.000158 ***
S_Thickn:d_Salpha   -0.04721221  0.02402536  -1.965             0.051814 .  
S_Thickn:d_Scpos    -0.42548721  0.15904291  -2.675             0.008555 ** 
S_Thickn:S_Length    0.00204608  0.00126093   1.623             0.107398    
S_Thickn:d_Rthickn   0.05897616  0.14923122   0.395             0.693427    
S_Thickn:d_Rlenght   0.36554665  0.14274810   2.561             0.011739 *  
d_Salpha:d_Scpos    -0.01532567  0.00369658  -4.146  0.00006502238714769 ***
d_Salpha:S_Length    0.00011998  0.00003445   3.482             0.000703 ***
d_Salpha:d_Rthickn   0.00118219  0.00362268   0.326             0.744767    
d_Salpha:d_Rlenght  -0.00185424  0.00370562  -0.500             0.617759    
d_Scpos:S_Length     0.00020800  0.00019323   1.076             0.283985    
d_Scpos:d_Rthickn    0.00738574  0.02376699   0.311             0.756548    
d_Scpos:d_Rlenght    0.06390154  0.02226765   2.870             0.004891 ** 
S_Length:d_Rthickn  -0.00046057  0.00020336  -2.265             0.025398 *  
S_Length:d_Rlenght  -0.00031854  0.00021703  -1.468             0.144915    
d_Rthickn:d_Rlenght -0.01644057  0.02115912  -0.777             0.438756    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01982 on 115 degrees of freedom
Multiple R-squared:  0.9016,	Adjusted R-squared:  0.8682 
F-statistic: 27.01 on 39 and 115 DF,  p-value: < 0.00000000000000022

My first thought is that the * symbol in an lm formula does not represent simple multiplication, but factor crossing.
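A small sketch of what crossing means in practice, on made-up data (the data frame toy and its columns a and b are illustrative assumptions): a * b in a formula expands to a + b + a:b, and for numeric predictors the a:b column of the design matrix is the elementwise product of the two variables.

```r
set.seed(42)
toy <- data.frame(a = rnorm(20), b = rnorm(20))
toy$y <- 1 + 2 * toy$a - toy$b + 0.5 * toy$a * toy$b + rnorm(20, sd = 0.1)

# Terms that the formula y ~ a * b expands to
attr(terms(y ~ a * b), "term.labels")
# "a"   "b"   "a:b"

fit <- lm(y ~ a * b, data = toy)
X <- model.matrix(fit)

# For numeric predictors the interaction column is just a times b
all.equal(unname(X[, "a:b"]), toy$a * toy$b)
# TRUE
```

So for numeric variables, multiplying the a:b coefficient by the product of the two variables is the matching hand calculation. With factor predictors the interaction would instead be built from dummy columns.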

Thanks for your answer. My intention in using d_Slength * S_Cpos within the lm formula was to include the two-way interaction term of the two variables in the regression. I thought this was working, because I got the respective 'd_Slength:S_Cpos' term and its coefficient in the summary.
Is this really the two-way interaction term, or is it something else?
And if it is, is it correct to use it as Coefficient * d_Slength * S_Cpos, where * stands for multiplication, to reproduce the regression function?

This is really hard to understand without access to your data and your code.

I think it depends on how literal or interpretive you are when you construct your own evaluation of the regression.
Does this help?



library(dplyr)  # provides %>% and mutate()

# y is an exact function of a, b and a*b, so the full model recovers 5, 2 and 3
dat <- data.frame(a = 1:10,
                  b = c(1, 3, 2, 4, 5, 6, 8, 7, 9, 10)) %>%
  mutate(y = 5*a + b*2 + 3*a*b + 20)

lm(y ~ a, data = dat)            # a 39.62
lm(y ~ b, data = dat)            # b 39.55
lm(y ~ a + b, data = dat)        # a 21.54, b 18.54
lm(y ~ a + b:a, data = dat)      # a 6.197, a:b 3.069
lm(y ~ a + b:a + b, data = dat)  # a 5, b 2, a:b 3
lm(y ~ a*b, data = dat)          # a 5, b 2, a:b 3, i.e. same as above
lm(y ~ a + a*b, data = dat)      # a 5, b 2, a:b 3, i.e. same as above
(lm1 <- lm(y ~ b + a*b, data = dat))  # a 5, b 2, a:b 3, i.e. same as above, altered order only

Would you know that the last four are the same calculation, or would you have tried differing calculations in order to do the equivalent of predict() on the lm?
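As a rough sketch of "the equivalent of predict()", here are two ways to evaluate such a model by hand on the same toy data, rebuilt in base R so the block runs on its own. The object names fit, manual and by_matrix are only illustrative.

```r
dat <- data.frame(a = 1:10, b = c(1, 3, 2, 4, 5, 6, 8, 7, 9, 10))
dat$y <- 5 * dat$a + 2 * dat$b + 3 * dat$a * dat$b + 20

fit <- lm(y ~ a * b, data = dat)
cf <- coef(fit)  # named vector: (Intercept), a, b, a:b

# Written out term by term: each coefficient times its variable, and the
# interaction coefficient times the product of the two variables
manual <- cf[["(Intercept)"]] + cf[["a"]] * dat$a + cf[["b"]] * dat$b +
  cf[["a:b"]] * dat$a * dat$b

# The same calculation for any formula: design matrix times coefficients
by_matrix <- drop(model.matrix(fit) %*% coef(fit))

all.equal(manual, unname(predict(fit)))
all.equal(unname(by_matrix), unname(predict(fit)))
```

If a hand-built formula in Excel or elsewhere does not agree with predict() on the training data, comparing it term by term against the columns of model.matrix(fit) usually shows which term was built differently.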

Thanks for the extensive explanation. I am not sure I understood your intention correctly, but the reason I wrote the lm formula out in the way shown is that I have three objective functions in total, and they do not all use the same parameters for the first-order and higher-order terms.

The most relevant parameters were determined by doing a COI analysis of all 16 parameters in total.

However, what I figured out just now is that the values calculated by the regression get significantly better when I leave out the terms whose coefficients have a p-value > 0.05 (the p-values are listed in the summary output). I added the new points in grey to the diagram attached to this post.

This solves the initial problem I had, but it would imply that the R^2 value shown by the summary function cannot be valid for the regression based on all included coefficients. Does someone know on what basis this R^2 value is calculated?
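For reference, a minimal sketch of how the R^2 and adjusted R^2 reported by summary() for a model with an intercept can be reproduced from the residuals of the full fitted model, with every coefficient included. The data frame toy and its columns are made up for illustration.

```r
set.seed(7)
toy <- data.frame(a = rnorm(40), b = rnorm(40))
toy$y <- 1 + 2 * toy$a - toy$b + 0.5 * toy$a * toy$b + rnorm(40, sd = 0.5)
fit <- lm(y ~ a * b, data = toy)

# R^2 = 1 - residual sum of squares / total sum of squares about the mean
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((toy$y - mean(toy$y))^2)
r2 <- 1 - ss_res / ss_tot

# Adjusted R^2 penalises the number of predictors p (intercept not counted)
n <- nrow(toy)
p <- length(coef(fit)) - 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

all.equal(r2, summary(fit)$r.squared)
all.equal(adj_r2, summary(fit)$adj.r.squared)
```

Because the residuals come from fitted(fit), the reported R^2 describes the model exactly as R evaluates it. Plotting fitted(fit) against a hand evaluation of the same points is one way to see whether the two actually agree.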
