Non-parametric test for testing linear regression slopes for significant differences

Hi there,

Does anyone know a non-parametric test for testing the significant difference between slopes?

I have two independent continuous variables divided into three decades (categories). I am specifically looking for a non-parametric test because I am unable to get my data to be normally distributed and their variances are not the same. And I want to find a p-value for each two decades compared, e.g. decades 1 slope to decade 2 slope, decade 1 slope to decade 3 slope, and decade 2 slope to decade 3 slope.

Here is the data I am working with:

df <- data.frame(
  "d" =  c("D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1",
           "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1",
           "D1", "D1", "D1", "D1", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", 
           "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", 
           "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", 
           "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", 
           "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", 
           "D3"),
  "x" = c(1.560,0.618,1.350,1.610,0.984,20.800,6.300,0.221,6.880,0.784,0.444,11.600,0.986,2.600,3.380,1.170,7.550,
          17.000,29.200,1.750,93.000,5.740,2.630,38.500,7.190,3.330,33.500,12.200,7.950,4.610,29.800,2.730,12.400,
          4.410,1.690,3.350,6.410,1.210,51.200,0.232,0.171,8.550,0.330,1.330,68.500,0.642,0.485,14.400,22.000,5.010,
          13.700,20.200,9.870,184.000,13.800,5.230,NA,62.400,17.400,0.940,NA,34.800,5.250,0.625,NA,118.000,3.320,
          4.420,NA,10.900,20.000,93.100,32.300,7.060,22.600,NA,74.200,22.600,17.100,6.230,2.580,1.680,1.200,NA,31.100,
          19.100,83.300,12.300,14.200,7.600,3.840,2.370,NA,5.610,51.900,9.950,13.400,6.640,1.690,2.120,NA,NA,45.900,
          2.520,NA,69.000,1.280,3.750,NA,81.800,9.360,31.400,NA,30.100,9.940,1.740,104.000,10.200,3.080,NA),
  "y" = c(0.025, 0.075, 0.270, 0.045, 0.045, 0.345, 0.080, 0.030, 0.115, 0.072, 0.025, 0.177, 0.025, 0.039, 0.123,
          0.076, 0.041, 0.471, 0.378, 0.019, 0.599, 0.031, 0.012, 0.251, 0.047, 0.031, 0.713, 0.159, 0.030, 0.096,
          0.306, 0.027, 0.124, 0.050, 0.029, 0.051, 0.115, 0.020, 0.654, 0.049, 0.037, 0.170, 0.063, 0.039, 0.977,
          0.022, 0.049, 0.241, 0.250, 0.045, 0.219, 0.203, 0.149, 1.11, 0.093, 0.088, 0.061, 1.10, 0.059, 0.035,
          0.017, 0.656, 0.071, 0.019, 0.025, 0.323, 0.042, 0.016, 0.037, 0.166, 0.355, 0.321, 0.268, 0.066, 0.124,
          0.039, 0.252, 0.094, 0.115, 0.038, 0.028, 0.058, 0.035, 0.044, 0.447, 0.155, 0.474, 0.070, 0.061, 0.047,
          0.025, 0.012, 0.030, 0.121, 0.309, 0.039, 0.109, 0.035, 0.036, 0.016, 0.015, 0.547, 0.277, 0.021, 0.010,
          0.969, 0.123, 0.023, 0.020, 0.641, 0.129, 0.160, 0.029, 0.211, 0.043, 0.022, 1.28, 0.088, 0.015, 0.010)
  
)

I was able to do this with an ANOVA test using this code:

library(lsmeans) # found code from : https://stats.stackexchange.com/questions/33013/what-test-can-i-use-to-compare-slopes-from-two-or-more-regression-models

d.interaction <- lm(y ~ x*d, data = df) 
anova(d.interaction)

#obtain slopes
d.interaction$coefficients
d.lst <- lstrends(d.interaction, "d", var="x")

#compare slopes
pairs(d.lst)

However, I need a non-parametric test. I was told a Kruskal-Wallis test would be the test I need but I cannot for the life of me to find a code that does something similar to this ANOVA code above does (specifically finding a p-values for each decade slope compared to one another). If I have to filter the data to only compare two decade slopes at a time, I'm ok doing that too (I can easily do that with my data).

Does anyone have an idea as to how I can achieve this? If you have any questions or need further explanation, please don't hesitate to ask!

Thanks in advance!

Not sure, but is this what you're looking for?

require(lsmeans)
#> Loading required package: lsmeans
#> Loading required package: emmeans
#> The 'lsmeans' package is now basically a front end for 'emmeans'.
#> Users are encouraged to switch the rest of the way.
#> See help('transition') for more information, including how to
#> convert old 'lsmeans' objects and scripts to work with 'emmeans'.

# avoid df and data, because some operations give precedence to the 
# built-in functions of the same name, treating them as closures\
# no need to quote names of variables "x" = 

DF <- data.frame(
  d = c(
    "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1",
    "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1",
    "D1", "D1", "D1", "D1", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2",
    "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2",
    "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3",
    "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3",
    "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3",
    "D3"
  ),
  x = c(
    1.560, 0.618, 1.350, 1.610, 0.984, 20.800, 6.300, 0.221, 6.880, 0.784, 0.444, 11.600, 0.986, 2.600, 3.380, 1.170, 7.550,
    17.000, 29.200, 1.750, 93.000, 5.740, 2.630, 38.500, 7.190, 3.330, 33.500, 12.200, 7.950, 4.610, 29.800, 2.730, 12.400,
    4.410, 1.690, 3.350, 6.410, 1.210, 51.200, 0.232, 0.171, 8.550, 0.330, 1.330, 68.500, 0.642, 0.485, 14.400, 22.000, 5.010,
    13.700, 20.200, 9.870, 184.000, 13.800, 5.230, NA, 62.400, 17.400, 0.940, NA, 34.800, 5.250, 0.625, NA, 118.000, 3.320,
    4.420, NA, 10.900, 20.000, 93.100, 32.300, 7.060, 22.600, NA, 74.200, 22.600, 17.100, 6.230, 2.580, 1.680, 1.200, NA, 31.100,
    19.100, 83.300, 12.300, 14.200, 7.600, 3.840, 2.370, NA, 5.610, 51.900, 9.950, 13.400, 6.640, 1.690, 2.120, NA, NA, 45.900,
    2.520, NA, 69.000, 1.280, 3.750, NA, 81.800, 9.360, 31.400, NA, 30.100, 9.940, 1.740, 104.000, 10.200, 3.080, NA
  ),
  y = c(
    0.025, 0.075, 0.270, 0.045, 0.045, 0.345, 0.080, 0.030, 0.115, 0.072, 0.025, 0.177, 0.025, 0.039, 0.123,
    0.076, 0.041, 0.471, 0.378, 0.019, 0.599, 0.031, 0.012, 0.251, 0.047, 0.031, 0.713, 0.159, 0.030, 0.096,
    0.306, 0.027, 0.124, 0.050, 0.029, 0.051, 0.115, 0.020, 0.654, 0.049, 0.037, 0.170, 0.063, 0.039, 0.977,
    0.022, 0.049, 0.241, 0.250, 0.045, 0.219, 0.203, 0.149, 1.11, 0.093, 0.088, 0.061, 1.10, 0.059, 0.035,
    0.017, 0.656, 0.071, 0.019, 0.025, 0.323, 0.042, 0.016, 0.037, 0.166, 0.355, 0.321, 0.268, 0.066, 0.124,
    0.039, 0.252, 0.094, 0.115, 0.038, 0.028, 0.058, 0.035, 0.044, 0.447, 0.155, 0.474, 0.070, 0.061, 0.047,
    0.025, 0.012, 0.030, 0.121, 0.309, 0.039, 0.109, 0.035, 0.036, 0.016, 0.015, 0.547, 0.277, 0.021, 0.010,
    0.969, 0.123, 0.023, 0.020, 0.641, 0.129, 0.160, 0.029, 0.211, 0.043, 0.022, 1.28, 0.088, 0.015, 0.010
  )
)

d.interaction <- 
  lm(y ~ x*d, data = DF) 
anova(d.interaction)
#> Analysis of Variance Table
#> 
#> Response: y
#>            Df Sum Sq Mean Sq  F value Pr(>F)    
#> x           1 4.4615  4.4615 188.7313 <2e-16 ***
#> d           2 0.0565  0.0282   1.1949 0.3070    
#> x:d         2 0.0382  0.0191   0.8076 0.4488    
#> Residuals 101 2.3876  0.0236                    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#obtain slopes
d.interaction$coefficients
#>   (Intercept)             x           dD2           dD3         x:dD2 
#>  0.0565518400  0.0077897312  0.0383699010 -0.0458228089 -0.0014668737 
#>         x:dD3 
#> -0.0002164899
d.lst <- lstrends(d.interaction, "d", var="x")

#compare slopes
pairs(d.lst)
#>  contrast  estimate      SE  df t.ratio p.value
#>  D1 - D2   0.001467 0.00165 101  0.887  0.6498 
#>  D1 - D3   0.000216 0.00172 101  0.126  0.9913 
#>  D2 - D3  -0.001250 0.00111 101 -1.122  0.5027 
#> 
#> P value adjustment: tukey method for comparing a family of 3 estimates


kruskal.test(d ~ x, data = DF) 
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  d by x
#> Kruskal-Wallis chi-squared = 103.35, df = 104, p-value = 0.4996

Created on 2021-01-10 by the reprex package (v0.3.0.9001)

Thanks for your response.

I'll be honest, stats is not my strong suit. Does Kruskal-Wallis test look at the x-y slope, not just the x values?

I want to specifically look at the p-value for each regression comparison, e.g. p-value for D1 - D2, D1 - D3, and D2 - D3. From searching online, I've seen the regression coefficient and constants compared between two slopes but no R coding for it or a non-parametric test.

Is this even possible?

Thanks.

Sorry for the delay. I'd like to back up from how to what because school algebra, f(x) = y provides a useful framework.

The three objects, f, x, y (in R, everything is an object) are

x, what is at hand, in this case the data frame DF
y, what is desired, to discuss
f, the function to convert x to y

Objects in R are often composite—in particular, functions are often composed, as in f(g(h)

But, rather than focusing for now on f, let's think about y.

We have a variable within DF, y and we are interested in the association between y and the variables within DF x (continuous) and d (catgegorical). One measure of association is provided by the f lm.

Neglecting d for a moment, we can look an ordinary least squares (linear) regression of y on x

DF <- data.frame(
  d = as.factor(c(
    "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1",
    "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1",
    "D1", "D1", "D1", "D1", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2",
    "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2",
    "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3",
    "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3",
    "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3",
    "D3"
  )),
  x = c(
    1.560, 0.618, 1.350, 1.610, 0.984, 20.800, 6.300, 0.221, 6.880, 0.784, 0.444, 11.600, 0.986, 2.600, 3.380, 1.170, 7.550,
    17.000, 29.200, 1.750, 93.000, 5.740, 2.630, 38.500, 7.190, 3.330, 33.500, 12.200, 7.950, 4.610, 29.800, 2.730, 12.400,
    4.410, 1.690, 3.350, 6.410, 1.210, 51.200, 0.232, 0.171, 8.550, 0.330, 1.330, 68.500, 0.642, 0.485, 14.400, 22.000, 5.010,
    13.700, 20.200, 9.870, 184.000, 13.800, 5.230, NA, 62.400, 17.400, 0.940, NA, 34.800, 5.250, 0.625, NA, 118.000, 3.320,
    4.420, NA, 10.900, 20.000, 93.100, 32.300, 7.060, 22.600, NA, 74.200, 22.600, 17.100, 6.230, 2.580, 1.680, 1.200, NA, 31.100,
    19.100, 83.300, 12.300, 14.200, 7.600, 3.840, 2.370, NA, 5.610, 51.900, 9.950, 13.400, 6.640, 1.690, 2.120, NA, NA, 45.900,
    2.520, NA, 69.000, 1.280, 3.750, NA, 81.800, 9.360, 31.400, NA, 30.100, 9.940, 1.740, 104.000, 10.200, 3.080, NA
  ),
  y = c(
    0.025, 0.075, 0.270, 0.045, 0.045, 0.345, 0.080, 0.030, 0.115, 0.072, 0.025, 0.177, 0.025, 0.039, 0.123,
    0.076, 0.041, 0.471, 0.378, 0.019, 0.599, 0.031, 0.012, 0.251, 0.047, 0.031, 0.713, 0.159, 0.030, 0.096,
    0.306, 0.027, 0.124, 0.050, 0.029, 0.051, 0.115, 0.020, 0.654, 0.049, 0.037, 0.170, 0.063, 0.039, 0.977,
    0.022, 0.049, 0.241, 0.250, 0.045, 0.219, 0.203, 0.149, 1.11, 0.093, 0.088, 0.061, 1.10, 0.059, 0.035,
    0.017, 0.656, 0.071, 0.019, 0.025, 0.323, 0.042, 0.016, 0.037, 0.166, 0.355, 0.321, 0.268, 0.066, 0.124,
    0.039, 0.252, 0.094, 0.115, 0.038, 0.028, 0.058, 0.035, 0.044, 0.447, 0.155, 0.474, 0.070, 0.061, 0.047,
    0.025, 0.012, 0.030, 0.121, 0.309, 0.039, 0.109, 0.035, 0.036, 0.016, 0.015, 0.547, 0.277, 0.021, 0.010,
    0.969, 0.123, 0.023, 0.020, 0.641, 0.129, 0.160, 0.029, 0.211, 0.043, 0.022, 1.28, 0.088, 0.015, 0.010
  )
)

fit <- lm(y ~ x, data = DF)

summary(fit)
#> 
#> Call:
#> lm(formula = y ~ x, data = DF)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.54542 -0.05770 -0.03207  0.01674  0.61532 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 0.0540089  0.0176702   3.057  0.00284 ** 
#> x           0.0069018  0.0005024  13.738  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1538 on 105 degrees of freedom
#>   (13 observations deleted due to missingness)
#> Multiple R-squared:  0.6425, Adjusted R-squared:  0.6391 
#> F-statistic: 188.7 on 1 and 105 DF,  p-value: < 2.2e-16

par(mfrow = c(2,2))
plot(fit)

We will take fit at face value, even though the diagnostics show that underlying assumptions of linear regression are unsatisfied for these data. (That's another discussion; here, we are concerned with a different aspect.) If the model did satisfy the requirements, we could read summary(fit) to find

  1. The F-statistic for the model has a vanishingly low probability (p-value) of being observed due to random variation.
  2. The contribution of the x variable, likewise, has a p-value of its test statistic that has an equally low likelihood of being due to random variation.
  3. The line that passes through the points on the x//y axis has a given intercept on the y axis with a slope that can be calculated with the estimate of x

However, this ignores d—the x points can be classified into three d buckets. Does it matter which?

One way to look at it is to add d to the formula in lm. (The following omits DF from the reprex for brevity.)

# USE DF from first reprex
fit <- lm(y ~ x + d, data = DF)

summary(fit)
#> 
#> Call:
#> lm(formula = y ~ x + d, data = DF)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.57710 -0.05713 -0.02887  0.01338  0.58666 
#> 
#> Coefficients:
#>               Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.0650519  0.0254412   2.557    0.012 *  
#> x            0.0069560  0.0005143  13.526   <2e-16 ***
#> dD2          0.0142324  0.0393399   0.362    0.718    
#> dD3         -0.0399107  0.0350181  -1.140    0.257    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1535 on 103 degrees of freedom
#>   (13 observations deleted due to missingness)
#> Multiple R-squared:  0.6507, Adjusted R-squared:  0.6405 
#> F-statistic: 63.95 on 3 and 103 DF,  p-value: < 2.2e-16

In this iteration, the F-statistic p-value indicates that at least one of the parameters is useful compared to a model with no parameters, and the p-values for the t-values show that only x is (the unfortunately named) significant. Knowing whether d is type D2 or D3 provides no useful information.

Likewise for interactions. ( (The following omits DF from the reprex for brevity.)

fit <- lm(y ~ x*d, data = DF)

summary(fit)
#>
#> Call:
#> lm(formula = y ~ x * d, data = DF)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.51802 -0.05758 -0.02409 0.01314 0.61053
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.0565518 0.0292031 1.937 0.0556 .
#> x 0.0077897 0.0014898 5.229 9.26e-07 ***
#> dD2 0.0383699 0.0452015 0.849 0.3980
#> dD3 -0.0458228 0.0425569 -1.077 0.2842
#> x:dD2 -0.0014669 0.0016540 -0.887 0.3773
#> x:dD3 -0.0002165 0.0017161 -0.126 0.8999
#> ---
#> Signif. codes: 0 '' 0.001 '' 0.01 '' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1538 on 101 degrees of freedom
#> (13 observations deleted due to missingness)
#> Multiple R-squared: 0.6562, Adjusted R-squared: 0.6391
#> F-statistic: 38.55 on 5 and 101 DF, p-value: < 2.2e-16

This is similar. We could go on to subset `DF` to include only two of  `D1, D2 or D3`, but the results would be similar; the information about **d** provides no insight as to **y**.  And  neither`anova` nor `aov` are going to help either for the reason alluded to above&mdash;the residuals are not normally distributed. (See the `qqplot` in the first `reprex`.)

And *that* is the reason we need a non-parametric test&mdash;Kruskal-Wallis is the one we are looking at, but there are [many, many more](https://cran.r-project.org/web/packages/PMCMRplus/vignettes/QuickReferenceGuide.html)
(See the `qqplot` in the first `reprex`.)

``` r
kruskal.test(y ~ d, data = DF) 
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  y by d
#> Kruskal-Wallis chi-squared = 0.89347, df = 2, p-value = 0.6397

What does this tell us?

kruskal.test performs a Kruskal-Wallis rank sum test of the null that the location parameters of the distribution of x are the same in each group (sample). The alternative is that they differ in at least one.

So, at least one of D1, D2 or D3 has a different location parameter. (We should have seen this coming). The following omits DF from the reprex for brevity.

boxplot(y ~ d, data = DF)

In formal terms, we reject the null hypothesis, H_0 that the d groups are the same.

How do we turn this into something we can add to an x-y plot? Not in a way that makes sense unless we want to plot the three lm models of y˜x. Going back to visualization will help. The following omits DF from the reprex for brevity.


ggplot(DF,aes(x,y,color = d)) + geom_smooth(method = "lm", se = FALSE) + theme_minimal()
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 13 rows containing non-finite values (stat_smooth).