Sorry for the delay. I'd like to back up from *how* to *what*, because school algebra, f(x) = y, provides a useful framework.
The three objects f, x and y (in R, everything is an object) are:
- x, what is at hand, in this case the data frame DF
- y, what is desired, the result we want to discuss
- f, the function to convert x to y
Objects in R are often composite; in particular, functions are often composed, as in f(g(h(x))).
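As a toy illustration (mine, not from the data below), composition reads inside out, innermost call first:

``` r
# h = abs, g = mean, f = sqrt: each result feeds the next call
sqrt(mean(abs(c(-4, 9, -16))))
```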
But rather than focusing on f for now, let's think about y.
We have a variable y within DF, and we are interested in the association between y and the other variables in DF: x (continuous) and d (categorical). One measure of association is provided by taking `lm()` as f.
Neglecting d for a moment, we can look at an ordinary least squares (linear) regression of y on x.
``` r
DF <- data.frame(
d = as.factor(c(
"D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1",
"D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1", "D1",
"D1", "D1", "D1", "D1", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2",
"D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2", "D2",
"D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3",
"D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3",
"D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3", "D3",
"D3"
)),
x = c(
1.560, 0.618, 1.350, 1.610, 0.984, 20.800, 6.300, 0.221, 6.880, 0.784, 0.444, 11.600, 0.986, 2.600, 3.380, 1.170, 7.550,
17.000, 29.200, 1.750, 93.000, 5.740, 2.630, 38.500, 7.190, 3.330, 33.500, 12.200, 7.950, 4.610, 29.800, 2.730, 12.400,
4.410, 1.690, 3.350, 6.410, 1.210, 51.200, 0.232, 0.171, 8.550, 0.330, 1.330, 68.500, 0.642, 0.485, 14.400, 22.000, 5.010,
13.700, 20.200, 9.870, 184.000, 13.800, 5.230, NA, 62.400, 17.400, 0.940, NA, 34.800, 5.250, 0.625, NA, 118.000, 3.320,
4.420, NA, 10.900, 20.000, 93.100, 32.300, 7.060, 22.600, NA, 74.200, 22.600, 17.100, 6.230, 2.580, 1.680, 1.200, NA, 31.100,
19.100, 83.300, 12.300, 14.200, 7.600, 3.840, 2.370, NA, 5.610, 51.900, 9.950, 13.400, 6.640, 1.690, 2.120, NA, NA, 45.900,
2.520, NA, 69.000, 1.280, 3.750, NA, 81.800, 9.360, 31.400, NA, 30.100, 9.940, 1.740, 104.000, 10.200, 3.080, NA
),
y = c(
0.025, 0.075, 0.270, 0.045, 0.045, 0.345, 0.080, 0.030, 0.115, 0.072, 0.025, 0.177, 0.025, 0.039, 0.123,
0.076, 0.041, 0.471, 0.378, 0.019, 0.599, 0.031, 0.012, 0.251, 0.047, 0.031, 0.713, 0.159, 0.030, 0.096,
0.306, 0.027, 0.124, 0.050, 0.029, 0.051, 0.115, 0.020, 0.654, 0.049, 0.037, 0.170, 0.063, 0.039, 0.977,
0.022, 0.049, 0.241, 0.250, 0.045, 0.219, 0.203, 0.149, 1.11, 0.093, 0.088, 0.061, 1.10, 0.059, 0.035,
0.017, 0.656, 0.071, 0.019, 0.025, 0.323, 0.042, 0.016, 0.037, 0.166, 0.355, 0.321, 0.268, 0.066, 0.124,
0.039, 0.252, 0.094, 0.115, 0.038, 0.028, 0.058, 0.035, 0.044, 0.447, 0.155, 0.474, 0.070, 0.061, 0.047,
0.025, 0.012, 0.030, 0.121, 0.309, 0.039, 0.109, 0.035, 0.036, 0.016, 0.015, 0.547, 0.277, 0.021, 0.010,
0.969, 0.123, 0.023, 0.020, 0.641, 0.129, 0.160, 0.029, 0.211, 0.043, 0.022, 1.28, 0.088, 0.015, 0.010
)
)
fit <- lm(y ~ x, data = DF)
summary(fit)
#>
#> Call:
#> lm(formula = y ~ x, data = DF)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.54542 -0.05770 -0.03207 0.01674 0.61532
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.0540089 0.0176702 3.057 0.00284 **
#> x 0.0069018 0.0005024 13.738 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1538 on 105 degrees of freedom
#> (13 observations deleted due to missingness)
#> Multiple R-squared: 0.6425, Adjusted R-squared: 0.6391
#> F-statistic: 188.7 on 1 and 105 DF, p-value: < 2.2e-16
# display the four standard diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(fit)
```
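As an aside, the "13 observations deleted due to missingness" note in the summary comes from the NA values in x; a quick check, assuming the DF defined above:

``` r
# count the NAs in each column; only x has missing values
colSums(is.na(DF))
```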

We will take fit at face value, even though the diagnostics show that the underlying assumptions of linear regression are unsatisfied for these data. (That's another discussion; here, we are concerned with a different aspect.) If the model did satisfy the requirements, we could read summary(fit) to find:
- The F-statistic for the model has a vanishingly low probability (p-value) of being observed due to random variation.
- Likewise, the t-statistic for the x term has a p-value indicating that its contribution is equally unlikely to be due to random variation.
- The fitted line has the given intercept on the y axis and a slope equal to the estimate for x; here, approximately y = 0.054 + 0.0069x. (These quantities can also be extracted from fit directly, as sketched below.)
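Those same quantities can be pulled out of fit programmatically rather than read off the printout; a minimal sketch using base R accessors, assuming the fit object from the first reprex:

``` r
coef(fit)                 # intercept and slope estimates
confint(fit)              # 95% confidence intervals for both
summary(fit)$r.squared    # proportion of variance explained
summary(fit)$fstatistic   # overall F statistic with its degrees of freedom
```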
However, this ignores d: the x points can be classified into three d buckets. Does it matter which?
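A quick way to see how the observations split across the three buckets, assuming DF from the first reprex:

``` r
# group sizes for the categorical variable d
table(DF$d)
```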
One way to look at it is to add d to the formula in lm. (The following omits DF from the reprex for brevity.)
``` r
# USE DF from first reprex
fit <- lm(y ~ x + d, data = DF)
summary(fit)
#>
#> Call:
#> lm(formula = y ~ x + d, data = DF)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.57710 -0.05713 -0.02887 0.01338 0.58666
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.0650519 0.0254412 2.557 0.012 *
#> x 0.0069560 0.0005143 13.526 <2e-16 ***
#> dD2 0.0142324 0.0393399 0.362 0.718
#> dD3 -0.0399107 0.0350181 -1.140 0.257
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1535 on 103 degrees of freedom
#> (13 observations deleted due to missingness)
#> Multiple R-squared: 0.6507, Adjusted R-squared: 0.6405
#> F-statistic: 63.95 on 3 and 103 DF, p-value: < 2.2e-16
```
In this iteration, the F-statistic p-value indicates that at least one of the predictors is useful compared to an intercept-only model, and the p-values of the t-statistics show that only x is (the unfortunately named) significant. Knowing whether d is D2 or D3 provides no useful information.
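The same conclusion can be reached by comparing the two nested models directly with a partial F-test. This is only a sketch, and it leans on the same normality assumption questioned below (assumes DF from the first reprex):

``` r
# does adding d improve on the x-only model?
fit_x  <- lm(y ~ x, data = DF)
fit_xd <- lm(y ~ x + d, data = DF)
anova(fit_x, fit_xd)   # partial F-test on the two extra d parameters
```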
Likewise for interactions. (The following omits DF from the reprex for brevity.)
``` r
fit <- lm(y ~ x*d, data = DF)
summary(fit)
#>
#> Call:
#> lm(formula = y ~ x * d, data = DF)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.51802 -0.05758 -0.02409 0.01314 0.61053
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.0565518 0.0292031 1.937 0.0556 .
#> x 0.0077897 0.0014898 5.229 9.26e-07 ***
#> dD2 0.0383699 0.0452015 0.849 0.3980
#> dD3 -0.0458228 0.0425569 -1.077 0.2842
#> x:dD2 -0.0014669 0.0016540 -0.887 0.3773
#> x:dD3 -0.0002165 0.0017161 -0.126 0.8999
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.1538 on 101 degrees of freedom
#> (13 observations deleted due to missingness)
#> Multiple R-squared: 0.6562, Adjusted R-squared: 0.6391
#> F-statistic: 38.55 on 5 and 101 DF, p-value: < 2.2e-16
```
This is similar. We could go on to subset `DF` to include only two of `D1`, `D2` or `D3`, but the results would be similar; the information about **d** provides no insight as to **y**. And neither `anova` nor `aov` is going to help, for the reason alluded to above: the residuals are not normally distributed. (See the Q-Q plot in the first `reprex`.)
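For completeness, the two-level subsetting mentioned above might look like the following sketch; `droplevels()` discards the now-empty factor level (assumes DF from the first reprex):

``` r
# refit using only the D1 and D2 groups
DF12 <- droplevels(subset(DF, d %in% c("D1", "D2")))
summary(lm(y ~ x + d, data = DF12))
```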
And *that* is the reason we need a non-parametric test. Kruskal-Wallis is the one we are looking at here, but there are [many, many more](https://cran.r-project.org/web/packages/PMCMRplus/vignettes/QuickReferenceGuide.html).
``` r
kruskal.test(y ~ d, data = DF)
#>
#> Kruskal-Wallis rank sum test
#>
#> data: y by d
#> Kruskal-Wallis chi-squared = 0.89347, df = 2, p-value = 0.6397
```
What does this tell us?
`kruskal.test` performs a Kruskal-Wallis rank sum test of the null hypothesis that the location parameters of the distribution of y are the same in each group (sample) defined by d. The alternative is that they differ in at least one.
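Because the test compares locations, the group medians of y are the natural companion summary, assuming DF from the first reprex:

``` r
# median of y within each level of d
tapply(DF$y, DF$d, median)
```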
So, with a p-value of 0.6397, we cannot conclude that any of D1, D2 or D3 has a different location parameter. We should have seen this coming. (The following omits DF from the reprex for brevity.)
``` r
# side-by-side boxplots of y within each level of d
boxplot(y ~ d, data = DF)
```

In formal terms, we fail to reject the null hypothesis, H_0, that the d groups are the same.
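Had the omnibus test rejected, a natural next step would be post-hoc pairwise comparisons; base R provides one, and the PMCMRplus vignette linked above catalogues many alternatives. A sketch, assuming DF from the first reprex:

``` r
# pairwise Wilcoxon rank sum tests with Holm adjustment for multiplicity
pairwise.wilcox.test(DF$y, DF$d, p.adjust.method = "holm")
```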
How do we turn this into something we can add to an x-y plot? Not in a way that makes sense, unless we want to plot the three lm models of y ~ x. Going back to visualization will help. (The following omits DF from the reprex for brevity.)
``` r
library(ggplot2)
ggplot(DF, aes(x, y, color = d)) + geom_smooth(method = "lm", se = FALSE) + theme_minimal()
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 13 rows containing non-finite values (stat_smooth).
```
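For the record, the three per-group lines that geom_smooth() draws can be computed directly; a sketch, assuming DF from the first reprex:

``` r
# intercept and slope of lm(y ~ x) within each level of d
by(DF, DF$d, function(g) coef(lm(y ~ x, data = g)))
```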
