This problem is what caret::resamples and (even better) tidyposterior are focused on.
You might want to stay away from confidence intervals and go Bayesian (which is not difficult here). This paper (pdf) does a good job of explaining why you would want to do that, although I think their statistical model is too prescriptive. tidyposterior makes it easy to make real probability statements about the differences between models. You can't do that with confidence intervals because of how they are constructed.
Here's an example with your data. You can run
library(dplyr)
library(tidyposterior)
cv_result <- tribble(
  ~id, ~Algorithm_1, ~Algorithm_2,
  "1",  91.11, 90.7,
  "2",  90.48, 90.52,
  "3",  91.87, 90.88,
  "4",  90.52, 90.87,
  "5",  89.88, 90.02,
  "6",  89.77, 88.99,
  "7",  91.44, 90.98,
  "8",  90.88, 91.44,
  "9",  90.77, 90.77,
  "10", 90.89, 90.92
)
bayes_model <- perf_mod(cv_result, seed = 3806)
to get a model that compares the two algorithms, then use the summary methods to get the probability statements:
> # Credible intervals for each model
> bayes_model %>% tidy() %>% summary()
# A tibble: 2 x 4
model mean lower upper
<chr> <dbl> <dbl> <dbl>
1 Algorithm_1 12.7 1.66 24.4
2 Algorithm_2 12.6 1.81 24.3
>
> # Results on the difference in performance
> contrast_models(bayes_model, seed = 3451) %>% summary()
# A tibble: 1 x 9
contrast probability mean lower upper size pract_neg pract_equiv pract_pos
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Algorithm_1 vs Algorithm_2 0.528 0.0633 -1.80 1.90 0. NA NA NA
> # The prob that model 1 is better than model 2 is 52.8% (assuming that lower is better)
The ROPE estimates described in the paper can be used via the size argument to the summary method on contrast_models (see the website for examples).
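A minimal sketch of that, assuming a practical-effect size of 1 accuracy point (pick a value that is meaningful for your problem):

```r
library(dplyr)
library(tidyposterior)

# Passing `size` fills in the pract_neg / pract_equiv / pract_pos
# columns: pract_equiv is the posterior probability that the
# difference between the algorithms falls inside the ROPE
# (here, within +/- 1 accuracy point).
contrast_models(bayes_model, seed = 3451) %>%
  summary(size = 1)
```

A high pract_equiv would tell you that, even if one algorithm is nominally better, the difference is too small to matter in practice.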