R code looping through two samples

I am trying to write an R code which looks through sample1 and sample 2 then find the probability of values in sample 2 which greater than values in sample 2. This is my code but i think i am missing something:

counter = 0
for (i in 1:30){
  if Length(sample1(sample2[i]>sample1])/length(sample1)
    counter = counter+1
  }

counter/30

data.sample

sample1 =c(103.85717,
103.14916,
113.90591,
104.91262,
90.84991,
97.60370,
101.22786,
102.18647,
97.69707,
95.88739,
109.02683,
92.12523,
106.21427,
98.92642,
90.83309,
93.55331,
102.26012,
99.50595,
109.58390,
99.52911,
106.94329,
98.57266,
95.04966,
94.62632,
105.90799,
103.72201,
98.21947,
99.34361,
103.06769,
105.33094)

sample2 <- c(84.91725,
100.62723,
88.79755,
104.33116,
98.37728,
97.46470,
94.56373,
96.59360,
95.56268,
90.70722,
90.80393,
102.37638,
100.52961,
99.52056,
106.59802,
85.82843,
92.39860,
100.09725,
98.49902,
103.80920,
92.75619,
108.97948,
107.33424,
94.47826,
107.20983,
97.42251,
91.63524,
95.62785,
98.37055,
88.43078)

Your code has a couple of syntax problems.

  1. There is no Length() function. Use length()
  2. The argument of if() has to be withing parentheses.
counter = 0
for (i in 1:30) {
  if (length(sample1[sample2[i]>sample1])/length(sample1) )
    counter = counter+1
}

counter/30

Thanks for this! I get 0.8 i am not sure if this correct.I think i need to loop through each sample then run a comparison. This is where i am stuck.

I am not sure what you are trying to calculate. The 0.8 is the fraction of elements in sample2 that are greater than at least one element in sample1. Another way to look at that is that the 6 smallest elements of sample2 are smaller than all of the elements of sample1. Is that what you want? If not, please explain your goal again.

I am trying to find a probability that if a random number was drawn from sample 1 and 2 that sample 2 will be higher has a higher number than sample 1. i don't think the code quite does this? Thanks

This code returns a vector that shows the fraction of sample1 that is smaller than each member of sample2. I used your bit of code from the if() statement to calculate that fraction, though there are other ways to do it. Can you work from there to determine what the probability is of sample2 > sample1 from a random draw?

sample1 =c(103.85717, 103.14916,113.90591, 104.91262,90.84991, 97.60370, 101.22786,
           102.18647,97.69707,95.88739, 109.02683, 92.12523, 106.21427, 98.92642,
           90.83309, 93.55331, 102.26012, 99.50595, 109.58390, 99.52911,
           106.94329, 98.57266, 95.04966, 94.62632,105.90799, 103.72201, 98.21947,
           99.34361, 103.06769, 105.33094)

sample2 <- c(84.91725,100.62723,88.79755,104.33116,98.37728,97.46470,94.56373,
             96.59360,95.56268,90.70722,90.80393,102.37638,100.52961,99.52056,
             106.59802,85.82843,92.39860,100.09725,98.49902,103.80920,92.75619,
             108.97948,107.33424,94.47826,107.20983,97.42251,91.63524,95.62785,
             98.37055, 88.43078)

counter = vector(mode = "numeric", length = 30)
for (i in 1:30) {
    counter[i] = length(sample1[sample2[i]>sample1])/length(sample1) 
}
counter
#>  [1] 0.00000000 0.50000000 0.00000000 0.73333333 0.33333333 0.23333333
#>  [7] 0.13333333 0.23333333 0.20000000 0.00000000 0.00000000 0.60000000
#> [13] 0.50000000 0.46666667 0.86666667 0.00000000 0.10000000 0.50000000
#> [19] 0.33333333 0.70000000 0.10000000 0.90000000 0.90000000 0.13333333
#> [25] 0.90000000 0.23333333 0.06666667 0.20000000 0.33333333 0.00000000

Created on 2019-10-23 by the reprex package (v0.3.0.9000)

I think this might do what you are hoping, unless I am not interpreting correctly...

tidyr::expand_grid(sample1, sample2) %>% 
  mutate(sample2_larger = if_else(sample2>sample1, 1, 0)) %>% 
  summarise(prob = mean(sample2_larger))

Hi @user124578,

Here is your data:

sample1 = c(103.85717, 103.14916, 113.90591, 104.91262, 90.84991, 97.60370,
            101.22786, 102.18647, 97.69707, 95.88739, 109.02683, 92.12523,
            106.21427, 98.92642, 90.83309, 93.55331, 102.26012, 99.50595,
            109.58390, 99.52911, 106.94329, 98.57266, 95.04966, 94.62632,
            105.90799, 103.72201, 98.21947, 99.34361, 103.06769, 105.33094)

sample2 = c(84.91725, 100.62723, 88.79755, 104.33116, 98.37728, 97.46470,
            94.56373, 96.59360, 95.56268, 90.70722, 90.80393, 102.37638,
            100.52961, 99.52056, 106.59802, 85.82843, 92.39860, 100.09725,
            98.49902, 103.80920, 92.75619, 108.97948, 107.33424, 94.47826,
            107.20983, 97.42251, 91.63524, 95.62785, 98.37055, 88.43078)

Assuming unpaired observations, I would apply bootstrapping like so:

> set.seed(785073)
> mean( sample(x = sample1, size = 1000, replace = TRUE) > sample(x = sample2, size = 1000, replace = TRUE) )
[1] 0.657

And if you want to account for seed differences,

> set.seed(785073)
> x = replicate(1000, mean( sample(x = sample1, size = 1000, replace = TRUE) > sample(x = sample2, size = 1000, replace = TRUE) ))
> mean(x)
[1] 0.660379
> sd(x)
[1] 0.0147705

So the probability of a randomly drawn value s_{1,i} from sample1 being larger than a randomly drawn value s_{2,j} in sample2 is P(s_{1,i} > s_{2,j}) \sim 66.0\% \pm 1.50\%

If you are really curious and want to pursue this further, then take a look at Receiver Operating Characteristic

Hope it helps :slightly_smiling_face:

Ps. if you are looking for the probability of observing sample1 and sample2, given that we assume that they are drawn from the same population, then depending on the data either a paired or an unpaired t-test would do the trick :+1:

1 Like

Many Thanks for this. I am looking for a random value choosen from each sample. I have some question about the code please. It's unclear to me. Why did you use seed, and sample and also set the size to 1000( i thought that would be 30 based on the sample size). Thanks

Hi @user124578,

Bootstrapping is sampling with replacement. Try to google set.seed or in the console try to issue the command ?set.seed. I suggest you try to play around with the code I gave you to see what's what. If I dissect everything for you, basically you'll learn very little. Yes, it'll take a longer time on your side, but your learning yield will be much greater :+1:

2 Likes

I am still a little confused about why there is a need for resampling, the code I provided above gives you the empirical probability of a random value from sample 2 being larger than a random value from sample 1.

You can do this by setting up a pair-wise grid of all the values from both sets. Then compare each pair of values and compute the mean. The result is an exact probability which is nearly same as @Leon's answer (difference of 0.003 and 0.0004 after replication), but without the need for any resampling. In fact, an answer based on resampling will asymptotically converge to the empirical probability provided by my code:

tidyr::expand_grid(sample1, sample2) %>% 
  mutate(sample2_larger = if_else(sample2>sample1, 1, 0)) %>% 
  summarise(prob = mean(sample2_larger))

Am I misinterpreting the question?

What is the expand grid? When running your code to check it doesn't seem to work i have an error message: Error: 'expand_grid' is not an exported object from 'namespace:tidyr'
Does it need a specfic library?

You may need the development version of tidyr which can be downloaded using this code:

remotes::install_github('tidyverse/tidyr')

Alternatively, you can just use the base R near-equivalent expand.grid()

expand.grid(sample1 = sample1, sample2 = sample2) %>% 
  mutate(sample2_larger = if_else(sample2>sample1, 1, 0)) %>% 
  summarise(prob = mean(sample2_larger))

Thanks! I cannot do the remotes::install_github('tidyverse/tidyr')

I can run the code using R base. The value that your code produce is different to @Leon solution. But, i think Leon solution is correct from using a different method as i am getting the same value for both.

The value my code produces is the probability you requested in the above comment. Leon's code produces the complementary probability (probability of sample 1 > sample 2). My code produces 0.34 and his code produces ~0.66 (i.e. 1-0.34).

1 Like

title: "Area Under the Curve Speed Test"

Benchmarking Area Under the Curve

Background

I had a very similar problem only last month. I created this R Markdown Notebook to track and test answers from a recent discussion on StackOverflow, where I asked:

I want to estimate the likelihood that a randomly-selected item from one group will have a higher score than a randomly-selected item from a different group. That is, the Probability of Superiority, sometimes called the Common-language Effect Size. See for example: https://rpsychologist.com/d3/cohend/. This can be resolved algebraically if we accept that the data are normally distributed McGraw and Wong (1992, Psychological Bulletin, 111, doi: 10.1037/0033-2909.111.2.361), but I know that my data are not normally distributed, making such an estimate unreliable.

The resulting discussion produced several alternatives, most of which were a great deal better than my first efforts.

This file runs benchmarking tests to see the relative speed at which each algorithm works with different sample sizes.

Setup

Load required libraries

# for analysis 
library(data.table)
library(bigstatsr)

# for plotting
library(ggplot2) # pretty plots
library(scales)  # commas in large numbers

# for recording & comparing runtimes
library(microbenchmark)

# for output to PDF & HTML
library(tinytex)

options(width = 90)

Functions to test

Eight alternative functions are benchmarked against each other and compared using a prespecified number of calculations with 100 iterations. Summary tables are presented for absolute and relative times. We then present these results with violin plots.

Functions are:

  1. My Original For-If-Else function
  2. Bring data together with expandgrid
  3. "CJ" (C)ross (J)oin. A data.table is formed from the cross product of the vectors
  4. Modified expandgrid function using similar sum routine as CJ
  5. DataTable cartesian product of data.table
  6. Wilcox from base R
  7. Modified Wilcox from base R
  8. BaseR AUC, taking advantage of properties of matrix outer product
  9. AUC from BigStatsR

My Original For-If-Else function


MyForIfElse.function <- function(){
  bigger  <- 0
  equal   <- 0.0
  smaller <- 0

  for (i in alpha) {
    for (j in beta) {
      if (i > j) {bigger <- bigger + 1}
      else
        if (i == j) {equal <- equal + 0.5}
    }
  }
  PaGTb <- (bigger + equal) / nPairs
  PaGTb
}

Expand.grid function

Myexpandgrid.function <- function(){
  # My second attempt
  c <- expand.grid(a = alpha, b = beta)
  aGTb <- length(which(c$a > c$b))
  aEQb <- length(which(c$a == c$b))
  aGTb <- aGTb + (0.5 * aEQb)
  PaGTb <- aGTb / nPairs
  PaGTb
}

Cross Join function


MyCJ.function <- function(){
  data.table::CJ(a = alpha, b = beta, sorted = F)[, {
    aGTb = sum(a > b)
    aEQb = sum(a == b)
    aGTb = aGTb + 0.5 * aEQb
    PaGTb = aGTb / nPairs #data.table returns last value in {}
    } ]
}

bigstatsr AUC function


MyAUC.function <- function() {
  MyAUC <-
    bigstatsr::AUC(c(alpha, beta), rep(1:0, c(length(alpha), length(beta))))
  MyAUC
}

DataTable function


MyDT.function <- function() {
  DTa <- data.table(a = alpha)[, .N, keyby = a]
  DTb <- data.table(b = beta)[, .N, keyby = b]
  MyDT <- (DTb[DTa, on = .(b < a), allow.cartesian = TRUE, 
               nomatch = 0L, sum(N*i.N)] +
      0.5 * DTb[DTa, on = .(b = a), allow.cartesian = TRUE, 
                nomatch = 0L, sum(N*i.N)]) / nPairs
  MyDT
}

Wilcoxon test function


MyWilcox.function <- function() {
  MyWilcox <- wilcox.test(alpha, beta)$statistic / nPairs
  MyWilcox
}

Modified Wilcox


wilcox_AUC <- function(){
  r <- rank(c(alpha, beta))
  
  n.x <- as.integer(length(alpha))
  n.y <- as.integer(length(beta))
  
  {sum(r[seq_along(alpha)]) - n.x * (n.x + 1) / 2} / n.x / n.y
}

Base AUC


AUC2 <- function() {
# tabulate does not include 0, therefore we need to
# normalize to positive values
  m <- min(c(alpha, beta))

  A_Freq = tabulate(alpha - min(m) + 1)
  B_Freq = tabulate(beta - min(m) + 1)

# calculate the outer product.  
  out_matrix <- outer(A_Freq, B_Freq)
  
  bigger = sum(out_matrix[lower.tri(out_matrix)])
  equal = 0.5 * sum(diag(out_matrix))
  (bigger + equal) / length(alpha) / length(beta)
}

Benchmark all of the Functions


MyBenchmark.function <- function(x) {
  mbm <<- microbenchmark(
    "ForIfElse"   = {MyForIfElse.function()},
    "expandgrid"  = {Myexpandgrid.function()},
    "CrossJoin"   = {MyCJ.function()},
    "CartesianDT" = {MyDT.function()},
    "Wilcox"      = {MyWilcox.function()},
    "BaseWilcox"  = {wilcox_AUC()},
    "BigStatsAUC" = {MyAUC.function()},
    "BaseAUC"     = {AUC2()},
    times = x,
    unit = "ns"
  )
}

Charting functions

Function to visualise the data distributions


MyDistPlot <- function(){
  a <- data.frame(val = alpha, id = "alpha")
  b <- data.frame(val = beta, id = "beta")
  c <- rbind(a,b)
  
  p2 <- ggplot(data = c, aes(x = val, group = id, fill = id)) +
    geom_density(adjust = 1.5, alpha = .2) +  
    theme_classic() +
    labs(title = NULL , 
         subtitle = paste("Sample sizes = ", MyRows),
         caption = paste("AUC = Alpha:", round(PaGTb, 2) * 100, 
                         ", Beta:", round(1 - PaGTb, 2) * 100,
                         "
                         Note truncated x-axis: Values can range ± 30")
    ) +
    theme(legend.position = c(0.9, 0.9)) +
    xlim(-8, 8)
  p2
}

Function to plot results of simulation


MyPlot.function <- function(){
  pl <- ggplot(mbm, aes(expr, time)) +
    geom_violin(scale = "width", trim = TRUE, 
                fill = "blue", alpha = 0.1, 
                colour = "#999999") +
    geom_jitter(height = 0, width = 0.2, 
                alpha = 0.1, 
                colour = "navy") +
    theme_classic() +
    scale_y_log10() +
    labs(title = "Speed-test: alternative Probability of Superiority calculations",
         subtitle = paste(comma(MyRows), " x ", 
                          comma(MyRows), " = ", 
                          comma(nPairs), "comparisons"),
         y = "Nanoseconds (log10)" , x = NULL ) +
    coord_flip()
  pl
}


Generate some artificial data

Generating data

In this example, two non-normal data sets are compared, so it's unclear which should have higher probability of superiority. Group Alpha is more platykurtic (spread out), with many large negative and positive values, while group Beta is a $\chi$$^2$ distribution (df=1), so highly skewed.


# for reproducability
 set.seed(42)

# shifted normal distribution cubed
MakeAlpha <- function(){rnorm(MyRows, mean = 0, sd = 1)**3}
# normal distribution squared and shifted
MakeBeta <- function(){rnorm(MyRows, mean = 0, sd = 1)**2 - 0.6}


Run the simulation

Change the number in MyRows for different sample sizes

Note: Large sample size can take a long time.


MyRows <- 25

Complete simulation is run 100 times.

(Reduce the number for a faster result)


alpha <- MakeAlpha()
beta <- MakeBeta()
nPairs <- length(alpha) * length(beta)

# For interest, we calculate the sample Probability of Superiority 
# (Area Under the Curve)
PaGTb <- AUC2()
MyDistPlot() 

Print summary results and visualise benchmarks


MyBenchmark.function(100)

# print comparison speed test table
print(mbm, unit = "ms")
print(mbm, unit = "relative")

# Visualise the comparison
MyPlot.function() 

Interpretation of small sample experiment

The original ForIfElse algorithm seems to work about as well as, or better than, the other options.

The Modified Wilcox and BaseR_AUC algorithm seem to perform best most often.


Rerun with larger sample size


MyRows <- 1000

 set.seed(42)

alpha <- MakeAlpha()
beta <- MakeBeta()
nPairs <- length(alpha) * length(beta)

PaGTb <- MyAUC.function()
MyDistPlot() 

MyBenchmark.function(100)

# print comparison benchmark tables
print(mbm, unit = "ms")
print(mbm, unit = "relative")

MyPlot.function() 

Interpretation of larger sample experiment

The Modified Wilcox, BigStatsR AUC and BaseR AUC algorithm seem to perform best most often.

ForIfElse alfgorithm is among the poorer performers now


Rerun with much larger sample size

This can take some time ... You may want to go make a cup of tea while it's running


MyRows <- 5000

 set.seed(42)

alpha <- MakeAlpha()
beta <- MakeBeta()
nPairs <- length(alpha) * length(beta)

PaGTb <- MyAUC.function()
MyDistPlot() 

MyBenchmark.function(100)

# print comparison benchmark tables
print(mbm, unit = "ms")
print(mbm, unit = "relative")

MyPlot.function() 

Interpretation of larger sample experiment

The BaseR_AUC, followed by Modified Wilcox and BigStatsR AUC seem to perform best most often.

BaseR_AUC functions at about 5000 times the speed of the ForIfElse algorithm.


Conclusions

The Area Under the Curve (AUC) function from the package, bigstatsr (Statistical Tools for Filebacked Big Matrices) by Florian Privé (Grenoble, France) was extraordinarily efficient.

The two close winners were created by Cole, who prefers to remain anonymous, with significant input from chinsoon12 (Singapore). These were:

  • a Wilcox test extracted from the original R-code to focus only on the single task, rather than a general statistical routine.
  • an AUC function that relies on table()

Both algorithms are attractive because they rely only on Base R.


1 Like