Correctly Writing a Double Loop in R

I have the following dataset:

library(dplyr)
library(purrr)
set.seed(123)
my_data1 = data.frame(var1 =  rnorm(500,100,100), prop = runif(500, min=0, max=0.5))
my_data2 = data.frame(var1 = rnorm(500, 200, 50), prop = runif(500, min=0.4, max=0.7))
my_data = rbind(my_data1, my_data2)

I made ten equal-sized bins, found the average proportion within each bin, then plotted the results:

final = my_data %>%
arrange(var1) %>%
mutate(ntile = ntile(var1, 10)) %>%
group_by(ntile) %>%
summarise(mean = mean(prop))

plot(final$ntile, final$mean, type = "l", xlab = "Bin Number", ylab = "Average Proportion", main = "Relationship Between Bins and Average Proportion")

Now, suppose I:

  • Take a random 70% sample, create 10 bins and calculate the mean proportion within each of these bins
  • Then, from the remaining 30% of the data, I take another random sample, create 10 bins and calculate the mean proportion within each of these bins
  • Next, I take the sum of squared differences between the bin means from these two steps
  • Finally, repeat this process many times (while tracking the results)

Below, I wrote some R code for this procedure:

base = sample_n(my_data, 700)

base_comp = base %>%
            arrange(var1) %>%
            mutate(ntile = ntile(var1, 10)) %>%
            group_by(ntile) %>%
            summarise(mean = mean(prop))


sampling_frame = my_data %>% anti_join(base)

my_list = list()

for (i in 1:1000)
    
{
    
    a_i = sample_n(sampling_frame, 100)
    
    base_a_i = a_i %>%
    arrange(var1) %>%
    mutate(ntile = ntile(var1, 10)) %>%
    group_by(ntile) %>%
    summarise(mean = mean(prop))
    
    sum_i <- sum(map2_dbl(base_a_i$mean, base_comp$mean, function(x, y) (x - y)^2))
    
    my_list[[i]] <- data.frame(id = i, sum_i)
    print(data.frame(id = i, sum_i))
}

results = do.call(rbind.data.frame, my_list)

plot(density(results$sum_i), main = "Distribution of Deviations")
plot(results$id, results$sum_i, xlab = "Iteration", ylab = "Deviation", main = "Trace Plot", type = "b")

My Question: Now, I am trying to create a "double loop" - that is, I would like to:

  • For j = 1, randomly create "base_j" , "base_comp_j" and "sampling_frame_j"
  • Repeat the "i" loop 100 times
  • Take the average of all "sum_i" for j=1 (i.e. average_sum_i_j)
  • Now, repeat this for j = 2
  • Repeat until j = 100
  • Each point on these two graphs will be "average_sum_i_j"

Here is my attempt to do this:

#  change i and j index to 10 for brevity

my_list = list()

for (j in 1:10) {
    base_j = sample_n(my_data, 700)
    base_comp_j = base_j %>%
        arrange(var1) %>%
        mutate(ntile = ntile(var1, 10)) %>%
        group_by(ntile) %>%
        summarise(mean = mean(prop))
    sampling_frame_j = my_data %>% anti_join(base_j)
    sum_i_list = list()
    for (i in 1:10) {
        a_i = sample_n(sampling_frame_j, 100)
        base_a_i = a_i %>%
            arrange(var1) %>%
            mutate(ntile = ntile(var1, 10)) %>%
            group_by(ntile) %>%
            summarise(mean = mean(prop))
        sum_i <- sum(map2_dbl(base_a_i$mean, base_comp_j$mean, function(x, y) (x - y)^2))
        sum_i_list[[i]] <- sum_i
    }
    average_sum_i_j = mean(unlist(sum_i_list))
    my_list[[j]] <- data.frame(id_j = j, id_i = i, average_sum_i_j)
    print(data.frame(id_j = j, id_i = i, average_sum_i_j))
}

results = do.call(rbind.data.frame, my_list)

plot(density(results$average_sum_i_j), main = "Distribution of Deviations")
plot(results$id, results$average_sum_i_j, xlab = "Iteration", ylab = "Deviation", main = "Trace Plot", type = "b")

However, I don't think I am doing this right since the loop does not appear to be cycling through both indices.

Can someone show me how to correct this?

Thanks!

You can use nested loops to achieve this.

First, you can create an outer loop that runs from j = 1 to j = 100. Inside this outer loop, you can create a new random sample for "base_j", "base_comp_j" and "sampling_frame_j".

Then, you can create an inner loop that runs from i = 1 to i = 100, where you repeat the same process as before to calculate "sum_i" for each iteration.

Finally, you can take the average of all "sum_i" for a given j and store it in a variable "average_sum_i_j".

You can then plot the density of "average_sum_i_j", and plot "average_sum_i_j" against j as a trace plot, to visualize the results.
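One likely reason the attempt in the question looks as if it is not cycling through both indices is the bookkeeping rather than the loops themselves: "id_i = i" is recorded after the inner loop has finished, so it always holds the last value of i, and "results$id" does not match a single column. The small sketch below illustrates both points in base R. The numbers in the data frame are placeholders made up for illustration, not output from the real simulation.

```r
# Placeholder data frame with the same column names as in the question
# (the values are invented purely for illustration)
results <- data.frame(id_j = 1:3, id_i = 10,
                      average_sum_i_j = c(0.02, 0.03, 0.025))

# `$` uses partial matching; "id" matches both id_j and id_i,
# so the match is ambiguous and the result is NULL
is.null(results$id)

# After a for loop ends, the loop variable keeps its last value,
# which is why id_i is always the final inner index
for (i in 1:10) {}
i

# Referring to the outer index by its full name gives a usable trace plot
plot(results$id_j, results$average_sum_i_j, type = "b",
     xlab = "Iteration (j)", ylab = "Average Deviation", main = "Trace Plot")
```

So storing only "id_j" and the average, and plotting against "results$id_j", should be enough to make the output reflect every outer iteration.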

Here is an example of the modified code to achieve this:

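# Initialize an empty list to store the average sum_i for each j
average_sum_i_j_list <- list()

# Outer loop to run from j = 1 to j = 100
for (j in 1:100) {

    # Create random samples for base_j, base_comp_j and sampling_frame_j
    base_j <- sample_n(my_data, 700)
    base_comp_j <- base_j %>%
        arrange(var1) %>%
        mutate(ntile = ntile(var1, 10)) %>%
        group_by(ntile) %>%
        summarise(mean = mean(prop))
    sampling_frame_j <- my_data %>% anti_join(base_j)

    # Initialize an empty list to store the sum_i for each i for a given j
    sum_i_list <- list()

    # Inner loop to run from i = 1 to i = 100
    for (i in 1:100) {

        # Create a random sample of 100 from sampling_frame_j
        a_i <- sample_n(sampling_frame_j, 100)

        # Calculate base_a_i (same binning steps as in the question)
        base_a_i <- a_i %>%
            arrange(var1) %>%
            mutate(ntile = ntile(var1, 10)) %>%
            group_by(ntile) %>%
            summarise(mean = mean(prop))

        # Sum of squared differences between the two sets of bin means
        sum_i_list[[i]] <- sum(map2_dbl(base_a_i$mean, base_comp_j$mean, function(x, y) (x - y)^2))
    }

    # Average over the inner loop, stored with the outer index only
    average_sum_i_j <- mean(unlist(sum_i_list))
    average_sum_i_j_list[[j]] <- data.frame(id_j = j, average_sum_i_j)
}

# Combine the per-j results and plot them
results <- do.call(rbind.data.frame, average_sum_i_j_list)

plot(density(results$average_sum_i_j), main = "Distribution of Deviations")
plot(results$id_j, results$average_sum_i_j, xlab = "Iteration", ylab = "Deviation", main = "Trace Plot", type = "b")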
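
As an alternative to writing the two loops by hand, the same procedure can be wrapped in small functions. The sketch below is one possible arrangement, not the only way to do it. The helper names bin_means and one_outer_run are hypothetical, and it splits the data by row index instead of anti_join, which gives the same 700/300 split as long as rows are unique. The squared differences are computed with plain vectorised arithmetic instead of map2_dbl.

```r
library(dplyr)

# Same simulated data as in the question
set.seed(123)
my_data1 <- data.frame(var1 = rnorm(500, 100, 100), prop = runif(500, min = 0, max = 0.5))
my_data2 <- data.frame(var1 = rnorm(500, 200, 50), prop = runif(500, min = 0.4, max = 0.7))
my_data <- rbind(my_data1, my_data2)

# Hypothetical helper: mean of prop within each of n_bins equal-sized bins of var1
bin_means <- function(df, n_bins = 10) {
    df %>%
        mutate(ntile = ntile(var1, n_bins)) %>%
        group_by(ntile) %>%
        summarise(mean = mean(prop), .groups = "drop") %>%
        pull(mean)
}

# Hypothetical helper: one outer iteration (one value of j)
one_outer_run <- function(data, n_base = 700, n_sub = 100, n_inner = 10) {
    base_idx <- sample(nrow(data), n_base)      # rows used for the base sample
    base_means <- bin_means(data[base_idx, ])   # bin means of the base sample
    frame <- data[-base_idx, ]                  # remaining rows (sampling frame)

    # Inner loop: n_inner sub-samples, each compared with the base bin means
    sums <- replicate(n_inner, {
        a_i <- frame[sample(nrow(frame), n_sub), ]
        sum((bin_means(a_i) - base_means)^2)
    })
    mean(sums)                                  # average_sum_i_j for this j
}

# Outer loop: one averaged value per j (10 here for brevity)
n_outer <- 10
results <- data.frame(
    id_j = seq_len(n_outer),
    average_sum_i_j = vapply(seq_len(n_outer),
                             function(j) one_outer_run(my_data),
                             numeric(1))
)
```

The two plots from the question can then be drawn from results$average_sum_i_j and results$id_j, and n_outer and n_inner can be raised to 100 once the code runs as expected.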