Correctly Writing a Double Loop in R

omario · January 23, 2023, 8:37am

I have the following dataset:

library(dplyr)
library(purrr)
set.seed(123)
my_data1 = data.frame(var1 =  rnorm(500,100,100), prop = runif(500, min=0, max=0.5))
my_data2 = data.frame(var1 = rnorm(500, 200, 50), prop = runif(500, min=0.4, max=0.7))
my_data = rbind(my_data1, my_data2)

I made ten equal-sized bins, found the average proportion within each bin, then plotted the results:

final = my_data %>%
arrange(var1) %>%
mutate(ntile = ntile(var1, 10)) %>%
group_by(ntile) %>%
summarise(mean = mean(prop))

plot(final$ntile, final$mean, type = "l", xlab = "Bin Number", ylab = "Average Proportion", main = "Relationship Between Bins and Average Proportion")

Now, suppose I:

Take a random 70% sample, create 10 bins and calculate the mean proportion within each of these bins
Then, from the remaining 30%, I take another random sample, create 10 bins, and calculate the mean proportion within each bin
Next, I take the sum of the squared differences between the bin means from these two steps
Finally, repeat this process many times (while tracking the results)
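
In symbols, and judging from the code that follows, the quantity in the third step appears to be the one below. The names $\bar{p}^{\text{base}}_b$ and $\bar{p}^{\text{sample}}_b$ are introduced here only for illustration: they stand for the mean proportion in bin $b$ of the 70% sample and of the second sample, respectively.

```latex
D = \sum_{b=1}^{10} \left( \bar{p}^{\text{sample}}_b - \bar{p}^{\text{base}}_b \right)^2
```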

Below, I wrote some R code for this procedure:

base = sample_n(my_data, 700)

base_comp = base %>%
            arrange(var1) %>%
            mutate(ntile = ntile(var1, 10)) %>%
            group_by(ntile) %>%
            summarise(mean = mean(prop))


sampling_frame = my_data %>% anti_join(base)

my_list = list()

for (i in 1:1000)
    
{
    
    a_i = sample_n(sampling_frame, 100)
    
    base_a_i = a_i %>%
    arrange(var1) %>%
    mutate(ntile = ntile(var1, 10)) %>%
    group_by(ntile) %>%
    summarise(mean = mean(prop))
    
    sum_i <- sum(map2_dbl(base_a_i$mean, base_comp$mean, function(x, y) (x - y)^2))
    
    my_list[[i]] <- data.frame(id = i, sum_i)
    print(data.frame(id = i, sum_i))
}

results = do.call(rbind.data.frame, my_list)

plot(density(results$sum_i), main = "Distribution of Deviations")
plot(results$id, results$sum_i, xlab = "Iteration", ylab = "Deviation", main = "Trace Plot", type = "b")

My Question: Now, I am trying to create a "double loop" - that is, I would like to:

For j = 1, randomly create "base_j" , "base_comp_j" and "sampling_frame_j"
Repeat the "i" loop 100 times
Take the average of all "sum_i" for j=1 (i.e. average_sum_i_j)
Now, repeat this for j = 2
Repeat until j = 100
Each point on these two graphs will be "average_sum_i_j"

Here is my attempt to do this:

#  change i and j index to 10 for brevity

my_list = list()

for (j in 1:10) {
    base_j = sample_n(my_data, 700)
    base_comp_j = base_j %>%
        arrange(var1) %>%
        mutate(ntile = ntile(var1, 10)) %>%
        group_by(ntile) %>%
        summarise(mean = mean(prop))
    sampling_frame_j = my_data %>% anti_join(base_j)
    sum_i_list = list()
    for (i in 1:10) {
        a_i = sample_n(sampling_frame_j, 100)
        base_a_i = a_i %>%
            arrange(var1) %>%
            mutate(ntile = ntile(var1, 10)) %>%
            group_by(ntile) %>%
            summarise(mean = mean(prop))
        sum_i <- sum(map2_dbl(base_a_i$mean, base_comp_j$mean, function(x, y) (x - y)^2))
        sum_i_list[[i]] <- sum_i
    }
    average_sum_i_j = mean(unlist(sum_i_list))
    my_list[[j]] <- data.frame(id_j = j, id_i = i, average_sum_i_j)
    print(data.frame(id_j = j, id_i = i, average_sum_i_j))
}

results = do.call(rbind.data.frame, my_list)

plot(density(results$average_sum_i_j), main = "Distribution of Deviations")
plot(results$id, results$average_sum_i_j, xlab = "Iteration", ylab = "Deviation", main = "Trace Plot", type = "b")

However, I don't think I am doing this right since the loop does not appear to be cycling through both indices.

Can someone show me how to correct this?

Thanks!

joanmils · January 23, 2023, 11:20am

You can use nested loops to achieve this.

First, you can create an outer loop that runs from j = 1 to j = 100. Inside this outer loop, you can create a new random sample for "base_j", "base_comp_j" and "sampling_frame_j".

Then, you can create an inner loop that runs from i = 1 to i = 100, where you repeat the same process as before to calculate "sum_i" for each iteration.

Finally, you can take the average of all "sum_i" for a given j and store it in a variable "average_sum_i_j".

You can then plot the density of "average_sum_i_j" and a trace of "average_sum_i_j" against "j" to visualize the results.

Here is an example of the modified code to achieve this:

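# Initialize an empty list to store the average sum_i for each j
average_sum_i_j_list <- list()

# Outer loop to run from j = 1 to j = 100
for (j in 1:100) {

    # Create random samples for base_j, base_comp_j and sampling_frame_j
    base_j <- sample_n(my_data, 700)
    base_comp_j <- base_j %>%
        arrange(var1) %>%
        mutate(ntile = ntile(var1, 10)) %>%
        group_by(ntile) %>%
        summarise(mean = mean(prop))
    # Naming the join columns avoids a "Joining with by = ..." message on every pass
    sampling_frame_j <- my_data %>% anti_join(base_j, by = c("var1", "prop"))

    # Initialize an empty list to store the sum_i for each i for a given j
    sum_i_list <- list()

    # Inner loop to run from i = 1 to i = 100
    for (i in 1:100) {

        # Create a random sample of 100 from sampling_frame_j
        a_i <- sample_n(sampling_frame_j, 100)

        # Calculate base_a_i
        # (the reply is cut off after "base_a_i <- a_i"; everything from here on
        # is assumed to follow the code in the question)
        base_a_i <- a_i %>%
            arrange(var1) %>%
            mutate(ntile = ntile(var1, 10)) %>%
            group_by(ntile) %>%
            summarise(mean = mean(prop))

        # Sum of squared differences between the two sets of bin means
        sum_i_list[[i]] <- sum((base_a_i$mean - base_comp_j$mean)^2)
    }

    # Average of all sum_i for this j, stored with its j index only
    average_sum_i_j_list[[j]] <- data.frame(id_j = j, average_sum_i_j = mean(unlist(sum_i_list)))
}

results <- do.call(rbind.data.frame, average_sum_i_j_list)

plot(density(results$average_sum_i_j), main = "Distribution of Deviations")
plot(results$id_j, results$average_sum_i_j, xlab = "Iteration (j)", ylab = "Deviation", main = "Trace Plot", type = "b")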
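
A note on why the attempt in the question may look as if it is not cycling through both indices. The results row is built after the inner loop has finished, so id_i = i always stores the final value of i. In addition, results$id matches neither id_j nor id_i uniquely, so it cannot be used for the trace plot. The toy example below is only a sketch of that bookkeeping: it uses base R, and i * j stands in for sum_i (an assumption made for illustration, not the real calculation).

```r
# Toy nested loop: no sampling, only the index bookkeeping.
rows <- list()
for (j in 1:3) {
  inner <- numeric(4)
  for (i in 1:4) {
    inner[i] <- i * j   # hypothetical stand-in for sum_i
  }
  # The inner loop has finished here, so i still holds its last value.
  # Storing id_i = i therefore records the same number on every row.
  rows[[j]] <- data.frame(id_j = j, id_i = i, average = mean(inner))
}
results <- do.call(rbind, rows)

print(results$id_i)  # the last inner index, repeated once per j
print(results$id_j)  # the outer index, which is what the trace plot needs
print(results$id)    # NULL: "id" partially matches both id_j and id_i
```

Both loops do run in the question's code. Dropping id_i from the stored row and plotting against results$id_j should be enough to make that visible.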
