Multiple Conditions in WHILE LOOPS

I am trying to write the following loop - this loop will:

  • Take 3 random samples of the iris dataset
  • Check whether the means of the "Sepal.Length" column for the 3 random samples sum to less than 15
  • Make sure that the number of common rows across the 3 random samples is no more than 5
five = as.integer(5)
list_results <- list()
for (i in 1:100){
    
    c1_i = c2_i = c3_i = ctotal_i =  0
    
    while(c1_i + c2_i  + c3_i < 15 && nrow_i  <  five ) {
        
        
        num_1_i = sample_n(iris, 30)
        
        
        
        num_2_i = sample_n(iris, 30)
        
        
        num_3_i = sample_n(iris, 30)
        
        
        c1_i = mean(num_1_i$Sepal.Length)
        c2_i = mean(num_2_i$Sepal.Length)
        c3_i = mean(num_3_i$Sepal.Length)
        ctotal_i = c1_i + c2_i  + c3_i
        
        combined_i = rbind(num_1_i, num_2_i, num_3_i)
        nrow_i = nrow(unique(combined_i[duplicated(combined_i), ]))
        
    }
    
    inter_results_i <- data.frame(i, c1_i, c2_i, c3_i, ctotal_i, nrow_i)
    list_results[[i]] <- inter_results_i
}

When I run this loop, it is giving me an "empty result".

Can someone please show me what I am doing wrong and how I can try to correct this?

Thanks!

The code you posted will not run because nrow_i is not defined. I set it to zero at the top of the for loop and found that the while loop executes only once for every value of i. I showed this by adding a counter variable j that increments with each iteration of the while loop. In the following reprex, you can see that j is always 1, ctotal_i is always > 15 and nrow_i is always > 5.
From your description of the task, it seems the test of the while loop should be c1_i + c2_i + c3_i > 15 || nrow_i > five, so that the loop keeps running until ctotal_i falls to 15 or below and nrow_i falls to 5 or below. However, it might take a very long time to meet both conditions. In 100 iterations, the minimum ctotal_i was 16.9 and the minimum nrow_i was 8. The chance of getting both below the thresholds seems very low.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
five = as.integer(5)
list_results <- list()
for (i in 1:100){
  
  c1_i = c2_i = c3_i = ctotal_i =  0
  nrow_i <- 0
  j <- 0
  
  while(c1_i + c2_i  + c3_i < 15 && nrow_i  <  five ) {
    
    j <- j+1
    num_1_i = sample_n(iris, 30)
    
    
    
    num_2_i = sample_n(iris, 30)
    
    
    num_3_i = sample_n(iris, 30)
    
    
    c1_i = mean(num_1_i$Sepal.Length)
    c2_i = mean(num_2_i$Sepal.Length)
    c3_i = mean(num_3_i$Sepal.Length)
    ctotal_i = c1_i + c2_i  + c3_i
    
    combined_i = rbind(num_1_i, num_2_i, num_3_i)
    nrow_i = nrow(unique(combined_i[duplicated(combined_i), ]))
    
  }
  
  inter_results_i <- data.frame(i, c1_i, c2_i, c3_i, ctotal_i, nrow_i,j)
  list_results[[i]] <- inter_results_i
}

sapply(list_results,function(DF) DF$j)
#>   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#>  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
min(sapply(list_results,function(DF) DF$ctotal_i))
#> [1] 16.91667
min(sapply(list_results,function(DF) DF$nrow_i))
#> [1] 8

Created on 2022-07-04 by the reprex package (v2.0.1)
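
To make the point about the loop test concrete, here is a minimal sketch using a toy example rather than the iris code. The variables x, y and tries are made up for illustration. A while loop repeats as long as its condition is TRUE, so to stop once two goals are both met, the condition has to be the negation of "goal A and goal B", which is the same as "not A or not B".

```r
set.seed(1)   # assumption: fixed seed only so the run is repeatable

x <- Inf      # start with values that fail both goals,
y <- Inf      # so the loop body runs at least once
tries <- 0

# Goal: stop once x < 0.5 AND y < 0.5.
# Keep looping while that goal is NOT met,
# i.e. while x >= 0.5 OR y >= 0.5.
while (!(x < 0.5 && y < 0.5)) {
  x <- runif(1)
  y <- runif(1)
  tries <- tries + 1
}

cat("tries:", tries, " x:", x, " y:", y, "\n")
```

Writing the test as x < 0.5 && y < 0.5 instead would do the opposite: the loop would keep going only while both goals are already met, which is why the posted loop stops after one pass.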

Thank you so much for your answer! I had 2 questions:

  • Question 1: When I ran your code, I noticed that in the results nrow_i is almost always larger than 5. Why is this happening? I would have thought that this LOOP would take a very long time to run in order to satisfy both conditions within the WHILE LOOP. Do you know why in the results nrow_i is always appearing as larger than 5? Is it possible to write the conditions such that it becomes less than 5?

  • Question 2: What is the purpose of sapply(list_results,function(DF) DF$j)? I don't see a "DF" object defined anywhere in the code, yet this code still runs. What are you trying to accomplish using "sapply" and "DF"?

Thank you so much for all your help!

  1. I do not know enough about the mathematics of combinations to explain why the values of nrow_i are what they are. I do not even have a good intuition about this kind of problem. You are sampling 20% of the population three times and asking that less than 3.3% (5/150) of the total population be sampled more than once. I could not have told you beforehand even roughly how likely that is, but running
AllDat <- bind_rows(list_results)
hist(AllDat$nrow_i)

will show you that typical values of nrow_i are in the mid-teens.
  2. The code sapply(list_results,function(DF) DF$j) simply displays all the values of j over the 100 iterations of the for loop. The object list_results is a list of data frames. sapply() iterates over that list and passes each data frame to the little function I wrote, function(DF) DF$j. I named the argument of the function DF, and it receives each element of list_results in turn, that is, each data frame. The function returns the j column of the data frame and sapply() builds a vector from those values. The purpose of the lines with sapply() was to show that j is always 1, ctotal_i is never < 15 and nrow_i is never < 5.
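
For point 1, a rough estimate of the typical nrow_i can be sketched as follows. This is a back-of-the-envelope calculation, not something from the original code: it assumes each of the 150 rows lands in any one sample with probability 30/150, that the three samples are independent, and it counts rows by position, ignoring the fact that iris may contain a few rows with identical values. The names p, p_repeat and expected_nrow are made up for illustration.

```r
n <- nrow(iris)   # 150 rows in the data set
k <- 30           # rows drawn per sample
p <- k / n        # chance a given row lands in one sample (0.2)

# Chance a given row lands in at least 2 of the 3 independent samples:
# exactly two samples (3 ways) plus all three samples
p_repeat <- 3 * p^2 * (1 - p) + p^3

# Expected number of distinct rows that show up more than once
expected_nrow <- n * p_repeat
expected_nrow
#> [1] 15.6
```

Under these assumptions the expected value is about 15.6, which is in line with the histogram showing values in the mid-teens and suggests that values below 5 are far out in the tail.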

@FJCC: Thank you for your reply!

I would have thought that while(c1_i + c2_i + c3_i < 15 && nrow_i > five ) would keep running (even if it runs for infinite time) until all values of "nrow_i < 5". Is this correct?

In short - suppose I didn't care how long the R code takes to run - what kind of condition would I have to write to ENSURE that this WHILE LOOP ONLY outputs results where nrow_i<5?

Thank you so much for all your help!

If you change the condition of the while loop to while(c1_i + c2_i + c3_i < 15 && nrow_i > five ), the sub-condition c1_i + c2_i + c3_i < 15 will almost certainly be FALSE after the first iteration, so the loop will iterate only once. Each cx_i variable is the mean of Sepal.Length over 30 sampled rows. Since Sepal.Length averages about 5.8 over the whole iris data set, each sample mean will also be close to 5.8, and the sum of three of them (roughly 17.5) will essentially never be less than 15. If you remove the condition c1_i + c2_i + c3_i < 15 entirely, you can test how long it takes to get samples with nrow_i < 5.
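
A minimal sketch of that last suggestion, looping on the overlap condition alone, might look like this. It is only an illustration under some assumptions: it uses base-R row sampling in place of dplyr::sample_n() so that it runs without extra packages, it fixes a seed for repeatability, and it adds a hypothetical max_attempts cap (not in the original code) so the loop cannot run forever.

```r
set.seed(42)           # assumption: fixed seed only for repeatability
sample_size <- 30
max_attempts <- 2000   # hypothetical safety cap, not in the original code
nrow_i <- Inf          # start above the threshold so the loop runs
attempt_count <- 0

# Keep resampling while the overlap is still 5 or more,
# but give up once the cap on attempts is reached.
while (nrow_i >= 5 && attempt_count < max_attempts) {
  attempt_count <- attempt_count + 1

  # Base-R stand-in for sample_n(iris, sample_size), three times
  combined_i <- rbind(
    iris[sample(nrow(iris), sample_size), ],
    iris[sample(nrow(iris), sample_size), ],
    iris[sample(nrow(iris), sample_size), ]
  )

  # Distinct rows that occur more than once in the combined data
  nrow_i <- nrow(unique(combined_i[duplicated(combined_i), ]))
}

cat("attempts:", attempt_count, " nrow_i:", nrow_i, "\n")
```

If the loop ends with attempt_count equal to max_attempts, no qualifying set of samples was found within the cap, and the reported nrow_i is just the value from the last attempt.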

In the code below, mean_target is set to 18 and sample_size_at_each_step to 20 so that it completes in a reasonable time. If you lower mean_target from 18 to 15 and raise sample_size_at_each_step from 20 to 30, as per your initial statements, this code will run and run.

library(dplyr)
sample_size_at_each_step <- 20
mean_target <- 18
five = as.integer(5)
list_results <- list()
for (i in 1:100){
  
  c1_i <- c2_i <- c3_i <- ctotal_i <-  Inf
  nrow_i <- Inf
  attempt_count <- 0
  # keep sampling until both targets are met; || is the scalar OR suited to loop conditions
  while(c1_i + c2_i  + c3_i > mean_target || nrow_i  >  five ) {
    attempt_count <- attempt_count + 1
    cat("i:" , i, "\tattempt: ",attempt_count,"\n")

    num_1_i = sample_n(iris, sample_size_at_each_step)
    num_2_i = sample_n(iris, sample_size_at_each_step)
    num_3_i = sample_n(iris, sample_size_at_each_step)
    
    
    c1_i = mean(num_1_i$Sepal.Length)
    c2_i = mean(num_2_i$Sepal.Length)
    c3_i = mean(num_3_i$Sepal.Length)
    ctotal_i = c1_i + c2_i  + c3_i
    
    combined_i = rbind(num_1_i, num_2_i, num_3_i)
    nrow_i = nrow(unique(combined_i[duplicated(combined_i), ]))

  }
  
  inter_results_i <- data.frame(i, c1_i, c2_i, c3_i, ctotal_i, nrow_i)
  list_results[[i]] <- inter_results_i
}
