Simulating Coin Flips in R

Here is a problem I thought of:

  • Suppose I am watching someone flip a fair coin. Each flip is completely independent from the previous flip.
  • I watch this person flip 3 consecutive heads.
  • I interrupt this person and ask the following question: If the next flip results in a "head", I will buy you a slice of pizza. If the next flip results in a "tail", you will buy me a slice of pizza. Who has the better odds of winning?

I wrote the following simulation using the R programming language. In this simulation, a "coin" is flipped many times ("1" = HEAD, "0" = TAILS). We then count the percentage of times HEAD-HEAD-HEAD-HEAD appears compared to HEAD-HEAD-HEAD-TAILS:

#load library

#define number of flips
n <- 10000000 

#flip the coin many times
flips = sample(c(0,1), replace=TRUE, size=n)

#count the percent of times HEAD-HEAD-HEAD-HEAD appears 
str_count(paste(flips, collapse=""), '1111') / n


#count the percent tof times HEAD-HEAD-HEAD-TAIL appears
str_count(paste(flips, collapse=""), '1110') / n


From the above analysis, it appears as if the person's luck runs out: after 3 HEADS, there is a 3.33% chance that the next flip will be a HEAD compared to a 6.25% chance the next flip will not be a HEAD (i.e. TAILS).

Thus, could we conclude: Even though the probability of each flip is independent from the previous flip, it becomes statistically more advantageous to observe a sequence of HEADS and then bet the next flip will be a TAILS. Thus, the longer the sequence of HEADS you observe, the stronger the probability becomes of the sequence "breaking".

My Question: Is the R code I have written correct? Does it actually correspond to this problem I have created?


I think what you are seeing is an artifact of how str_count works. To get the correct answer, you need to look at any instance of three consecutive 1s and check the subsequent result. But str_count will only find one instance of "1111" in a long series of 1s. For example "11111" has two instances of three 1s followed by a 1. But

[1] 1

That means that if you flip four 1s in a row but those flips are preceded by a 1, they will be ignored. Your flipping process is independent, but your counting is not.
Here is another way to look at the incidence of 1 or 0 after three 1s.

n <- 100000 
Out <- vector("numeric",n)
i <- 1

   flips <- rep(0,4)
   while(sum(flips[1:3])!=3){ #flip groups of 4 until the first three are all 1
     flips = sample(c(0,1), replace=TRUE, size=4)
   Out[i] <- flips[4]
   i <- i+1
[1] 49826
[1] 50174
1 Like

My gut feeling is that trying to simulate the first 3 coin tosses is dangerous given you already know they're independent & you're not trying to estimate weighting of the coin (both given in the scenario). If you do so you'd better be very sure your code is doing what you think it's doing.

E.g. there are potentially between 1 & 10 occurrences of '1111' in '11111111111110', depending on how you set your code up to count them, but there is only 1 occurrence of '1110'.

Given the scenario I'd ignore the previous 3 coin tosses completely & simplify the whole problem to a single, evenly-weighted coin toss. Am I missing something?

For the record, a cross post:

The problem you are describing is the well known Gambler's fallacy. Now, of course, if the tosses are actually independent, you can ignore everything that happened before, and just go with the actual event: p = 0.5. But that won't tell you whether the events actually were independent. So if independence is your assumption, there is no point doing this at all. But how can we test this? Let's take your sequence, put all sequences of 4 events into a matrix, then pull out all instances where the first three events are "1", and then see what we have. No loops required. 1,000,000 trials is pretty much instantaneous.

N <- 1000000
set.seed <- 3141593
events <- sample(0:1, N, replace = TRUE)
sets <- matrix(numeric((N-3) * 4), nrow = N-3, ncol = 4)

sets[ , 1] <- events[1:(N-3)]
sets[ , 2] <- events[2:(N-2)]
sets[ , 3] <- events[3:(N-1)]
sets[ , 4] <- events[4:(N-0)]

sel <- rowSums(sets[, 1:3]) == 3        # 3  times (1) in the first three columns
head(sets[sel, ])                       # Always! check what you are doing

(tbl <- table(sets[sel, 4]))            # There you go.

(abs(tbl[1] - tbl[2]) / tbl[2]) * 100   # Difference in %

I get 62,344 vs. 62,244 which is just under 0.3% difference.


As previously mentioned, you should also consider counting 1111 twice within 11111 and beyond. 011110 would be one specific probability, while more 1s beyond four consecutives would require a bit more work. Would you say that you defined your problem specifically enough? I think your question can be answered better, whether it was coded with the same intention as your intended hypothesis. Steipe's solution seems to be what you're looking for, but still not 100% sure because it was not specified in the original post.

@FJCC and @Steipe both provide solutions which will accurately replicate the OQ, but note that the results from either will vary depending on the seed you set. Other seeds will give different totals.

Actually @FJCC's solution is not quite correct, since it operates on groups of four, not one long string. If there were an issue with the RNG that would affect patterns of more than four repetitions, this would not be recognized.

There is an important lesson in that: when doing these kinds of statistical simulations, you have to be careful to translate the question exactly, and not use code that you assume to be equivalent. OP has made an assumption about how str_count() works, that turned out to be wrong, and @FJCC made an assumption that four independently flipped elements would be equivalent to OP's scenario, but that's not exactly true (although their analysis of the reason why OP's result is wrong is spot on!). Of course the resulting code may turn out to be awkward - but otherwise, if we assume independence, we might just as well write table(sample(0:1, 1000000, replace=TRUE)) - and we all know how that will turn out.