Detecting complete and partial patterns in a vector of numbers

Hello,

I am looking for ways to find patterns in sets of numbers. Each number can only take a value from 1 to 5. Unfortunately, I don't know all the ways in which patterns will present themselves, but I want to at least pick up partial patterns and quantify them.

Below I have added some sequences. Any idea how to approach this?

#should find the repetition of 1,2,3
c(1,2,3,1,2,3,1,2,3)

#should find the repetition of 3,2,1
c(3,2,1,3,2,1,3,2,1)

#should find the 3,2 and 5,3 repetition
c(3,2,3,2,5,3,5,3,4)

#should find the larger pattern 1,4,5,4,1 and/or 5,4,1 repeating
c(1,4,5,4,1,5,4,1,2)

#should not find any patterns 
c(5,3,1,4,2,1,1,3,4)

library(tidyverse)
library(purrr)

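# Return the distinct windows of length `len` that occur more than once in
# `numvec`, or NA if nothing repeats.
get_ngram <- function(numvec,len){
  result <- NA

  # number of windows of length `len` that fit in the vector
  inside_count <- length(numvec)-len+1

  if(inside_count>1) {

    # collect every consecutive window of length `len`
    first_pass <- list()
    iloop <- seq_len(inside_count)
    for(i in iloop){

      first_pass[[i]] <- numvec[i:(i+len -1 )]
    }
    # keep only the windows seen more than once, listing each just once
    second_pass <- unique(first_pass[duplicated(first_pass)])
    if(length(second_pass)>0)
      result<-second_pass
  }
  
  result 

}

# Run get_ngram() for every window length from 2 up to length(numvec) - 1
get_ngrams <- function(numvec){
  l <- 2:(length(numvec)-1)
  map(l,
      ~get_ngram(numvec,.)) %>% set_names(paste0("length_",l))

}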


#should find the repetition of 1,2,3
get_ngrams(c(1,2,3,1,2,3,1,2,3))

#should find the repetition of 3,2,1
get_ngrams(c(3,2,1,3,2,1,3,2,1))

#should find the 3,2 and 5,3 repetition
get_ngrams(c(3,2,3,2,5,3,5,3,4))

#should find the larger pattern 1,4,5,4,1 and/or 5,4,1 repeating
get_ngrams(c(1,4,5,4,1,5,4,1,2))
#actually only 5,4,1 repeats; 1,4,5,4,1 occurs just once, so there is no repeated 1,4,5,4,1 pattern to find

#should not find any patterns 
get_ngrams(c(5,3,1,4,2,1,1,3,4))

Ahhh you're amazing! It is like you have answers to everything :rofl:

I understand your point regarding the c(1,4,5,4,1) example. Is there some way in R to work out which number is likely to come next in a set, so that given, say, c(1,4,5,4,x), it would substitute 1 for x? I want to be able to find less obvious patterns like that too. Ngrams in general make a lot of sense for this. I think I will partly use your solution and also flip the set around to read it from right to left (given that the numbers are not parts of a word here).
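
One way to make the right-to-left idea concrete is to look for windows that read the same in both directions, which is what makes 1,4,5,4,1 look like a pattern even though it never repeats. The sketch below is only an illustration in base R; `find_palindromes` and its `min_len` argument are hypothetical names, not part of the answer above.

```r
# Hypothetical helper: list every window of length >= min_len that reads the
# same left-to-right and right-to-left.
find_palindromes <- function(numvec, min_len = 3) {
  n <- length(numvec)
  out <- list()
  if (n < min_len) return(out)
  for (len in min_len:n) {
    for (start in seq_len(n - len + 1)) {
      window <- numvec[start:(start + len - 1)]
      # a window is symmetric when it equals its own reverse
      if (identical(window, rev(window))) {
        out[[length(out) + 1]] <- window
      }
    }
  }
  out
}

palindromes <- find_palindromes(c(1,4,5,4,1,5,4,1,2))
palindromes
```

For this vector the symmetric windows are 4,5,4 and 1,4,5,4,1. Like the ngram search, this checks every window of every length, so it is only practical for short vectors.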

I'm afraid I don't really follow what you are asking.
It seems like the idea is to go beyond matching repeated patterns to some definition of an almost-detectable pattern? You could brute-force solutions for that, but I would only think it's worth trying if the vectors you analyse are not much longer than this, because it would scale awfully poorly.
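
As a rough illustration of what such a brute-force search might look like, the sketch below compares every pair of non-overlapping windows of a given length and keeps the pairs that differ in at most `max_mismatch` positions. The function name `near_repeats`, the mismatch rule, and the non-overlap restriction are all assumptions made for this example, just one possible definition of an "almost" pattern.

```r
# Hypothetical helper: find pairs of non-overlapping windows of length `len`
# that differ in at most `max_mismatch` positions.
near_repeats <- function(numvec, len, max_mismatch = 1) {
  n_windows <- length(numvec) - len + 1
  hits <- list()
  if (n_windows < 2) return(hits)
  for (i in seq_len(n_windows)) {
    # stop once no later window can start without overlapping window i
    if (i + len > n_windows) break
    for (j in (i + len):n_windows) {
      a <- numvec[i:(i + len - 1)]
      b <- numvec[j:(j + len - 1)]
      mismatches <- sum(a != b)
      if (mismatches <= max_mismatch) {
        hits[[length(hits) + 1]] <-
          list(first = i, second = j, mismatches = mismatches)
      }
    }
  }
  hits
}

hits <- near_repeats(c(1,4,5,4,1,5,4,1,2), len = 3)
str(hits)
```

The nested loops visit every pair of windows, and repeating this for every window length is where the poor scaling comes from.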

Yes, that would be it: basically a detectable pattern. In some cases I will have vectors up to 20 long. I was hoping there was some sort of mathematical solver, or a way to run some clever set of diffs, to derive that set. I suppose fitting an lm or similar wouldn't work, as you can't know the shape of that line beforehand or readily find a way to solve for it either?
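
One simple alternative to a regression, sketched here purely as an assumption about what might help, is to compare the vector with shifted copies of itself and score how often the values agree at each lag. A lag with high agreement suggests a repeating cycle of that length, and the agreement score gives a way to quantify partial patterns. `lag_agreement` and `max_lag` are hypothetical names introduced for this example.

```r
# Hypothetical helper: for each lag p, the share of positions i where
# numvec[i] equals numvec[i + p].
lag_agreement <- function(numvec, max_lag = length(numvec) %/% 2) {
  n <- length(numvec)
  lags <- seq_len(max_lag)
  score <- vapply(lags, function(p) {
    mean(numvec[seq_len(n - p)] == numvec[(p + 1):n])
  }, numeric(1))
  data.frame(lag = lags, agreement = score)
}

x <- c(1,2,3,1,2,3,1,2,3)
res <- lag_agreement(x)
res

# If one lag clearly dominates, a naive guess for the next value is the
# element one cycle back from the position being predicted.
best_lag <- res$lag[which.max(res$agreement)]
next_guess <- x[length(x) + 1 - best_lag]
next_guess
```

For this vector the agreement is 1 at lag 3 and 0 at the other lags, so the naive guess for the next value is 1. This would not pick up a mirrored pattern like 1,4,5,4,1, which has no fixed cycle.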

Currently, I am thinking of stringing each full set together as a "hash" of sorts and comparing that directly to the others. With the number of combinations etc. I should also see a fair spread.
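
A minimal sketch of that idea, assuming each set is a numeric vector stored in a list: collapse every vector into a single string key and tabulate the keys. The third vector below is a deliberate duplicate added for illustration, and the names `sequences` and `keys` are made up for this example.

```r
sequences <- list(
  c(1,2,3,1,2,3,1,2,3),
  c(3,2,1,3,2,1,3,2,1),
  c(1,2,3,1,2,3,1,2,3)  # deliberate duplicate of the first vector
)

# Collapse each vector into one string so whole sets can be compared directly
keys <- vapply(sequences, paste, character(1), collapse = "-")

# How often each full set occurs
key_counts <- table(keys)
key_counts
```

Comparing keys this way only catches sets that are identical from start to end, so it complements rather than replaces the ngram search.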
