"Predicting" the Result of a Query Before Running the Query

omario · April 10, 2022, 6:24am

I am working with the R programming language.

I have the following dataset (each variable can take values from 1-10):

factor <- c(1,2,3,4,5,6,7,8,9,10)

var_1 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_2 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))    
var_3 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))    
var_4 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_5 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))

my_data = data.frame(var_1, var_2, var_3, var_4, var_5)

head(my_data)
  var_1 var_2 var_3 var_4 var_5
1     2     9     8     3     5
2    10     1     4     7     5
3     4     7     1     7     6
4     9     8     3     4     7
5     6     5     5     9     7
6     2     8    10     2     7

I also have a data frame that contains a series of "conditions" used for selecting rows from this data frame:

conditions = data.frame(var_1 = c("1,2", "2"), var_2 = c("1,2,3,4", "1,2,3,8,9"), var_3 = c("4,6", "5,6,7"))

 conditions
  var_1     var_2 var_3
1   1,2   1,2,3,4   4,6
2     2 1,2,3,8,9 5,6,7

Suppose I want to run the following queries on this dataset:

query_1 <- my_data[Reduce(`&`,
  Map(`%in%`,
      lapply(my_data[,1:3], as.character),
      lapply(conditions, function(z) strsplit(z, ",")[[1]]))),]

query_2 <- my_data[Reduce(`&`,
  Map(`%in%`,
      lapply(my_data[,1:3], as.character),
      lapply(conditions, function(z) strsplit(z, ",")[[2]]))),]

# traditional form for reference 
# query_1 = my_data[my_data$var_1 %in% c("1", "2") & my_data$var_2 %in% c("1", "2", "3", "4") & my_data$var_3 %in% c("4", "6") ,  ]

# query_2 = my_data[my_data$var_1 %in% c("2") & my_data$var_2 %in% c("1", "2", "3", "8", "9") & my_data$var_3 %in% c("5", "6", "7") ,  ]

It is immediately clear that the results of both of these queries are not identical:

> identical(query_1, query_2)
[1] FALSE

However, we can also see that even though the results of both queries are non-identical, there are still common rows between both of these queries:

combined <- rbind(query_1, query_2)

duplicate_rows <- unique(combined[duplicated(combined), ])

My Question: Is it possible to somehow add some "random noise" to the "Conditions" data frame such that "query_1" and "query_2" will have no common rows?

I know that a simple solution would be to make sure that no common numbers are present whatsoever in the conditions for "query_1" and "query_2" - but using basic logic, it is possible that some common numbers can be present in the conditions for "query_1" and "query_2", and still result in "query_1" and "query_2" having no common rows.

For example:

Query_1: Height = Tall AND Basketball = Yes
Query_2: Height = Tall AND Basketball = No

In this case, Query_1 and Query_2 share a similar condition for "height" but a different condition for "basketball" - this will result in Query_1 and Query_2 sharing no common rows. For example, Robert Wadlow (Robert Wadlow - Wikipedia) was Tall AND did not play Basketball, but Wilt Chamberlain (Wilt Chamberlain - Wikipedia) was Tall AND did play basketball. Both shared "Tall", but since they had a unique value for "Basketball", they fall in distinct categories. Logically speaking, you could take the entire population of the earth and still no human would be present in the results of both "query_1" and "query_2" due to exclusive nature of both queries.

The same way, I would like to predict and anticipate the result of the earlier queries before hand and make sure that these queries have no common rows - and if they do have common rows, I would like to add some "random noise" to the conditions (data frame) used to create these queries such that no common rows are shared by these queries.

Thus, is there a way to take the "conditions" data frame and add some "random noise" to this data frame such that Query_1 and Query_2 will not share any common rows?

Thanks!

pieterjanvc · April 11, 2022, 12:08pm

Hi,

I'm not sure I follow all of your logic here, but if you are OK with having duplicate data within a query removed, you can simple add some extra id columns and make things unique.

factor <- c(1,2,3,4,5,6,7,8,9,10)

var_1 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_2 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))    
var_3 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))    
var_4 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))
var_5 <- as.factor(sample(factor, 10000, replace=TRUE, prob=c(0.1,0.1,0.1,0.1,0.1, 0.1,0.1,0.1,0.1,0.1)))

my_data = data.frame(var_1, var_2, var_3, var_4, var_5)
my_data$id = 1:length(my_data) #Add row id

conditions = data.frame(var_1 = c("1,2", "2"), var_2 = c("1,2,3,4", "1,2,3,8,9"), var_3 = c("4,6", "5,6,7"))

query_1 <- my_data[Reduce(`&`,
                          Map(`%in%`,
                              lapply(my_data[,1:3], as.character),
                              lapply(conditions, function(z) strsplit(z, ",")[[1]]))),]
query_1$query = 1 #Add query id
query_1 = unique.data.frame(query_1) #Only keep unique results

query_2 <- my_data[Reduce(`&`,
                          Map(`%in%`,
                              lapply(my_data[,1:3], as.character),
                              lapply(conditions, function(z) strsplit(z, ",")[[2]]))),]
query_2$query = 2 #Add query id
query_2 = unique.data.frame(query_2) #Only keep unique results

combined <- rbind(query_1, query_2)
unique(combined[duplicated(combined), ])
# [1] var_1 var_2 var_3 var_4 var_5 id    query
# <0 rows> (or 0-length row.names)

Hope this helps,
PJ

system · April 18, 2022, 12:25pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.