Comparing the Effect of a Variable Being Absent/Present?

I am working with the R programming language.

I have the following data:

set.seed(123)

var1 = sample(0:1, 10000, replace=T)
var2 = sample(0:1, 10000, replace=T)
var3 = sample(0:1, 10000, replace=T)
var4 = sample(0:1, 10000, replace=T)
score = rnorm(10000,10,5)

my_data = data.frame(var1,var2, var3,var4, score)

We can see the summary of unique rows for this data with the following command:

# https://stackoverflow.com/questions/34312324/r-count-all-combinations
> library(data.table)
> dt = my_data[,c(1,2,3,4)]
> setDT(dt)[,list(Count=.N) ,names(dt)]
    var1 var2 var3 var4 Count
 1:    0    0    0    0   667
 2:    0    1    0    0   601
 3:    1    1    1    1   651
 4:    0    1    1    1   608
 5:    1    0    1    1   613
 6:    1    1    0    1   588
 7:    0    1    1    0   607
 8:    0    0    1    1   607
 9:    1    0    1    0   625
10:    0    1    0    1   661
11:    1    1    1    0   635
12:    0    0    1    0   640
13:    1    1    0    0   608
14:    1    0    0    0   607
15:    0    0    0    1   626
16:    1    0    0    1   656

I want to compare the average value of "score" when a given variable is "present" with the average when that same variable is "absent", keeping the other variables fixed. For example:

  • Contribution of var4: Average score for (var1 = 1, var2 = 1, var3 = 1, var4 = 1) - Average score for (var1 = 1, var2 = 1, var3 = 1, var4 = 0)
  • Contribution of var2: Average score for (var1 = 1, var2 = 1, var3 = 1, var4 = 1) - Average score for (var1 = 1, var2 = 0, var3 = 1, var4 = 1)
  • etc.

I found a very "clumsy" way to do this:

var1_present <- my_data[which(my_data$var1 == 1 & my_data$var2 == 1 & my_data$var3 == 1 & my_data$var4 == 1 ), ]

var1_present_score = mean(var1_present$score)

var1_absent <- my_data[which(my_data$var1 == 0 & my_data$var2 == 1 & my_data$var3 == 1 & my_data$var4 == 1 ), ]

var1_absent_score = mean(var1_absent$score)

var_1_contribution = var1_present_score - var1_absent_score
var_1_contribution

[1] 0.1288283

Is there some way to write a function that can look at the "contribution" of different variables to the "score"? I understand that even for 4 variables there can be many different combinations to compare, e.g. row 14 vs. row 16: (1,0,0,0) vs. (1,0,0,1). But even for just some "contributions", is it possible to write a function that evaluates the "contribution" of variables being absent/present?

Can someone please show me how to do this?

Thanks!

Hello,

I think I have a way of finding out the individual contributions of each variable.

library(tidyverse)

set.seed(123)

var1 = sample(0:1, 10000, replace=T)
var2 = sample(0:1, 10000, replace=T)
var3 = sample(0:1, 10000, replace=T)
var4 = sample(0:1, 10000, replace=T, prob = c(0.2,0.8))
score = rnorm(10000,10,5)

my_data = data.frame(var1,var2, var3,var4, score)


# Transform the data into long format (create an id to keep track of the original rows)
my_data = my_data %>% 
  mutate(id = 1:n()) %>%  
  pivot_longer(-c(score, id), names_to = "var", values_to = "present") %>% 
  # Within each original row, split the score equally among the variables that are present
  group_by(id) %>% 
  mutate(
    contrPerc = present / max(sum(present), 1),
    contrVal = score * contrPerc)

head(my_data, 8)
#> # A tibble: 8 × 6
#> # Groups:   id [2]
#>   score    id var   present contrPerc contrVal
#>   <dbl> <int> <chr>   <int>     <dbl>    <dbl>
#> 1  5.82     1 var1        0       0       0   
#> 2  5.82     1 var2        0       0       0   
#> 3  5.82     1 var3        0       0       0   
#> 4  5.82     1 var4        1       1       5.82
#> 5  8.90     2 var1        0       0       0   
#> 6  8.90     2 var2        1       0.5     4.45
#> 7  8.90     2 var3        0       0       0   
#> 8  8.90     2 var4        1       0.5     4.45

#Summarise the contribution per variable
my_data %>% group_by(var) %>% 
  summarise(contrPerc = mean(contrPerc), 
            contrVal = mean(contrVal[present == 1]))
#> # A tibble: 4 × 3
#>   var   contrPerc contrVal
#>   <chr>     <dbl>    <dbl>
#> 1 var1      0.198     3.98
#> 2 var2      0.197     3.96
#> 3 var3      0.198     3.94
#> 4 var4      0.381     4.69

Created on 2022-06-07 by the reprex package (v2.0.1)

I converted the data into long format and then, by adding a few statistics, was able to calculate the contributions. Note that I changed the probability of variable 4 being '1' to 80% to showcase that it ends up with a higher contribution. 'contrPerc' is the average share of the score attributed to a variable. Within each row the shares sum to 1 (or to 0 when no variable is present), so the averaged 'contrPerc' values sum to just under 1 (about 0.974 in the output above). 'contrVal' is the average amount attributed to a variable in the rows where it is present.
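If you want to check how the contrPerc shares add up, here is a minimal base R sketch. It assumes the same simulated indicators as in my reprex; the names `present`, `n_present` and `shares` are made up for this illustration. Rows where no variable is present get a share of 0 everywhere, so the column averages add up to the fraction of rows with at least one variable present.

```r
set.seed(123)

# Same indicator variables as in the reprex (var4 is '1' about 80% of the time)
var1 <- sample(0:1, 10000, replace = TRUE)
var2 <- sample(0:1, 10000, replace = TRUE)
var3 <- sample(0:1, 10000, replace = TRUE)
var4 <- sample(0:1, 10000, replace = TRUE, prob = c(0.2, 0.8))
present <- cbind(var1, var2, var3, var4)

# Same formula as contrPerc: each present variable gets an equal share of its row
n_present <- rowSums(present)
shares <- present / pmax(n_present, 1)

# Every row's shares add up to 1, or to 0 when nothing is present
table(round(rowSums(shares), 10))

# The averaged shares add up to the fraction of rows with at least one variable present
total_share <- sum(colMeans(shares))
frac_any <- mean(n_present > 0)
all.equal(total_share, frac_any)
```
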

I don't know if this is exactly what you want, but it might get you there.
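If what you are after is the literal "present minus absent" comparison from your question, here is a minimal sketch of a helper for that. The names `wide_data`, `contribution()` and `all_contributions()` are hypothetical (they are not from any package), and I am assuming the data is still in the original wide format with 0/1 indicators, so I rebuild it under a new name because `my_data` was overwritten with the long format above.

```r
set.seed(123)

# Rebuild the wide data from the question (equal probabilities for all variables)
var1 <- sample(0:1, 10000, replace = TRUE)
var2 <- sample(0:1, 10000, replace = TRUE)
var3 <- sample(0:1, 10000, replace = TRUE)
var4 <- sample(0:1, 10000, replace = TRUE)
score <- rnorm(10000, 10, 5)
wide_data <- data.frame(var1, var2, var3, var4, score)

# Mean outcome with `var` present minus mean outcome with `var` absent,
# holding the other variables at the values given in `fixed`,
# e.g. fixed = c(var2 = 1, var3 = 1, var4 = 1)
contribution <- function(data, var, fixed, outcome = "score") {
  keep <- rep(TRUE, nrow(data))
  for (nm in names(fixed)) {
    keep <- keep & data[[nm]] == fixed[[nm]]
  }
  present_mean <- mean(data[[outcome]][keep & data[[var]] == 1])
  absent_mean  <- mean(data[[outcome]][keep & data[[var]] == 0])
  present_mean - absent_mean
}

# The same contrast for every 0/1 combination of the other variables
all_contributions <- function(data, var, vars, outcome = "score") {
  others <- setdiff(vars, var)
  grid <- expand.grid(rep(list(0:1), length(others)))
  names(grid) <- others
  grid$contribution <- vapply(
    seq_len(nrow(grid)),
    function(i) contribution(data, var, unlist(grid[i, others, drop = FALSE]), outcome),
    numeric(1)
  )
  grid
}

vars <- c("var1", "var2", "var3", "var4")

# The comparison done by hand in the question: var1 with the other three set to 1
contribution(wide_data, "var1", c(var2 = 1, var3 = 1, var4 = 1))

# All eight comparisons for var1
all_contributions(wide_data, "var1", vars)
```

If a combination has no rows, `mean()` returns NaN for it, so with sparser data you may want to check the counts first.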

Hope this helps,
PJ
