I am trying to find the difference between paired data points in a large dataset with many variable columns. I have already shown that the distributions are significantly different using summary pairwise tests (such as a t-test). Now I want to find the exact difference for each combination of variables.
- I suspect treatment 2 is always better than treatment 1. Is it? If so, the difference should always be negative.
- If so, by how much for a given variable combination?
You can think of this as comparing two treatments applied to the same set of inputs (thousands of variable combinations). Once I can compare data points directly, I would like to find the difference for each unique combination. I know the data will contain some duplicate input-variable combinations, but the output value should be the same for those, so only the first occurrence of each combination needs to be retained. I should also note that the number of observations for treatment 1 might differ from the number for treatment 2.
The dataset looks roughly like this:

treatment | var1 | var2 | var3 | ... | result
--- | --- | --- | --- | --- | ---
treatment1 | A | x | 3.4 | ... | 10
treatment2 | A | x | 3.4 | ... | 5
treatment1 | B | y | 2 | ... | 4
treatment2 | B | y | 2 | ... | 5

Ideally, I would end up with a dataframe containing the unique combinations of input variables and the difference in the output variable:

var1 | var2 | var3 | ... | difference
--- | --- | --- | --- | ---
A | x | 3.4 | ... | 5
B | y | 2 | ... | -1

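To make the target concrete, here is a rough sketch of the reshaping I have in mind, run on the toy rows above plus one duplicated row. It assumes the dplyr and tidyr packages. The sign convention (treatment1 minus treatment2) is inferred from the example output, and the names `dat` and `paired` are just placeholders.

```r
library(dplyr)
library(tidyr)

# Toy data mirroring the layout above, plus one duplicated row (the last one)
dat <- tibble(
  treatment = c("treatment1", "treatment2", "treatment1", "treatment2", "treatment1"),
  var1      = c("A", "A", "B", "B", "A"),
  var2      = c("x", "x", "y", "y", "x"),
  var3      = c(3.4, 3.4, 2, 2, 3.4),
  result    = c(10, 5, 4, 5, 10)
)

paired <- dat %>%
  # keep only the first row for each treatment/input combination
  distinct(treatment, var1, var2, var3, .keep_all = TRUE) %>%
  # one row per input combination, one column per treatment
  pivot_wider(names_from = treatment, values_from = result) %>%
  # combinations missing from either treatment cannot be compared
  filter(!is.na(treatment1), !is.na(treatment2)) %>%
  # assumed sign convention: treatment1 minus treatment2
  mutate(difference = treatment1 - treatment2) %>%
  select(var1, var2, var3, difference)

print(paired)
```

Dropping combinations that appear in only one treatment is my own assumption for handling the unequal group sizes; keeping them with `NA` differences would be the alternative.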
Finally, a simple count of the number of positive and negative values in the results column would help, but I can probably figure that out later.
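For the count, something along these lines is probably all that is needed. This is a base R sketch on a made-up vector; `difference` here is a hypothetical stand-in for the difference column of the final dataframe.

```r
# Hypothetical differences, standing in for the real difference column
difference <- c(5, -1, 0, -3, 2)

# sign() maps each value to -1, 0 or 1; table() counts each group
sign_counts <- table(sign(difference))
print(sign_counts)

# Or count each direction explicitly
n_negative <- sum(difference < 0)
n_positive <- sum(difference > 0)
```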
Rather than trying to make a reprex, I think the ggplot2 diamonds dataset should work as an example:
- `cut` as the treatments
- `color`, `clarity`, and `carat` as the input variables
- `price` as the result
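A possible way to set that up, assuming dplyr and ggplot2 are installed. Since `cut` has five levels, picking "Fair" and "Ideal" as the two treatments is an arbitrary choice on my part, and the column names are only there to match the layout above.

```r
library(dplyr)
library(ggplot2)  # only needed for the diamonds dataset

# Assumption: two cut levels stand in for the two treatments
example <- diamonds %>%
  filter(cut %in% c("Fair", "Ideal")) %>%
  transmute(
    treatment = as.character(cut),
    var1      = color,
    var2      = clarity,
    var3      = carat,
    result    = price
  )

# Observations per treatment; the two groups are not the same size
count(example, treatment)
```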
I appreciate any assistance you are willing to give!