Big data: doing a quadrillion calculations?

I'm trying to compute some basic means from a dataset, but it is incomplete. The dataset has two columns: one numeric, and one categorical with two possible values. I have numeric data for every row, but I'm missing the categorical value for 50 of them. I want to calculate the mean and median for each category under every possible assignment of the missing values, to see the possible range and spread. This should help me understand the "true" mean.

I figure this means there are 2^50 different possibilities. Is it possible to calculate all of these in R, or is that too many? I might be able to reduce the number a bit, but not by much.

Apologies if this is a basic question. I'm not massively familiar with R but am trying!


I am trying to calculate the pay gap between men and women. The numeric data is pay, and the categorical data is gender. To calculate the gap, I need to do this calculation: (mean pay for men minus mean pay for women) / (mean pay for men). However, I do not have gender data for 50 people. They could be all men, all women, or any one of the 2^50 possible combinations of men and women. I want to calculate all of the gaps to see which is most likely.
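For concreteness, here is a minimal sketch of that gap calculation in R on invented data (the pay values and `gender` vector below are made up for illustration), using only the rows where gender is known:

```r
# Hypothetical data: pay is complete, gender has missing values (NA)
pay    <- c(52000, 48000, 61000, 45000, 50000, 47000)
gender <- c("M", "F", "M", "F", NA, NA)

# which() drops the NAs, so only rows with a known gender are used
mean_m <- mean(pay[which(gender == "M")])
mean_f <- mean(pay[which(gender == "F")])

gap <- (mean_m - mean_f) / mean_m
gap  # ~0.177 on this made-up data
```

The question is then what happens to `gap` under every possible assignment of the missing genders.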

There's nothing stopping you as long as you have sufficient RAM (and even if not, there are ways of using disk space), but what you are describing does not make sense to me.

FAQ: How to do a minimal reproducible example (reprex) for beginners - meta / Guides & FAQs - RStudio Community

If you aren't working for the Large Hadron Collider... it's too many.

Thank you. I've added an edit to hopefully give more context.

The 2^50 still makes no sense. As nirgrahamuk states, you are not doing theoretical physics.

Check the link I provided and produce a minimal reprex to demonstrate your problem reducing 50 to a much smaller number.

With 3 missing people, the 2^3 = 8 different combinations are shown below; each column would result in a different gap. With 4 missing people there would be 2^4 = 16 combinations, and so on up to 2^50. Does that help?

```r
#> M         M         M         F        F        F        F        M
#> M         M         F         M        F        F        M        F
#> M         F         M         M        F        M        F        F
```
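For a small number of missing rows, enumerating every combination like this is actually feasible in R with `expand.grid`. Here is a sketch on invented data (all pay values and names below are hypothetical), computing the gap for each of the 2^3 assignments:

```r
# Hypothetical data: four rows with known gender, three with unknown gender
pay_known    <- c(52000, 48000, 61000, 45000)
gender_known <- c("M", "F", "M", "F")
pay_missing  <- c(50000, 47000, 55000)

n <- length(pay_missing)
# each row of `grid` is one possible assignment of the missing genders
grid <- expand.grid(rep(list(c("M", "F")), n), stringsAsFactors = FALSE)

gap_for <- function(assignment) {
  gender <- c(gender_known, assignment)
  pay    <- c(pay_known, pay_missing)
  mean_m <- mean(pay[gender == "M"])
  mean_f <- mean(pay[gender == "F"])
  (mean_m - mean_f) / mean_m
}

gaps <- apply(grid, 1, gap_for)  # 2^3 = 8 gaps
range(gaps)                      # the spread of possible gaps
```

This works for 3 missing rows, but the number of rows in `grid` doubles with every extra missing value, which is why 2^50 is out of reach.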

No, I'm afraid that doesn't help.

At this stage I can only repeat that you follow the instructions in the link to produce a minimal reproducible example of the data and the problem.

I'm not sure it makes sense to do that experiment with the missing data. I would just remove it.

If you're intent on doing an experiment, you might consider a Monte Carlo / random-sampling approach instead of testing every single possibility.
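A Monte Carlo version of the experiment might look like the sketch below: instead of all 2^50 assignments, draw a few thousand random ones and look at the distribution of the resulting gaps. All the data here is invented for illustration:

```r
set.seed(1)
# Hypothetical data: four rows with known gender, 50 with unknown gender
pay_known    <- c(52000, 48000, 61000, 45000)
gender_known <- c("M", "F", "M", "F")
pay_missing  <- runif(50, 40000, 60000)

one_draw <- function() {
  # randomly assign a gender to each of the 50 missing rows
  gender <- c(gender_known, sample(c("M", "F"), 50, replace = TRUE))
  pay    <- c(pay_known, pay_missing)
  mean_m <- mean(pay[gender == "M"])
  mean_f <- mean(pay[gender == "F"])
  (mean_m - mean_f) / mean_m
}

gaps <- replicate(10000, one_draw())
quantile(gaps, c(0.025, 0.5, 0.975))  # plausible range of the gap
```

Ten thousand draws give a good picture of the spread without enumerating anything close to 2^50 cases.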


Yes, this. Maybe look into multiple imputation.

What is multiple imputation? Sounds exciting.

What I want to do is find a way to deal with the uncertainty created by having an incomplete dataset.

Multiple imputation (MI) does just that: you randomly impute (fill in) the missing data and calculate an estimate, and you do this multiple times to understand the uncertainty. All my resources are books, but here are some I found by Googling. I've never done MI in R, only SUDAAN, so I don't know how good these packages/sources are:
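To make the idea concrete, here is a hand-rolled sketch of the MI logic in base R on invented data (in practice you would use a dedicated package such as `mice` rather than rolling your own): impute the missing genders by drawing from the observed gender distribution, compute the gap, repeat, and summarise across imputations.

```r
set.seed(7)
# Hypothetical data: 4 rows with known gender, 50 with unknown gender
pay    <- c(52000, 48000, 61000, 45000, runif(50, 40000, 60000))
gender <- c("M", "F", "M", "F", rep(NA, 50))

p_m <- mean(gender == "M", na.rm = TRUE)  # observed proportion of men

impute_once <- function() {
  g    <- gender
  miss <- is.na(g)
  # fill in missing genders by drawing from the observed distribution
  g[miss] <- sample(c("M", "F"), sum(miss), replace = TRUE,
                    prob = c(p_m, 1 - p_m))
  mean_m <- mean(pay[g == "M"])
  mean_f <- mean(pay[g == "F"])
  (mean_m - mean_f) / mean_m
}

m    <- 20                         # number of imputations
gaps <- replicate(m, impute_once())
c(estimate = mean(gaps), spread = sd(gaps))
```

The spread across imputations is what captures the extra uncertainty due to the missing data; real MI packages pool the estimates more carefully (Rubin's rules), but the intuition is the same.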
