Distribution of data

Hello, I am a project manager in a large American city working with multiple related housing datasets, all tied to physical space, which I clean in RStudio and GIS. I don't think it's necessary to provide code; what is most vital is finding a way, through the IDE or particular libraries, to divide up very large datasets.

I have a dataset of over 100,000 apartment-location records with about 10 columns, and a second dataset of about 11,100 rooftop ventilation units installed on those apartment buildings.

I need to link each rooftop fan to the underlying apartment lines it serves, connected through rooftop ductwork running between walls on every floor. This work will be aided by surveys of development staff and examination of building plans.

I believe I will need at least 4 interns to help me. I plan to keep at least 12% of the apartment-location records for myself, since I will ultimately be spot-checking the interns' work, and to distribute 22% each to the 4 interns (12% + 4 × 22% = 100%).
The rooftop units and apartment locations are unevenly distributed across more than a hundred unique bureaucratic categories, coded as consolidation units.
I need some grouping criterion that gives each person a roughly equal share of roof-fan records, at the very least, with each record linked to its physically corresponding apartment ventilation line. I am not a perfectionist, and I am definitely planning for eventual disruptions. That said, I want the split to feel fair before the interns start the deep digging: meticulous data cleaning and merging into GIS shapefiles. By that point, some interns will inevitably have more work to do than others.

Statistics aren't my forte, I must admit, but the fan dataset spans 128 unique consolidation units, averaging about 87 roof fans per unit, and the apartment dataset spans 148 unique consolidation units, averaging about 1,232 locations per unit.

I am not sure whether DescTools or a similar library dedicated to exploratory data analysis is the key, but any pointer in the right direction would be much appreciated.


Do the initial allocation with sample() using replace = FALSE (note the argument is replace, not replacement). Do the subsequent allocations by weekly sampling from the remaining pool. That way, everyone gets the same chance at a representative pool each round. This assumes there is no advantage to the project in having the analysts become attached to any unexamined group of cases.
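A minimal sketch of that idea in R, extended to keep whole consolidation units together (as the question asks) while aiming for the 12% / 22% targets. It assumes the fan records live in a data frame called fans with a consol_unit column; both names are placeholders for whatever your real data uses.

```r
set.seed(42)  # reproducible, so the split can be audited later

# Target shares: 12% held back for spot checks, 22% per intern.
shares <- c(pm = 0.12, intern1 = 0.22, intern2 = 0.22,
            intern3 = 0.22, intern4 = 0.22)

# Greedy assignment: shuffle the consolidation units, then hand each
# whole unit to whoever is currently furthest below their target share.
# Keeping units intact keeps each intern's fans physically grouped with
# their corresponding apartment ventilation lines.
unit_sizes <- table(fans$consol_unit)
units <- sample(names(unit_sizes))   # random order; sample() defaults to replace = FALSE
assigned <- numeric(length(shares)); names(assigned) <- names(shares)
owner <- character(length(units));   names(owner) <- units

for (u in units) {
  deficit <- shares - assigned / sum(unit_sizes)  # who is furthest under target?
  pick <- names(which.max(deficit))
  owner[u] <- pick
  assigned[pick] <- assigned[pick] + unit_sizes[u]
}

fans$assignee <- owner[as.character(fans$consol_unit)]
round(prop.table(table(fans$assignee)), 3)  # check the shares come out roughly right

# Subsequent weekly allocation: draw from whatever is still unexamined,
# again without replacement (assuming a logical `done` column you maintain):
# remaining <- fans[!fans$done, ]
# this_week <- remaining[sample(nrow(remaining), size = 500), ]  # 500 is illustrative
```

Because units are assigned whole, the realized shares will only approximate 12%/22% — a few large consolidation units can skew one person's pile, which matches the "roughly equal, not perfect" goal stated above.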