Regression on multiple data frames simultaneously

I am trying to compare regression outputs for my data set. I have a highly imbalanced binary dependent variable, so I’d like to split the data into multiple smaller data frames, each containing all of the cases from the minority class and an equal number of observations from the majority class.

I’ve managed to break the data up in that manner and create multiple data frames but I have some additional questions:

  1. Is there a way to shuffle the observations from the majority class prior to splitting them up?
  2. As it stands, there are 8 observations (from the majority class) left out of the split as there are 308 observations in the majority class and 30 from the minority class. Is there a way to re-allocate them to 8 of the 10 data frames?
  3. How could I run a regression on all of these frames simultaneously instead of repeating the process manually for each individual data frame?
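library(readr)
library(dplyr)

df0 <- read_csv("PCL_cleaned.csv")

### PCL_Binary_Score is the binary dependent variable; drop rows where it is missing
mydata <- df0[!is.na(df0$PCL_Binary_Score), ]

mydata %>% count(PCL_Binary_Score)

### Sort so the 30 minority cases (coded 1) come first, then the 308 majority cases (coded 0)
df <- mydata %>% arrange(desc(PCL_Binary_Score)) %>% mutate(value = row_number())

### Each sample: all 30 minority rows plus one block of 30 majority rows
mysamples <- lapply(1:10, function(x) df[c(1:30, (x * 30 + 1):((x + 1) * 30)), ])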

I hope my questions are sufficiently clear. Any help would be very appreciated.

Sounds like you're trying to do a convoluted cross-validation here. Without more context about the data, it is difficult to say what you can and cannot do.

There are a lot of other techniques to deal with class imbalance (that is a whole separate topic though).

If you want to split your data into smaller subsets, run the regressions, and get combined estimates then I would recommend cross-validation. Have a look here at a full example running a normal regression in this way: Cross-Validation Essentials in R - Articles - STHDA
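As a rough idea of what that looks like, here is a minimal base-R sketch of k-fold cross-validation. It is not taken from the linked article: the choice of `mtcars`, the model `mpg ~ drat`, five folds, and the names `folds`, `fold_rmse` and `cv_rmse` are all assumptions made for illustration.

```r
set.seed(123)

k <- 5
# Randomly assign every row of mtcars to one of k folds of near-equal size
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# For each fold: fit on the other k - 1 folds, then score on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ drat, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

# Combined estimate: the average held-out error across folds
cv_rmse <- mean(fold_rmse)
cv_rmse
```

Each row is held out exactly once, so the averaged error is an estimate of how the model performs on data it was not fitted to.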


Complete questions attract more informed answers. See the FAQ: How to do a minimal reproducible example (reprex) for beginners.

Given a model for one data frame, it's not difficult to write a function to apply the same model to multiple data frames, provided the data frames have consistent variable names. (If not, some preprocessing will be required to conform each data frame to the exemplar. This raises an important consideration in workflow design—the separation of analysis and preparation. Use short names with a lookup table if needed and save the descriptive longer names for presentation tables.)
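To make the lookup-table idea concrete, here is a small sketch, assuming a data frame arrives with long descriptive column names. The `lookup` vector, the long names and the toy values in `raw` are all made up for illustration.

```r
# Hypothetical lookup table: element names are the short analysis names,
# values are the long descriptive names kept for presentation tables
lookup <- c(mpg = "Miles per gallon", drat = "Rear axle ratio")

# Hypothetical incoming data frame that uses the long names
raw <- data.frame(
  `Miles per gallon` = c(21.0, 22.8, 18.7),
  `Rear axle ratio`  = c(3.90, 3.85, 3.15),
  check.names = FALSE
)

# Rename any column whose long name appears in the lookup table
idx <- match(names(raw), lookup)
names(raw)[!is.na(idx)] <- names(lookup)[idx[!is.na(idx)]]

names(raw)
```

After this step the data frame has the same variable names as the exemplar, so the same model function can be applied to it unchanged.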

To pick a naive example, assume we have a series of data frames like mtcars, structurally identical but differing in the makes and models of cars, and we are interested in regressing mpg on drat.

suppressPackageStartupMessages({
  library(purrr)
})
make_mod <- function(x) lm(mpg ~ drat, data = x)
summary(make_mod(mtcars))
#> 
#> Call:
#> lm(formula = mpg ~ drat, data = x)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9.0775 -2.6803 -0.2095  2.2976  9.0225 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -7.525      5.477  -1.374     0.18    
#> drat           7.678      1.507   5.096 1.78e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.485 on 30 degrees of freedom
#> Multiple R-squared:  0.464,  Adjusted R-squared:  0.4461 
#> F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05

dfs <- list(mtcars = mtcars, mtcars2 = mtcars)

results <- dfs %>% map(make_mod)

summary(results$mtcars)
#> 
#> Call:
#> lm(formula = mpg ~ drat, data = x)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9.0775 -2.6803 -0.2095  2.2976  9.0225 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -7.525      5.477  -1.374     0.18    
#> drat           7.678      1.507   5.096 1.78e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.485 on 30 degrees of freedom
#> Multiple R-squared:  0.464,  Adjusted R-squared:  0.4461 
#> F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05
summary(results$mtcars2)
#> 
#> Call:
#> lm(formula = mpg ~ drat, data = x)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9.0775 -2.6803 -0.2095  2.2976  9.0225 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -7.525      5.477  -1.374     0.18    
#> drat           7.678      1.507   5.096 1.78e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.485 on 30 degrees of freedom
#> Multiple R-squared:  0.464,  Adjusted R-squared:  0.4461 
#> F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05
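
The same pattern can be stretched to cover the three questions in the original post. What follows is only a sketch under stated assumptions: the real data are not available, so `sim` is a simulated stand-in with 30 minority and 308 majority rows, `y` and `x` are hypothetical column names, and a logistic regression via `glm()` is assumed because the outcome is binary.

```r
set.seed(2024)

# Hypothetical stand-in for the real data: 30 minority (1) and 308 majority (0) rows
sim <- data.frame(
  y = c(rep(1, 30), rep(0, 308)),
  x = rnorm(338)
)

minority <- sim[sim$y == 1, ]
majority <- sim[sim$y == 0, ]

# 1. Shuffle the majority rows before splitting
majority <- majority[sample(nrow(majority)), ]

# 2. Deal the shuffled rows into 10 groups; recycling 1:10 over 308 rows
#    gives the first 8 groups 31 rows each and the last 2 groups 30 rows
group <- rep(1:10, length.out = nrow(majority))
majority_parts <- split(majority, group)

# 3. Add all minority rows to each group, then fit the same model to every frame
samples <- lapply(majority_parts, function(part) rbind(minority, part))
fits    <- lapply(samples, function(d) glm(y ~ x, data = d, family = binomial))

# One row of coefficients per data frame, for side-by-side comparison
coefs <- t(sapply(fits, coef))
coefs
```

`lapply()` is used here to keep the sketch free of dependencies; `purrr::map()` as in the example above would work the same way.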

Thank you! This works.