Splitting data with an imbalanced outcome

Hi everybody,
I have a problem with my data; here is an example of my dataset:

structure(list(PatientID = c("P1", "P1", "P1", 
"P2", "P3", "P3", "P4", "P5", 
"P5", "P6"), LesionResponse = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 
    2L, 1L, 2L), .Label = c("0", 
    "1"), class = "factor"), pyrad_tum_original_shape_LeastAxisLength = c(19.7842995242803, 
    15.0703960571122, 21.0652247652897, 11.804125918871, 27.3980336338908, 
    17.0584330264122, 4.90406343942677, 4.78480430022189, 6.2170232078547, 
    5.96309532740722, 5.30141540007441), pyrad_tum_original_shape_Sphericity = c(0.652056853392657, 
    0.773719977240238, 0.723869070051882, 0.715122964970338, 
    0.70796498824535, 0.811937882810929, 0.836458991713367, 0.863337931630415, 
    0.851654860256904, 0.746212862162174), pyrad_tum_log.sigma.5.0.mm.3D_firstorder_Skewness = c(0.367453961973625, 
    0.117673346718817, 0.0992025164349288, -0.174029385779302, 
    -0.863570016875989, -0.8482193060411, -0.425424618080682, 
    -0.492420174157913, 0.0105111292451967, 0.249865833210199), pyrad_tum_log.sigma.5.0.mm.3D_glcm_Contrast = c(0.376932105256115, 
    0.54885738172596, 0.267158344601612, 2.90094719958076, 0.322424096161189, 
    0.221356030145403, 1.90012334870722, 0.971638740404501, 0.31547550396399, 
    0.653999340294952), pyrad_tum_wavelet.LHH_glszm_GrayLevelNonUniformityNormalized = c(0.154973213866752, 
    0.176128379241556, 0.171129002059539, 0.218343919352019, 
    0.345985943932352, 0.164905080489496, 0.104536489151874, 
    0.1280276816609, 0.137912385073012, 0.133420904484894), pyrad_tum_wavelet.LHH_glszm_LargeAreaEmphasis = c(27390.2818110851, 
    11327.7931034483, 51566.7948885976, 7261.68702290076, 340383.536555142, 
    22724.7792207792, 45.974358974359, 142.588235294118, 266.744186046512, 
    1073.45205479452), pyrad_tum_wavelet.LHH_glszm_LargeAreaLowGrayLevelEmphasis = c(677.011907073653, 
    275.281153810458, 582.131636238695, 173.747506476692, 6140.73990175018, 
    558.277670638306, 1.81042257642817, 4.55724031114589, 6.51794350173746, 
    19.144924585586), pyrad_tum_wavelet.LHH_glszm_SizeZoneNonUniformityNormalized = c(0.411899490603372, 
    0.339216399209913, 0.425584323452468, 0.355165782879786, 
    0.294934042125209, 0.339208410636982, 0.351742274819198, 
    0.394463667820069, 0.360735532720389, 0.36911240382811)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", 
"data.frame"))

I need to split this dataset into training, validation, and test sets while preserving the outcome balance of the original dataset.
Secondly, you will see that there are multiple rows belonging to the same patient.

So I did this in order to split my data into three groups, with 60% in training and 20% in each of the other two:

library(rsample) # provides group_initial_split(), training(), testing()

set.seed(1234) # for reproducibility

# first split: 80% train+validation vs. 20% test, keeping each
# patient's rows together
g <- group_initial_split(df, group = PatientID, prop = 0.8)
train <- training(g)
test <- testing(g)
# second split: 3/4 of the 80% (60% overall) for training,
# the remainder (20% overall) for validation
g <- group_initial_split(train, group = PatientID, prop = 3/4)
train <- training(g)
validation <- testing(g)

# check data split proportions
df_list <- list(train = train, validation = validation, test = test)
sapply(df_list, nrow)
#>      train validation       test 
#>        646        203        203
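
One quick way to inspect the outcome balance of each split (a base-R sketch using only table() and prop.table(), applied to the df_list defined above):

# class proportions of LesionResponse within each split
sapply(df_list, function(d) prop.table(table(d$LesionResponse)))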

But when I check the balance of the outcome, I find very imbalanced data in my validation set, with a 55/45 proportion, and 85/15 in the test set... What can I do?

The results are random. As the size of the data increases, the outcome proportions in each split will converge to those of the full dataset.

How would you do it, then? I'm just a beginner and I think I did my selection badly... I was thinking of a loop.

Whether to use a loop or some different procedure is not a helpful question without first thinking through the problem that a training/test division is designed to address. Given a set of observations, how well does a statistical model predict a set of observations yet to be made? There is no guarantee that any future set will have the same split between variables as the existing set. Accordingly, partitioning the data into training, validation, and test sets while preserving the split among variables provides no more information than simply overfitting by using the entire dataset.

In fact, my problem is the following:
- I must predict the feature "LesionResponse" from the others. It's a binary one.
- On the other hand, the base dataset is strongly imbalanced, roughly 70/30 in favor of "1" over "0".
- Consequently, I thought I had to keep the same proportions in the training, validation, and test sets, because otherwise a model could just answer "1" at every prediction and still reach a 70% success rate...
- But, in fact, I have another strong bias, a patient effect, because many rows in my data come from the same patients: some have 10 rows, others 5, and others only 1...
- Thus, I thought I also have to split my data like this: stratify on "LesionResponse" but keep all the rows with the same PatientID together, and give the computer a range of acceptable proportions for the final sets (see the sketch after this message).
Do you think I am wrong?
Sorry if these questions seem basic, but I'm a beginner in this field...
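
For reference, here is a sketch of that idea with rsample. This assumes a recent rsample (>= 1.1.0, where the grouped splitting functions gained a strata argument). Note that the strata variable must be constant within each group, so a patient-level label is derived first; my choice of "0 if the patient has any non-responding lesion" is purely hypothetical. There is no "acceptable range of proportions" argument; stratification only makes the split approximately proportional.

library(dplyr)
library(rsample)

set.seed(1234)

# derive one stratum per patient: the strata variable must be constant
# within each PatientID (hypothetical rule: "0" if the patient has any
# non-responding lesion, "1" otherwise)
df <- df |>
  group_by(PatientID) |>
  mutate(patient_stratum = if_else(any(LesionResponse == "0"), "0", "1")) |>
  ungroup()

# grouped, stratified 80/20 split, then 3/4 vs. 1/4 of the 80%
# (60/20/20 overall), mirroring the earlier code
g <- group_initial_split(df, group = PatientID, prop = 0.8,
                         strata = patient_stratum)
train <- training(g)
test  <- testing(g)
g <- group_initial_split(train, group = PatientID, prop = 3/4,
                         strata = patient_stratum)
train      <- training(g)
validation <- testing(g)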

We all start at the beginning, and sometimes it takes getting a little lost to get on the right track.

What I have found very helpful is to cast a problem in terms of school algebra, f(x) = y, where

x is an object, such as a data frame, from which another object y, such as a test statistic, is to be derived through the application of a function object, f. Each of these may be, and often is, composite. For example, x may be composed of many observations of many variables, y may contain not only a test statistic but also confidence intervals, and f may involve separate applications of various functions to prepare x, nudge it step by step closer to y, and possibly format it for presentation.
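
As a minimal R illustration of that framing (the toy data and f here are mine, purely for illustration):

x <- data.frame(v = c(1, 2, 3, 4)) # x: an object, here a data frame
f <- function(x) mean(x$v)         # f: a function, possibly composite
y <- f(x)                          # y: the derived object
y
#> [1] 2.5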

Let's start with x in your case.

There is a response variable, LesionResponse, which I'll call Y for short, in a data frame, which I'll call d. The patient identifier I'll call id, and there are one or more treatment variables (not necessarily medical treatments, although they may be, but variables that have also been measured), which I'll call X even though there may be X_1, ..., X_n.

The presence of multiple rows with the same id presents the problem of how to treat them. Consider the following:

# fake data
# X: patient identifier
# Y: outcome
set.seed(42)
X <- sample(1:550, 500, replace = TRUE)
Y <- sample(0:1, 500, replace = TRUE)
500 - length(unique(X)) # number of duplicate draws (rows beyond a patient's first)
#> [1] 170
d <- data.frame(X = X, Y = Y)
d |> dplyr::arrange(X) |> head(32)
#>     X Y
#> 1   2 0
#> 2   3 1
#> 3   3 1
#> 4   3 1
#> 5   8 1
#> 6  10 1
#> 7  11 0
#> 8  12 0
#> 9  13 0
#> 10 14 1
#> 11 16 1
#> 12 18 0
#> 13 20 0
#> 14 21 0
#> 15 22 0
#> 16 23 0
#> 17 24 1
#> 18 24 1
#> 19 25 1
#> 20 26 0
#> 21 27 1
#> 22 27 1
#> 23 30 0
#> 24 32 1
#> 25 33 0
#> 26 33 0
#> 27 37 1
#> 28 37 0
#> 29 37 1
#> 30 39 1
#> 31 40 0
#> 32 40 0

Created on 2023-02-17 with reprex v2.0.2

X substitutes for patient identifiers, and they are not unique. Y stands for the outcome, 1/0. Among the n-tuples, sometimes Y is all the same, sometimes all different, and sometimes mixed. For the n-tuples, should one be selected, none, or some? On what basis? Except when the outcome is mortality, perhaps, is there a relationship between a first and a second result? Do we know the time order? What other variable might have changed to account for the difference in response? These effects compound the difficulties of using a data set with multiple observations of the same subject in a split-train-test model evaluation. At a minimum, "over-sampled" individuals will have a disproportionate effect on any estimate of the risk ratio or other likelihood estimate of the probability of Y given X, P(Y|X), biasing the results.
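
To make that concrete, here is one way to count how often the duplicated identifiers carry consistent versus mixed outcomes, using the simulated d above (the dplyr approach is my addition, not part of the original reply):

library(dplyr)

# per simulated patient: number of rows and whether the outcome is mixed,
# then tabulate singletons vs. duplicated patients with consistent vs.
# mixed outcomes
d |>
  group_by(X) |>
  summarise(n = n(), mixed = n_distinct(Y) > 1) |>
  count(duplicated = n > 1, mixed)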

Alternative approaches, perhaps including mixed effects models, may be able to address this. However, I know too little about the overall structure of the data, in terms of the number of variables and whether they are binary, ordinal, categorical, or continuous, and too little about mixed effects models, to offer any concrete suggestion.
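
Purely as an illustration of the kind of model being alluded to (not a recommendation; lme4 and the choice of predictors are my assumptions), a logistic mixed model with a random intercept per patient, fitted to the full dataset, might look like:

library(lme4)

# random intercept per PatientID absorbs the repeated-measures patient
# effect; the fixed effects are an arbitrary subset of the radiomics
# columns from the original data, for illustration only
fit <- glmer(
  LesionResponse ~ pyrad_tum_original_shape_Sphericity +
    pyrad_tum_log.sigma.5.0.mm.3D_glcm_Contrast +
    (1 | PatientID),
  data = df, family = binomial
)
summary(fit)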

Thank you for this very detailed answer, it's very clear!

