Hi everybody,
I have a problem with my data. Here is an example of how my dataset is structured:
structure(list(PatientID = c("P1", "P1", "P1",
"P2", "P3", "P3", "P4", "P5",
"P5", "P6"), LesionResponse = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L,
2L, 1L, 2L), .Label = c("0",
"1"), class = "factor"), pyrad_tum_original_shape_LeastAxisLength = c(19.7842995242803,
15.0703960571122, 21.0652247652897, 11.804125918871, 27.3980336338908,
17.0584330264122, 4.90406343942677, 4.78480430022189, 6.2170232078547,
5.96309532740722, 5.30141540007441), pyrad_tum_original_shape_Sphericity = c(0.652056853392657,
0.773719977240238, 0.723869070051882, 0.715122964970338,
0.70796498824535, 0.811937882810929, 0.836458991713367, 0.863337931630415,
0.851654860256904, 0.746212862162174), pyrad_tum_log.sigma.5.0.mm.3D_firstorder_Skewness = c(0.367453961973625,
0.117673346718817, 0.0992025164349288, -0.174029385779302,
-0.863570016875989, -0.8482193060411, -0.425424618080682,
-0.492420174157913, 0.0105111292451967, 0.249865833210199), pyrad_tum_log.sigma.5.0.mm.3D_glcm_Contrast = c(0.376932105256115,
0.54885738172596, 0.267158344601612, 2.90094719958076, 0.322424096161189,
0.221356030145403, 1.90012334870722, 0.971638740404501, 0.31547550396399,
0.653999340294952), pyrad_tum_wavelet.LHH_glszm_GrayLevelNonUniformityNormalized = c(0.154973213866752,
0.176128379241556, 0.171129002059539, 0.218343919352019,
0.345985943932352, 0.164905080489496, 0.104536489151874,
0.1280276816609, 0.137912385073012, 0.133420904484894), pyrad_tum_wavelet.LHH_glszm_LargeAreaEmphasis = c(27390.2818110851,
11327.7931034483, 51566.7948885976, 7261.68702290076, 340383.536555142,
22724.7792207792, 45.974358974359, 142.588235294118, 266.744186046512,
1073.45205479452), pyrad_tum_wavelet.LHH_glszm_LargeAreaLowGrayLevelEmphasis = c(677.011907073653,
275.281153810458, 582.131636238695, 173.747506476692, 6140.73990175018,
558.277670638306, 1.81042257642817, 4.55724031114589, 6.51794350173746,
19.144924585586), pyrad_tum_wavelet.LHH_glszm_SizeZoneNonUniformityNormalized = c(0.411899490603372,
0.339216399209913, 0.425584323452468, 0.355165782879786,
0.294934042125209, 0.339208410636982, 0.351742274819198,
0.394463667820069, 0.360735532720389, 0.36911240382811)), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
I need to split this dataset into training, validation, and test sets while preserving the class balance of the outcome (LesionResponse) found in the original dataset.
Secondly, as you can see, several rows can belong to the same patient, so the split has to be done by PatientID.
So I did the following to split my data into three groups, with 60% in the training set and 20% in each of the other two:
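library(rsample) # provides group_initial_split(), training(), testing()
set.seed(1234) # for reproducibility
# first split: 80% train + validation, 20% test, grouped by patient
g <- group_initial_split(df, group = PatientID, prop = 0.8)
train <- training(g)
test <- testing(g)
# second split: 3/4 of the remaining 80% for train (60% overall), 1/4 for validation (20% overall)
g <- group_initial_split(train, group = PatientID, prop = 3/4)
train <- training(g)
validation <- testing(g)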
# check data split proportions
df_list <- list(train = train, validation = validation, test = test)
sapply(df_list, nrow)
#> train validation test
#> 646 203 203
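The class balance of each split can be inspected in a similar way. The following is a minimal base R sketch, not the exact code I ran: `outcome_balance` is a hypothetical helper name, the `PatientID` and `LesionResponse` values are copied from the example rows above, and the assignment of patients to the three toy splits is made up purely so the snippet runs on its own.

```r
# Toy stand-in for the real data: values copied from the dput() example above
toy <- data.frame(
  PatientID = c("P1", "P1", "P1", "P2", "P3", "P3", "P4", "P5", "P5", "P6"),
  LesionResponse = factor(c(1, 1, 0, 1, 1, 1, 1, 1, 0, 1), levels = c(0, 1))
)

# Hypothetical assignment of whole patients to splits (illustration only)
toy_list <- list(
  train      = toy[toy$PatientID %in% c("P1", "P3", "P5"), ],
  validation = toy[toy$PatientID %in% c("P2", "P4"), ],
  test       = toy[toy$PatientID == "P6", ]
)

# Proportion of each outcome class within every split (each column sums to 1)
outcome_balance <- function(splits, outcome = "LesionResponse") {
  sapply(splits, function(d) prop.table(table(d[[outcome]])))
}

balance <- outcome_balance(toy_list)
round(balance, 2)
```

With the real data, the same helper would be called as `outcome_balance(df_list)`.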
But when I check the balance of the outcome, I find that the validation and test sets are very imbalanced: the proportions are 55/45 in the validation set and 85/15 in the test set. What can I do?