Generate new balanced data with ROSE

Hello Everyone,

I'm trying to generate a new balanced dataset using ROSE in RStudio.

But it gives an error that I don't understand:

Error in str2lang(x) : <text>:1:299: unexpected input
1: nder_Recode+Employee_ID+Nationality+Gender+Date_Birth+Age_Entrie+Age_Employee+Length_Time_Year+Length_Time_Month+Admission_Date+Admisson_Year+Dismissal_Date+Year_Demisson+Job+Tier+Rank+Business

I already googled for solutions and examples but didn't find any. Can someone help me?

Thanks in advance.

Can you provide a reproducible example? FAQ: How to do a minimal reproducible example (reprex) for beginners


Thanks for the tutorial, it was very helpful.

I think this is right?

rcommunity <- tibble::tribble(
    ~Ative_Inactive,  ~STATUS, ~Gender_Recode, ~Employee_ID, ~Nationality, ~Gender,  ~Date_Birth, ~Age_Entrie, ~Age_Employee, ~Length_Time_Year, ~Length_Time_Month, ~Admission_Date, ~Admisson_Year, ~Dismissal_Date, ~Year_Demisson,   ~Job,  ~Tier,  ~Rank,    ~Business, ~Client, ~Costumer, ~ID_BUM,    ~BUM, ~ID_Manager,      ~Manager, ~Manager_Recode, ~`Rate/_Hour`, ~Annual_Gross_Salary,
                 0L, "ACTIVE",             1L,           3L,         "PT",  "Male", "29/04/1980",         29L,           40L,               11L,                 1L,    "04/01/2010",          2010L,              NA,             NA, "eefv",  "edf", "defg", "Consulting",   "nbm",     "ktb",    917L, "Carol",     "Colle", "Colleague_1",     "Colleague",   "36,00 €",      "39 657,34 €",
                 0L, "ACTIVE",             1L,          17L,         "PT",  "Male", "18/04/1976",         34L,           44L,               10L,                 7L,    "05/07/2010",          2010L,              NA,             NA, "efef", "wdvf", "defe", "Consulting",   "dth",     "nbv",    189L,  "Rick",     "Colle", "Colleague_2",        "Ashley",   "32,50 €",      "33 637,53 €"
    )

head(rcommunity)
#> # A tibble: 2 x 28
#>   Ative_Inactive STATUS Gender_Recode Employee_ID Nationality Gender Date_Birth
#>            <int> <chr>          <int>       <int> <chr>       <chr>  <chr>     
#> 1              0 ACTIVE             1           3 PT          Male   29/04/1980
#> 2              0 ACTIVE             1          17 PT          Male   18/04/1976
#> # ... with 21 more variables: Age_Entrie <int>, Age_Employee <int>,
#> #   Length_Time_Year <int>, Length_Time_Month <int>, Admission_Date <chr>,
#> #   Admisson_Year <int>, Dismissal_Date <lgl>, Year_Demisson <lgl>, Job <chr>,
#> #   Tier <chr>, Rank <chr>, Business <chr>, Client <chr>, Costumer <chr>,
#> #   ID_BUM <int>, BUM <chr>, ID_Manager <chr>, Manager <chr>,
#> #   Manager_Recode <chr>, `Rate/_Hour` <chr>, Annual_Gross_Salary <chr>

Thanks. Can you also share the code you applied to the example data (the code that triggered the error you shared)?


## Create separate variable for terminations

emp$resigned <- ifelse(emp$STATUS == "TERMINATED", "Yes", "No")

## Convert to factor (from character)
emp$resigned <- as.factor(emp$resigned)
summary(emp$resigned)



## Subset the data again into train & test sets based on the admission year (column Admisson_Year).

emp_train <- subset(emp, Admisson_Year < 2019)
emp_test <- subset(emp, Admisson_Year > 2019)

## "Random Over Sampling Examples"; generates synthetic balanced samples

library(ROSE)

emp_train_rose <- ROSE(resigned ~ ., data = emp_train, seed=125)$data


> ERROR:
    
    Error in str2lang(x) : <text>:1:299: unexpected input
1: nder_Recode+Employee_ID+Nationality+Gender+Date_Birth+Age_Entrie+Age_Employee+Length_Time_Year+Length_Time_Month+Admission_Date+Admisson_Year+Dismissal_Date+Year_Demisson+Job+Tier+Rank+Business
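
The position in the error message points into the formula text that ROSE appears to build when it expands `resigned ~ .` into the column names. One possible cause, which nothing in this thread confirms, is a column name that is not a valid R name, such as `Rate/_Hour` in the example data. The sketch below uses a small made-up data frame (`emp_demo` is a hypothetical stand-in for `emp_train`) to show how such names can be found and replaced with base R's `make.names()`.

```r
# Hypothetical stand-in for emp_train; only the column names matter here.
emp_demo <- data.frame(
  resigned      = factor(c("No", "Yes")),
  Gender_Recode = c(1L, 0L),
  `Rate/_Hour`  = c(36, 32.5),
  check.names   = FALSE          # keep the non-syntactic name as it is
)

# Column names that are not valid R identifiers
bad <- names(emp_demo)[names(emp_demo) != make.names(names(emp_demo))]
print(bad)

# Replace them with syntactic, unique names before building a formula
names(emp_demo) <- make.names(names(emp_demo), unique = TRUE)
print(names(emp_demo))
```

If `bad` is empty for the real data, the column names are not the problem and the cause lies elsewhere.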

Is emp_term_train supposed to be emp_train?

Sorry, that was my mistake.

I was trying to find a solution and copied and pasted the wrong line. I have edited the post and put in the correct one:

emp_train_rose <- ROSE(resigned ~ ., data = emp_train, seed=125)$data

My goal is to generate synthetic balanced samples because I have an imbalanced dataset.

The class proportions are 0.7950 to 0.2041, so I need to create a balanced one.

I converted all columns to factor or numeric and tried to run ROSE again.

But now I'm getting this error:

"Error in omnibus.balancing(formula, data, subset, na.action, N, p, method = "rose", :
The response variable has only one class."

I have searched again but still can't find a solution.
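
For reference, converting the columns to factor or numeric might look like the sketch below. It is only an illustration on a made-up two-row data frame shaped like the example data: `emp_demo` and the helper `parse_euro()` are hypothetical names, and the real conversion used in the post was not shown.

```r
# Hypothetical stand-in with the same kinds of columns as the example data.
emp_demo <- data.frame(
  Gender              = c("Male", "Male"),
  Date_Birth          = c("29/04/1980", "18/04/1976"),
  Annual_Gross_Salary = c("39 657,34 \u20ac", "33 637,53 \u20ac"),
  stringsAsFactors    = FALSE
)

# Hypothetical helper: turn text such as "39 657,34 EUR-sign" into 39657.34
parse_euro <- function(x) {
  x <- gsub("[^0-9,]", "", x)                 # keep digits and the decimal comma
  as.numeric(sub(",", ".", x, fixed = TRUE))  # decimal comma -> decimal point
}

emp_demo$Annual_Gross_Salary <- parse_euro(emp_demo$Annual_Gross_Salary)

# Day/month/year text to Date, then to a plain number (days since 1970-01-01)
emp_demo$Date_Birth <- as.numeric(as.Date(emp_demo$Date_Birth, format = "%d/%m/%Y"))

# Remaining character columns become factors
chr_cols <- vapply(emp_demo, is.character, logical(1))
emp_demo[chr_cols] <- lapply(emp_demo[chr_cols], factor)

str(emp_demo)
```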

We can't run your code on your example data and expect ROSE to work.
There are only two example rows in your data, and both have the same value of STATUS, and therefore of resigned, so there is nothing for ROSE to balance.
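
A quick way to see this before calling ROSE is to count the classes of the response. The sketch below is a minimal illustration with made-up STATUS values; `count_classes()` is a hypothetical helper, not part of the ROSE package. The same check could be run on `emp_train$resigned`, since the subset on `Admisson_Year` might also leave only one class in the training set.

```r
# Hypothetical helper: number of classes that actually occur in y
count_classes <- function(y) sum(table(y) > 0)

# The two example rows from the thread: both are ACTIVE
two_rows <- factor(ifelse(c("ACTIVE", "ACTIVE") == "TERMINATED", "Yes", "No"))
print(table(two_rows))
print(count_classes(two_rows))

# A made-up response that contains both classes
mixed <- factor(ifelse(c("ACTIVE", "TERMINATED", "ACTIVE", "ACTIVE") == "TERMINATED",
                       "Yes", "No"))
print(prop.table(table(mixed)))

if (count_classes(two_rows) < 2) {
  message("Only one class present: nothing to balance.")
}
```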
