How to Improve Classification Results

I am currently conducting a study on the predictive quality of betting odds in football (soccer). I have odds from multiple bookmakers for each season and league in the study (structure below). The percentage of correctly predicted matches is rather low: usually between 40% and 50%, sometimes as low as 30%, and rarely over 50%. Is there anything wrong with the code, or with the data I am providing to the decision tree, that is causing such a low percentage?

I have already tried k-fold cross-validation and adding extra data such as Elo ratings, to no avail. I am excluding null values. Teams have been encoded both as factors and as dummy variables.

Structure of Data

    |-----|--------------|---------------|-------|-------|-------|
    | FTR |  Home Team   |  Away Team    |  BetH |  BetD |  BetA |
    |-----|--------------|---------------|-------|-------|-------|
    |  H  |   Chelsea    |   Liverpool   |  1.35 |  3.35 |  2.65 |
    |-----|--------------|---------------|-------|-------|-------|

R Code


    library(C50)      # C5.0 decision trees
    library(gmodels)  # CrossTable

    # `x` is the data frame described above, with FTR in column 1
    DT1 <- x

    set.seed(123)
    DT1$FTR <- as.factor(DT1$FTR)

    # 60/40 train/test split
    DT1.rows <- nrow(DT1)
    DT1.sample <- sample(DT1.rows, DT1.rows * 0.6)

    DT1.train <- DT1[DT1.sample, ]
    DT1.test <- DT1[-DT1.sample, ]

    # Note: the argument is "trials" (boosting iterations), not "trails"
    DT1.model <- C5.0(DT1.train[, -1], DT1.train$FTR, trials = 100)

    plot(DT1.model)
    summary(DT1.model)

    DT1.predict <- predict(DT1.model, DT1.test[, -1])
    CrossTable(
      DT1.test$FTR,
      DT1.predict,
      prop.c = FALSE,
      prop.r = FALSE,
      prop.chisq = FALSE
    )

A random model should average 50%, and the bookmakers can't lay odds that result in consistent losses for themselves. So I have to ask: what is your model? Is your dependent variable the percentage for the season as a whole (OLS regression), or the outcome of each individual match, in which case it should be a binomial GLM?
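To make the GLM suggestion concrete, here is a minimal sketch, assuming a data frame with the FTR/BetH/BetD/BetA columns from the table above (the sample data here is invented for illustration). It models the probability of a home win from the home-win odds with a binomial GLM:

```r
# Illustrative data frame; in practice this would be the real match data
matches <- data.frame(
  FTR  = factor(c("H", "A", "H", "D", "H", "A")),
  BetH = c(1.35, 2.80, 1.50, 2.10, 1.20, 3.40),
  BetD = c(3.35, 3.10, 3.60, 3.20, 4.00, 3.30),
  BetA = c(2.65, 1.45, 2.40, 1.90, 5.50, 1.25)
)

# Binary response: home win vs. not (draws and away wins pooled)
matches$home_win <- as.integer(matches$FTR == "H")

# The inverse of decimal odds is an (overround-inflated) implied probability
fit <- glm(home_win ~ I(1 / BetH), family = binomial, data = matches)
summary(fit)

# Predicted probability of a home win for each match
p_home <- predict(fit, type = "response")
```

Because FTR has three outcomes, the binomial version above only separates home wins from everything else; a multinomial model (e.g. `nnet::multinom`) would cover all three classes at once.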

I am trying to see how accurate the odds are over the course of an entire season. For example, if a season has 380 matches, what percentage of them can the model classify correctly using the odds as information?

See https://rafalab.github.io/dsbook/large-datasets.html#recommendation-systems

Short version:

Divide your data into a training set and a test set.
Use caret::train to fit the model on the training set, then call predict() on the trained model to apply it to the test set and see how well the odds and other factors predict the outcome of each match. RMSE is the usual metric for continuous outcomes, but for classification I prefer a confusion matrix, which gives accuracy directly.
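The steps above can be sketched as follows. This is a hedged example, not your exact pipeline: the caret and rpart packages are assumed installed, and the data frame is randomly generated stand-in data with the column names from the question.

```r
library(caret)  # train(), createDataPartition(), confusionMatrix()

# Stand-in data: outcome plus bookmaker odds (random, for illustration only)
set.seed(123)
n <- 200
dat <- data.frame(
  FTR  = factor(sample(c("H", "D", "A"), n, replace = TRUE,
                       prob = c(0.45, 0.25, 0.30))),
  BetH = runif(n, 1.2, 4.0),
  BetD = runif(n, 2.8, 4.5),
  BetA = runif(n, 1.2, 6.0)
)

# Stratified 60/40 train/test split
idx <- createDataPartition(dat$FTR, p = 0.6, list = FALSE)
train_set <- dat[idx, ]
test_set  <- dat[-idx, ]

# Fit a decision tree via caret; the generic predict() applies it to new data
fit  <- train(FTR ~ BetH + BetD + BetA, data = train_set, method = "rpart")
pred <- predict(fit, newdata = test_set)

# Overall accuracy and per-class statistics from the confusion matrix
confusionMatrix(pred, test_set$FTR)
```

On this random data the accuracy will hover around the majority-class rate, which is exactly the kind of baseline the real odds-based model needs to beat.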

