Selecting a suitable supervised learning algorithm


#1

I want to predict e-customer purchase behavior, i.e. whether an e-customer will buy an item or not. The training dataset I am using for this purpose contains 40596053 records.
I need help, in table form, comparing supervised learning algorithms as a review activity: which algorithm would solve my problem most effectively? Any help will be really appreciated.


#2

That is a very broad question and the answer unfortunately is not simple; it depends very much on your particular case, aim and priorities.

So, a two-class classifier. I would recommend creating a base learner using logistic regression, then upping the complexity with a random forest, and finally you can try a neural network. At each step, you should record the performance and then weigh model performance against model interpretability. Be careful with over-fitting, consider cross-validation, and think about how you record your performance and which measure you use.
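As a rough illustration of the first step (a logistic regression baseline scored with cross-validation), here is a minimal sketch in base R. The data frame `sim`, its columns `x1`, `x2`, `bought`, and the helper `auc()` are all made up for the example; they are not from the original data set, and the 0.5 cut-off and 5 folds are arbitrary choices.

```r
# Simulated stand-in for the real data: two predictors and a binary outcome.
set.seed(42)
n <- 2000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$bought <- rbinom(n, size = 1,
                     prob = plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

# AUC via the rank-sum (Mann-Whitney) formulation; no extra packages needed.
auc <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# 5-fold cross-validation: every row is held out exactly once.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
cv_results <- t(sapply(seq_len(k), function(i) {
  train <- sim[fold != i, ]
  test  <- sim[fold == i, ]
  fit   <- glm(bought ~ x1 + x2, data = train, family = binomial())
  p     <- predict(fit, newdata = test, type = "response")
  c(accuracy = mean((p > 0.5) == (test$bought == 1)),
    auc      = auc(test$bought, p))
}))

# Average held-out accuracy and AUC across the folds.
print(colMeans(cv_results))
```

Recording the same held-out measures for every later model (random forest, neural network) is what makes the comparison fair; which measure matters most depends on how imbalanced the buy/no-buy classes are.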

For consistent model comparisons, I would recommend looking into the mlr package; you can take a look at the official tutorial.

I hope this gets you started and good luck :slightly_smiling_face:


#3

Thank you so much, but my dataset is very large, and logistic regression and random forest are not a good choice because these algorithms do not work well with large datasets. Logistic regression works well with a small dataset, and random forest takes too much time and memory to generate a model, and I have only 8 GB of RAM installed on my system.


#4

No problem. Then split your data set into e.g. 10 chunks, use each chunk separately to build each of the aforementioned models and, finally, combine the 10 models into an ensemble model. You can also run feature importance algorithms and exclude non-informative variables to reduce the size of your data set, and/or use dimensionality reduction techniques.
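The chunk-and-combine idea could look roughly like the sketch below, again in base R on simulated data. The names `sim`, `chunk_id` and `models` are made up for the example, logistic regression stands in for whichever model is being tried, and averaging the predicted probabilities is just one simple way to combine the chunk models (majority voting would be another).

```r
# Simulated stand-in for the real data.
set.seed(1)
n <- 10000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$bought <- rbinom(n, 1, plogis(0.3 + sim$x1 - 0.5 * sim$x2))

# Hold out a test set, then assign the remaining rows to 10 random chunks.
test_idx <- sample(n, 2000)
test  <- sim[test_idx, ]
train <- sim[-test_idx, ]
n_chunks <- 10
chunk_id <- sample(rep(seq_len(n_chunks), length.out = nrow(train)))

# Fit one model per chunk; each fit only sees its own chunk.
models <- lapply(seq_len(n_chunks), function(i) {
  glm(bought ~ x1 + x2, data = train[chunk_id == i, ], family = binomial())
})

# Ensemble: average the predicted probabilities of the 10 chunk models.
prob_matrix <- sapply(models, predict, newdata = test, type = "response")
ensemble_prob <- rowMeans(prob_matrix)
ensemble_class <- as.integer(ensemble_prob > 0.5)

ensemble_accuracy <- mean(ensemble_class == test$bought)
print(ensemble_accuracy)
```

In this toy version everything still sits in memory at once; with the real data, each chunk would presumably be read from disk, fitted, and dropped before the next one is loaded, so that only one chunk plus the fitted models are held in RAM.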

Finally, if you henceforth are going to work on really big data sets - Get. more. ram. :wink:


#5

oh thank you soooooo much sir :blush:


#6

Hi. How many variables or dimensions are in your data set?


#7

dim(mydata)
[1] 40596053 8
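For a sense of scale against the 8 GB of RAM mentioned earlier, here is a back-of-the-envelope estimate of the in-memory size of a table with these dimensions. It assumes every column is stored as a double (8 bytes per value), which the thread does not confirm; integer columns would take half of that, and character or factor columns would differ.

```r
n_rows <- 40596053      # from dim(mydata)
n_cols <- 8             # from dim(mydata)
bytes_per_double <- 8   # assumption: every column is a double

# Raw size of the values alone, in GiB.
size_gib <- n_rows * n_cols * bytes_per_double / 1024^3
print(round(size_gib, 2))   # 2.42
```

That is only the raw data; model fitting typically makes additional copies (for example a model matrix), so the working memory needed can be several times this figure.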


#8

The thesis I am following did the same work with random forest and logistic regression, and its future-work section suggests SVM and NN. I tried NN but got errors, so please suggest another algorithm, sir.


#10

In order to get further help, you will need to supply a reproducible example.