Exploratory Data Analysis for Big Data with continuous and categorical variables (mixed data types)

I have a dataset with 28 variables (6 continuous + 22 categorical).

As you can see below, some of the categorical / factor variables have several levels.

> str(myds)
'data.frame':	841500 obs. of  28 variables:
 $ score                     : num  0 0 0 0 0 0 0 0 0 0 ...
 $ amount_sms_received       : int  0 0 0 0 0 0 3 0 0 3 ...
 $ amount_emails_received    : int  3 36 3 12 0 63 9 6 6 3 ...
 $ distance_from_server      : int  17 17 7 7 7 14 10 7 34 10 ...
 $ age                       : int  17 44 16 16 30 29 26 18 19 43 ...
 $ points_earned             : int  929 655 286 357 571 833 476 414 726 857 ...
 $ registrationDateMM        : Factor w/ 9 levels ...
 $ registrationDateDD        : Factor w/ 31 levels ...
 $ registrationDateHH        : Factor w/ 24 levels ...
 $ registrationDateWeekDay   : Factor w/ 7 levels ...
 $ catVar_06                 : Factor w/ 140 levels ...
 $ catVar_07                 : Factor w/ 21 levels ...
 $ catVar_08                 : Factor w/ 1582 levels ...
 $ catVar_09                 : Factor w/ 70 levels ...
 $ catVar_10                 : Factor w/ 755 levels ...
 $ catVar_11                 : Factor w/ 23 levels ...
 $ catVar_12                 : Factor w/ 129 levels ...
 $ catVar_13                 : Factor w/ 15 levels ...
 $ city                      : Factor w/ 22750 levels ...
 $ state                     : Factor w/ 55 levels ...
 $ zip                       : Factor w/ 26659 levels ...
 $ catVar_17                 : Factor w/ 2 levels ...
 $ catVar_18                 : Factor w/ 2 levels ...
 $ catVar_19                 : Factor w/ 3 levels ...
 $ catVar_20                 : Factor w/ 6 levels ...
 $ catVar_21                 : Factor w/ 2 levels ...
 $ catVar_22                 : Factor w/ 4 levels ...
 $ catVar_23                 : Factor w/ 5 levels ...

My question is: given this situation, how can I explore these variables in order to decide whether or not to include them in the model? (I will use neural networks.)

When the number of levels of the categorical variables is low, I can use tools like:

  • DataExplorer::plot_correlation()
  • Gower / cluster::daisy()

But my problem here is that the amount of data is huge.
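One common way around this is to run the quadratic-cost EDA tools on a random sample rather than on all 841,500 rows. A minimal base-R sketch (the synthetic `myds` below is a stand-in with assumed columns; replace it with the real data frame):

```r
set.seed(42)
# Stand-in for the real data: a small synthetic frame with a similar shape.
# Column names here are taken from the str() output; values are made up.
myds <- data.frame(
  score     = rnorm(50000),
  age       = sample(16:60, 50000, replace = TRUE),
  catVar_08 = factor(sample(paste0("lvl", 1:1500), 50000, replace = TRUE))
)

# Tools like correlation plots or cluster::daisy()'s Gower distance become
# feasible once you work on a ~10k-row sample instead of the full dataset.
n <- 10000
myds_sample <- myds[sample(nrow(myds), n), ]

# Continuous variables: plain correlations against the target, on the sample.
num_cols <- vapply(myds_sample, is.numeric, logical(1))
cors <- cor(myds_sample[, num_cols])["score", ]
print(round(cors, 2))

# Mixed types: these now run in reasonable time on the sample (commented out
# so the sketch stays base-R only):
# DataExplorer::plot_correlation(myds_sample, maxcat = 30)
# d <- cluster::daisy(myds_sample, metric = "gower")
```

If the sample-based picture is stable across a few different seeds, it is usually safe to trust it for screening purposes.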

Thanks!


I would recommend following the CRISP-DM methodology: https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome and http://www.proglobalbusinesssolutions.com/six-steps-in-crisp-dm-the-standard-data-mining-process/

Data preparation is where you will spend most of your time: data cleaning, transformation, reduction, balancing, sampling, and other advanced preparation methods.

What is your target variable/class? The algorithm to use will depend on the business problem and on other factors, such as performance (e.g., how long it takes to build the model with cross-validation) and how easily the model can be interpreted.

As for which attributes to use, there are various attribute selection techniques that will rank attributes (e.g., by Gain Ratio) or select them for you.

Last but not least, know your data; if you have no domain knowledge, research the dataset.

I tried to keep the answer high level without going into detail about which tools to use, because that is up to you.

Hope this helps

Thank you!

Data preparation is already done, and yes, we spent a lot of time on it trying to select the most interesting features for our problem. The current dataset has no missing values. Reduction, feature transformation, value imputation, outlier processing, etc., were all done.

The variable to predict is score, which, as you can see above, is continuous.

That is why I posted this thread: I am trying to figure out which attributes are the most interesting.

We are totally familiar with our data, so if at some point we need to make a decision based on that knowledge, it won't be a problem.

Thank you @Kill3rbee

Still looking for something more specific...

Getting a firm grasp on categorical features, especially when they have high cardinality, is difficult with regular exploratory data analysis approaches. Often, with a data set like this, I will do some exploratory work with a random forest model.

Random forests work very naturally with categorical features, and using various interpretation methods (e.g., variable importance, partial dependence plots) you should be able to identify which features carry a signal and which do not.

This also lets you see the impact of different feature engineering approaches. A neural net is going to require you to convert your features into numeric values one way or another (e.g., one-hot encoding, ordinal encoding). With a random forest you can easily compare the impact these encodings have on performance and on identifying signal. For example, if there is an order to some of your categorical features, then ordinal encoding should improve your RF. If there is no order, then compare how label encoding vs. one-hot encoding impacts performance. This is important because some of your features have high cardinality (e.g., catVar_08), and one-hot encoding them will explode the number of parameters in your neural net, which can have computational and performance impacts.
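To make the three encodings above concrete, here is a small base-R sketch (the toy factor and its level names are illustrative, not from the dataset):

```r
f <- factor(c("low", "mid", "high", "mid"))

# Label / integer encoding: one column, but the order is whatever the factor's
# (alphabetical) level order happens to be -- arbitrary for unordered data.
label_enc <- as.integer(f)

# Ordinal encoding: integer codes that follow a meaningful order you supply.
ord_enc <- as.integer(factor(f, levels = c("low", "mid", "high")))

# One-hot encoding: one 0/1 column per level. For a factor like catVar_08
# (1,582 levels) this alone adds over 1,500 columns to the model matrix.
one_hot <- model.matrix(~ f - 1)
dim(one_hot)  # 4 rows x 3 columns
```

The `- 1` in the `model.matrix` formula drops the intercept so every level gets its own indicator column; with an intercept, one level would be absorbed as the baseline.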

Random forests can also be used to automate the feature selection process. For example, check out the recursive feature elimination functionality in the caret package, and the Boruta algorithm for feature selection.
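A hedged sketch of the variable-importance screening described above, assuming the `ranger` package (a fast random forest implementation that scales to 800k+ rows). The synthetic data frame is a stand-in with assumed columns and an injected signal, so the importance ranking has something to find:

```r
library(ranger)

set.seed(1)
# Stand-in for the real data; only the column names echo the str() output.
myds <- data.frame(
  score     = rnorm(2000),
  age       = sample(16:60, 2000, replace = TRUE),
  catVar_07 = factor(sample(letters[1:21], 2000, replace = TRUE))
)
# Inject a signal on `age` so importance should rank it above the noise factor.
myds$score <- myds$score + 0.05 * myds$age

rf <- ranger(
  score ~ ., data = myds,
  num.trees  = 200,
  importance = "permutation",          # more robust than impurity importance
                                       # when cardinalities differ a lot
  respect.unordered.factors = "order"  # handles high-cardinality factors
)                                      # without one-hot blow-up

sort(rf$variable.importance, decreasing = TRUE)
```

Features whose permutation importance hovers around zero are good candidates to drop before moving on to the neural net.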


Thank you Bradley for your suggestion. I will definitely start reading that guide you suggested.

Just in case, I will keep this thread open for two more days before closing it as solved; maybe I will get another interesting suggestion like yours.

Thanks!


@tlg265, never rush to pick an algorithm upfront. You need to run experiments. This lets you test multiple algorithms and pick one based on the business requirements and your technology infrastructure.

Some people will rush to say: use neural networks, stacking, ensembles, random forests, etc. Do not do that, because the type of data may call for a different algorithm.

I love R and I use it together with Python. However, I took a graduate certificate and learned a lot about proper data mining and algorithm selection. I learned Weka and played around with Orange.

Weka allows you to focus on really cleaning the data and finding the best algorithm without worrying about programming. After deciding on an algorithm, you can always use R or Python for automation, or just use the model from Weka.

This guy here has good tutorials I found easy to understand.

You want to develop a reproducible process for building your model. Is your dataset balanced? How did you test for that? If your data is not sensitive, randomly sample about 10,000 observations, post a link to that dataset as a CSV, and I will play around and see which algorithm works best.

Never just randomly pick an algorithm; you must be able to show why you picked it.
Thanks


That's a good point, thank you!

Amazing! I think Weka has a new user from now on. It looks like a really promising tool.

Looks like the right path.

Unfortunately, my data is sensitive. But if I have a question at some point, I will ask here.

I think this will become my rule of thumb.

Thanks!

@tlg265 here is a link to Weka training and documentation and all the nice stuff.

Also, buy this book if you can. The link is also available under the Book tab on the main link.

Good luck and keep us posted


Thank you, guys, for your advice!

Just in case, I created another topic with a question about the same dataset above.

I put a reference here because linking related posts may enrich the forum's knowledge base.

It is about: what would be a good threshold for converting some categories to Not Specified?
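For reference, one common approach to that question is to lump every level below a frequency threshold into a single level. A minimal base-R sketch (the threshold of 30 and the function name are assumptions for illustration, not a recommendation):

```r
# Collapse every factor level with fewer than `min_n` observations into a
# single catch-all level. `lump_rare` is a hypothetical helper name.
lump_rare <- function(f, min_n = 30, other = "Not Specified") {
  counts <- table(f)
  rare   <- names(counts)[counts < min_n]
  # Assigning the same label to several levels merges them into one.
  levels(f)[levels(f) %in% rare] <- other
  f
}

# Equivalent with the forcats package:
# forcats::fct_lump_min(f, min = 30, other_level = "Not Specified")
```

Usage: `lump_rare(myds$catVar_08, min_n = 30)` would shrink a 1,582-level factor to only the levels seen at least 30 times, plus "Not Specified".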

Thanks!