Machine learning with sparse features + missing data

Wout.dev · June 11, 2021, 6:18am

For a machine learning project, I am trying to predict the rental price and costs of real estate.
To capture geographic differences I use a large number of dummy variables, which makes the data quite sparse. In addition, many of the property characteristics (number of bedrooms, bathrooms, whether there is a garden, ...) have missing values.

Are there any good machine learning algorithms/packages in R that can handle this type of data?

Many thanks.

Rsky · June 11, 2021, 8:37am

Hi @Wout.dev

Use sparseMatrix from the Matrix package:

https://www.rdocumentation.org/packages/Matrix/versions/1.3-4/topics/sparseMatrix

Use mini-batch learning.
Use a sparse model such as the LASSO.
Use a machine with a lot of memory in a cloud service.
R is not very good at processing very large data sets in memory.
(Write data-loader-style functions that create the sparse matrices only when they are needed for processing.)
Derive low-dimensional features, for example with principal component analysis, t-SNE, or UMAP.
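
To illustrate the sparseMatrix suggestion, here is a minimal sketch that one-hot encodes a made-up region column into a sparse matrix. The data and the names region and X are invented for the example, and it assumes the Matrix package, which ships with R.

```r
library(Matrix)

# Hypothetical data: one categorical geography column
region <- factor(c("north", "south", "south", "east", "north", "west"))

# One row per property, one column per region level; only the 1s are stored
X <- sparseMatrix(
  i = seq_along(region),      # row index of each non-zero entry
  j = as.integer(region),     # column index = level code of the factor
  x = 1,                      # recycled to every (i, j) pair
  dims = c(length(region), nlevels(region)),
  dimnames = list(NULL, levels(region))
)

print(X)
```

Each row contains exactly one stored value, so memory use grows with the number of properties rather than with properties times regions.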

I'm a data analyst in a company, and I often use these in practice.

xgboost needs numeric input, so categorical variables must be one-hot encoded.
More techniques may be found in the literature on xgboost.
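
A rough sketch of what that could look like, assuming the xgboost package is installed. The data are synthetic and the parameter values are arbitrary placeholders, not tuned recommendations. The NAs in the numeric column are left in place because xgb.DMatrix treats NA as missing by default.

```r
library(xgboost)

set.seed(1)
n <- 200
# Synthetic data, invented for this example
d <- data.frame(
  city     = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  bedrooms = sample(1:4, n, replace = TRUE)
)
d$rent <- 500 + 200 * d$bedrooms + rnorm(n, sd = 50)
d$bedrooms[sample(n, 30)] <- NA   # simulate missing characteristics

# One-hot encode the factor (one column per level) and keep the NAs in bedrooms
X <- cbind(model.matrix(~ city - 1, data = d), bedrooms = d$bedrooms)

# NA entries are treated as missing values by default
dtrain <- xgb.DMatrix(data = X, label = d$rent)

bst <- xgb.train(
  params  = list(objective = "reg:squarederror", max_depth = 3, eta = 0.1),
  data    = dtrain,
  nrounds = 50
)

pred <- predict(bst, dtrain)
```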

Have a good R life.

Wout.dev · June 11, 2021, 11:37am

Hello,

Thank you for your reply.

arthur.t · June 11, 2021, 9:27pm

Sounds like a fun project.

One thing you should know is that R has a native factor data type, so you shouldn't have to make dummy variables yourself. Look at the output of the following call to see how the dummies are created automatically: model.matrix(Sepal.Length ~ ., data = iris)
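
The same idea on a small rental-style data frame, as a sketch. The listings data and column names are made up, and the second call assumes the Matrix package (shipped with R) to get the sparse version of the same design matrix.

```r
# Hypothetical listings; city is a factor, so no manual dummies are needed
listings <- data.frame(
  rent     = c(950, 1200, 800, 1500),
  bedrooms = c(1, 2, 1, 3),
  city     = factor(c("Ghent", "Antwerp", "Ghent", "Brussels"))
)

# Dense design matrix: the factor is expanded into 0/1 columns automatically
X_dense <- model.matrix(rent ~ ., data = listings)
print(colnames(X_dense))

# The same design matrix, stored sparsely
X_sparse <- Matrix::sparse.model.matrix(rent ~ ., data = listings)
print(class(X_sparse))
```

With the default treatment contrasts, the first factor level (here Antwerp) gets no column of its own and is absorbed into the intercept.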

Lasso and elastic net are great methods for sparse data, especially for first-pass modeling, because they are fast to fit and easy to understand. With caret::train, set the argument method = "glmnet".
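
As a minimal sketch of a lasso fit on a sparse design matrix: this calls the glmnet package directly rather than going through caret::train, and the data are synthetic and exist only for illustration.

```r
library(Matrix)
library(glmnet)

set.seed(1)
n <- 200
# Synthetic data, invented for this example
d <- data.frame(
  city     = factor(sample(paste0("city", 1:30), n, replace = TRUE)),
  bedrooms = sample(1:4, n, replace = TRUE)
)
d$rent <- 500 + 200 * d$bedrooms + rnorm(n, sd = 50)

# Sparse design matrix without the intercept column (glmnet adds its own)
X <- sparse.model.matrix(rent ~ . - 1, data = d)

# alpha = 1 is the lasso; values between 0 and 1 give the elastic net
fit <- cv.glmnet(X, d$rent, alpha = 1)

# Coefficients at the lambda chosen by cross-validation
print(coef(fit, s = "lambda.min"))
```

Because rent in this toy data does not depend on city, most of the city coefficients should be shrunk to or near zero.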

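Finally, if you are imputing missing values with a multivariate imputation method, make sure that you don't use the response variable as an imputation predictor. Otherwise information about the target leaks into the features, and the response will not be available anyway when you predict for new properties.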
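
One possible way to do this, sketched with the mice package (an assumption on my part; other imputation packages have their own mechanisms). The data are synthetic, and rent stands in for the response.

```r
library(mice)

set.seed(1)
n <- 100
# Synthetic data with missing property characteristics (invented for illustration)
d <- data.frame(
  rent      = rnorm(n, 1000, 200),
  bedrooms  = sample(1:4, n, replace = TRUE),
  bathrooms = sample(1:2, n, replace = TRUE),
  area      = rnorm(n, 80, 20)
)
d$bedrooms[sample(n, 15)]  <- NA
d$bathrooms[sample(n, 10)] <- NA

# Rows of the predictor matrix are the variables being imputed, columns are
# the predictors. Zeroing the rent column means rent is never used to impute.
pred <- make.predictorMatrix(d)
pred[, "rent"] <- 0

imp <- mice(d, predictorMatrix = pred, m = 1, printFlag = FALSE, seed = 1)
d_complete <- complete(imp, 1)
```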
