Machine learning with sparse features + missing data

For a machine learning project, I am trying to predict the rental price and costs of real estate.
For this, I use rather a lot of dummy variables to capture differences based on geography, which makes the data rather sparse. Further, for the characteristics of the properties (number of bedrooms, bathrooms, whether there is a garden, ...) there is a lot of missing data.

Are there any good machine learning algorithms/packages in R that can handle this type of data?

Many Thanks.



  • Use a sparse matrix (e.g. Matrix::sparseMatrix)

  • Use mini-batch learning

  • Use a sparse model (LASSO)

  • Use a machine with huge memory in a cloud service

  • R is not very good at processing huge data in memory.
    (Write data-loader-style functions that build the sparse matrices only when they are needed for processing.)

  • Synthesize low-dimensional features, for example with principal component analysis, t-SNE, UMAP, etc.
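To make the first point in the list concrete, here is a minimal sketch of building a sparse design matrix with the Matrix package. The data frame and its column names (postcode, bedrooms, garden) are made up for illustration and are not from the original question.

```r
library(Matrix)  # recommended package, normally installed together with R

set.seed(1)
n <- 2000

# Hypothetical toy data: 200 postcode areas as a high-cardinality factor
rentals <- data.frame(
  postcode = factor(sample(sprintf("P%03d", 1:200), n, replace = TRUE)),
  bedrooms = sample(1:5, n, replace = TRUE),
  garden   = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Dense design matrix: every dummy column stores all of its zeros
x_dense <- model.matrix(~ postcode + bedrooms + garden, data = rentals)

# Sparse design matrix: only the non-zero entries are stored
x_sparse <- sparse.model.matrix(~ postcode + bedrooms + garden, data = rentals)

dim(x_sparse)          # same shape as the dense version
object.size(x_dense)   # memory used by the dense matrix
object.size(x_sparse)  # much smaller when most entries are zero
```

Note that both functions go through model.frame(), which by default drops rows containing NA, so missing values should be handled before this step.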

I'm a data analyst in a company, and I often use these in practice.

xgboost expects a numeric matrix, so categorical variables have to be one-hot encoded first.
More techniques may be found in the literature on xgboost.

Have a good R life.


Thank you for your reply.

Sounds like a fun project.

One thing you should know is that R has a native factor data type, so you shouldn't have to make dummy variables yourself. Look at the output of model.matrix(Sepal.Length ~ ., data = iris) and see how the dummies are created automatically.
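The same expansion on a smaller, made-up data frame may be easier to read. The column names here (rent, area, bedrooms) are hypothetical:

```r
# Hypothetical toy data: 'area' is stored as a factor, not as hand-made dummies
homes <- data.frame(
  rent     = c(900, 1200, 1500, 1100, 1300, 950),
  area     = factor(c("north", "south", "east", "north", "east", "south")),
  bedrooms = c(1, 2, 3, 2, 3, 1)
)

# model.matrix() expands the factor using treatment contrasts: the first
# level ("east", alphabetically) is the baseline and gets no column
X <- model.matrix(rent ~ ., data = homes)
colnames(X)
# "(Intercept)" "areanorth" "areasouth" "bedrooms"

# lm() performs the same expansion internally, so the factor is used directly
fit <- lm(rent ~ area + bedrooms, data = homes)
names(coef(fit))
```

The dummy coefficients are then read as differences from the baseline level.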

Lasso or elastic net are great methods for sparse data, especially for first-pass modeling, because they are fast to fit and easy to understand. With caret::train, use the method = "glmnet" argument.
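As a rough sketch of what a lasso fit on sparse dummies can look like, the example below calls glmnet directly rather than through caret (caret's "glmnet" method wraps the same package). The data is simulated and the names district, bedrooms and rent are assumptions for illustration only.

```r
library(Matrix)
library(glmnet)

set.seed(42)
n <- 500

# Hypothetical toy data standing in for the rental listings
rentals <- data.frame(
  district = factor(sample(sprintf("D%02d", 1:40), n, replace = TRUE)),
  bedrooms = sample(1:5, n, replace = TRUE)
)

# Simulated rent: depends on bedrooms and on a single expensive district
rentals$rent <- 800 + 150 * rentals$bedrooms +
  400 * (rentals$district == "D07") + rnorm(n, sd = 50)

# Sparse one-hot design matrix; the intercept column is dropped because
# glmnet fits its own intercept
x <- sparse.model.matrix(rent ~ ., data = rentals)[, -1]
y <- rentals$rent

# alpha = 1 is the lasso; 0 < alpha < 1 gives the elastic net
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5)

# Coefficients at the cross-validated lambda; uninformative dummies
# are shrunk towards (and often exactly to) zero
coefs <- coef(cv_fit, s = "lambda.1se")
sum(coefs != 0)
```

Because cv.glmnet() accepts the sparse matrix as is, the dummies never have to be materialised as a dense matrix.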

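Finally, if you are imputing missing values with a multivariate imputation method, make sure that you don't use the response variable as an imputation predictor. The response is not known at prediction time, so an imputation model that relies on it cannot be applied to new data.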
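A deliberately simplified base-R sketch of that idea follows. It uses a single regression as a stand-in for a full multivariate imputation method, and all names (floor_area, bedrooms, rent, impute_bedrooms) are hypothetical. With a package such as mice, the equivalent step would be to exclude the response through its predictorMatrix argument.

```r
set.seed(7)
n <- 300

# Hypothetical toy data: 'rent' is the response, 'bedrooms' has gaps
homes <- data.frame(floor_area = round(runif(n, 30, 150)))
homes$bedrooms <- pmax(1, round(homes$floor_area / 30 + rnorm(n, sd = 0.5)))
homes$rent <- 500 + 8 * homes$floor_area + 100 * homes$bedrooms +
  rnorm(n, sd = 80)
homes$bedrooms[sample(n, 60)] <- NA  # make 20% of the values missing

# Split first, so the imputation model is learned from training rows only
train_idx <- sample(n, 200)
train <- homes[train_idx, ]
test  <- homes[-train_idx, ]

# Imputation model uses predictors only; 'rent' is deliberately left out.
# lm() drops the rows where 'bedrooms' is NA by default.
imp_model <- lm(bedrooms ~ floor_area, data = train)

# Fill missing 'bedrooms' with rounded predictions from the imputation model
impute_bedrooms <- function(df, model) {
  missing <- is.na(df$bedrooms)
  df$bedrooms[missing] <- round(
    predict(model, newdata = df[missing, , drop = FALSE])
  )
  df
}

train_imp <- impute_bedrooms(train, imp_model)
test_imp  <- impute_bedrooms(test, imp_model)  # needs no knowledge of 'rent'

anyNA(train_imp$bedrooms)
anyNA(test_imp$bedrooms)
```

Since the imputation model never sees rent, the same function works on new listings whose rent is still unknown.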
