Machine learning with sparse features + missing data

For a machine learning project, I am trying to predict the rental price and costs of real estate.
For this, I use rather a lot of dummy variables to capture differences based on geography, which makes the data rather sparse. Further, for the characteristics of the properties (number of bedrooms, bathrooms, whether there is a garden, ...) there is a lot of missing data.

Are there any good machine learning algorithms/packages in R that can handle this type of data?

Many Thanks.

Hi @Wout.dev

  • Use sparseMatrix from the Matrix package

https://www.rdocumentation.org/packages/Matrix/versions/1.3-4/topics/sparseMatrix

  • Use mini-batch learning

  • Use a sparse model (LASSO)

  • Use a machine with a lot of memory in a cloud service

  • R is not very good at processing huge data in memory.
    (Write functions, like a data loader, that create the sparse matrices only when they are needed for processing.)

  • Derive low-dimensional features, for example with principal component analysis, t-SNE, UMAP, etc.
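To make the first bullet concrete, here is a minimal sketch using the Matrix package. The listings data frame and its column names are made up for illustration; sparse.model.matrix builds the dummy columns directly in sparse form, and sparseMatrix builds the same kind of object by hand.

```r
library(Matrix)

# Toy listings; the column names and values are made up for illustration.
listings <- data.frame(
  city     = factor(c("Ghent", "Antwerp", "Ghent", "Leuven", "Antwerp")),
  bedrooms = c(2, 3, 1, 2, 4)
)

# Build the dummy-coded design matrix directly in sparse form.
# "- 1" drops the intercept so every city gets its own indicator column.
X <- sparse.model.matrix(~ city + bedrooms - 1, data = listings)
class(X)  # a sparse dgCMatrix
dim(X)    # 5 rows, 4 columns (3 cities + bedrooms)

# The same kind of matrix can be built by hand from (row, column, value)
# triplets: one indicator per row, placed in that row's city column.
X_manual <- sparseMatrix(
  i = seq_len(nrow(listings)),
  j = as.integer(listings$city),
  x = 1,
  dims = c(nrow(listings), nlevels(listings$city)),
  dimnames = list(NULL, levels(listings$city))
)
```

Only the non-zero entries are stored, which is what keeps memory use manageable when there are many geography dummies.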

I'm a data analyst in a company, and I often use these in practice.

xgboost needs numeric input, so categorical variables have to be one-hot encoded.
More techniques can be found in the literature on xgboost.
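As a rough sketch of what that could look like, the example below one-hot encodes a factor into a sparse matrix and passes it to xgboost. The data are simulated and the parameter values are arbitrary choices for illustration, not recommendations.

```r
library(Matrix)
library(xgboost)

set.seed(7)
n <- 300

# Simulated flats; names and numbers are made up for illustration.
flats <- data.frame(
  district = factor(sample(paste0("d", 1:30), n, replace = TRUE)),
  bedrooms = sample(1:4, n, replace = TRUE)
)
price <- 400 + 200 * flats$bedrooms + rnorm(n, sd = 40)

# One-hot encode the factor into a sparse matrix ("- 1" drops the intercept).
X <- sparse.model.matrix(~ . - 1, data = flats)

# xgb.DMatrix accepts the sparse matrix directly.
dtrain <- xgb.DMatrix(data = X, label = price)

bst <- xgb.train(
  params  = list(objective = "reg:squarederror", max_depth = 3),
  data    = dtrain,
  nrounds = 50
)

pred <- predict(bst, dtrain)
```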

Have a good R life.

Hello,

Thank you for your reply.

Sounds like a fun project.

One thing you should know is that R has a native factor data type, so you shouldn't have to make dummy variables yourself. Look at the output of this call to see how the dummies are created automatically: model.matrix(Sepal.Length ~ ., data = iris)
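Since the question also involves missing data, here is a small base R sketch on a made-up data frame. It shows the dummy columns that model.matrix creates from a factor, and that rows containing NA are dropped by default unless the model frame is built with na.action = na.pass.

```r
# Toy data with missing values; the names and values are made up.
homes <- data.frame(
  city     = factor(c("A", "B", NA, "C")),
  bedrooms = c(1, 2, 3, NA)
)

# Default behaviour: rows with any NA are dropped (na.action = na.omit).
mm_default <- model.matrix(~ city + bedrooms, data = homes)
colnames(mm_default)  # "(Intercept)" "cityB" "cityC" "bedrooms"
nrow(mm_default)      # 2 of the 4 rows are left

# Keep every row by building the model frame with na.pass first;
# the missing entries then show up as NA in the matrix.
mf      <- model.frame(~ city + bedrooms, data = homes, na.action = na.pass)
mm_keep <- model.matrix(~ city + bedrooms, data = mf)
nrow(mm_keep)         # all 4 rows
```

Level "A" has no column of its own because it is the reference level absorbed by the intercept.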

Lasso or elastic net are great methods for sparse data, especially for a first modeling pass, because they are fast to fit and easy to understand. With caret::train, set the argument method = "glmnet".
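A minimal sketch of such a first pass, on simulated data with no missing values. The variable names, the lambda grid, and the 5-fold cross-validation are assumptions made for the example.

```r
library(caret)

set.seed(42)
n <- 200

# Simulated rentals; all names and numbers here are made up.
rentals <- data.frame(
  district = factor(sample(paste0("d", 1:20), n, replace = TRUE)),
  bedrooms = sample(1:4, n, replace = TRUE),
  garden   = factor(sample(c("no", "yes"), n, replace = TRUE))
)
rentals$price <- 500 + 150 * rentals$bedrooms +
  100 * (rentals$garden == "yes") + rnorm(n, sd = 50)

# alpha = 1 is the lasso; values between 0 and 1 give an elastic net.
grid <- expand.grid(alpha = 1, lambda = 10^seq(-2, 2, length.out = 20))

# The formula interface expands the factors into dummies automatically.
fit <- train(
  price ~ ., data = rentals,
  method    = "glmnet",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid  = grid
)

fit$bestTune                                    # selected alpha and lambda
coef(fit$finalModel, s = fit$bestTune$lambda)   # zero rows are dropped dummies
```

Coefficients shrunk exactly to zero correspond to dummies the lasso considers uninformative, which is what makes the fit easy to read.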

Finally, if you are imputing missing values with a multivariate imputation method, make sure that you don't use the response variable as an imputation predictor.
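One simple way to follow that advice is to drop the response column before imputing and re-attach it afterwards. The sketch below uses the mice package as one example of a multivariate imputation method; the data are simulated and the settings (a single imputation, default methods) are only for illustration.

```r
library(mice)

set.seed(1)
n <- 100

# Simulated homes; names and numbers are made up for illustration.
area  <- round(runif(n, 40, 200))
homes <- data.frame(
  area      = area,
  bedrooms  = pmax(1, round(area / 40 + rnorm(n, sd = 0.5))),
  bathrooms = pmax(1, round(area / 80 + rnorm(n, sd = 0.5))),
  price     = 300 + 6 * area + rnorm(n, sd = 50)
)

# Knock out some feature values to mimic missing characteristics.
homes$bedrooms[sample(n, 20)]  <- NA
homes$bathrooms[sample(n, 30)] <- NA

# Impute on the features only: the response is left out entirely.
features <- homes[, setdiff(names(homes), "price")]
imp <- mice(features, m = 1, printFlag = FALSE, seed = 1)
features_complete <- complete(imp, 1)

# Re-attach the untouched response for modelling.
homes_complete <- cbind(features_complete, price = homes$price)
```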