Why does the number of columns reduce after using model.matrix in ridge regression?

Hi, I'm working on the House Price Prediction problem (dataset from Kaggle), and I want to use lasso regression and ridge regression. The problem is that after using model.matrix, the number of columns is reduced. I've already checked my dataset and I don't have any NA values. Is there any other reason this could happen? Can someone please help me resolve this problem? Thanks a lot!

Here is the link to the dataset: House Prices - Advanced Regression Techniques | Kaggle

Here is what I do to the train dataset, and I do the same thing to the test dataset: I only use the numeric variables and I removed the ones with a lot of NA values. I just want to practice lasso and ridge regression.

Then, here is what I do to transform my data frame into a matrix.

train_x <- model.matrix(SalePrice~., data = PBdata[, -SalePrice])
train_y <- PBdata$SalePrice
dim(PBdata)
dim(train_x)

[1] 1259   34
[1] 1259   30

I would first find out which variables are dropped, and then investigate further from there.

# train_x is a matrix, so use colnames() rather than names()
setdiff(names(PBdata),
        colnames(train_x))

Hi, thank you for your response. I tried your code. I couldn't find any problem with the dropped variables, and even when I deleted those dropped variables and tried the transformation again, some other variables were still dropped. So I'm totally confused now.

These are the variables dropped the first time:
"TotalBsmtSF" "X1stFlrSF" "X2ndFlrSF" "LowQualFinSF"

Then I deleted them, and these are the variables dropped the second time:
"GrLivArea" "BsmtFullBath" "BsmtHalfBath" "FullBath"

I'm looking at this and the syntax seems very unusual to me.
Is PBdata a conventional data.frame or something else?
The only way I can think that your code here might run is if SalePrice is not only a column in PBdata, but also some simple object in your workspace, maybe an integer vector of length 4, that identifies 4 columns to drop each time you run this code.
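To illustrate that guess, here is a minimal sketch on a made-up data frame (the name `toy` and its values are hypothetical, not taken from the Kaggle data). It assumes a stray integer vector called `SalePrice` is sitting in the workspace. Inside `[`, the name `SalePrice` then resolves to that vector rather than to the column, so the negative index silently drops columns by position.

```r
# Hypothetical toy data standing in for PBdata
toy <- data.frame(
  a = 1:5, b = 6:10, c = 11:15, d = 16:20,
  SalePrice = c(100, 200, 150, 300, 250)
)

# Assumed leftover object in the workspace with the same name as the column
SalePrice <- c(2L, 3L)

# Inside `[`, SalePrice refers to the workspace vector, not toy$SalePrice,
# so this drops columns 2 and 3 (b and c) by position
subset_toy <- toy[, -SalePrice]
dropped <- setdiff(names(toy), names(subset_toy))
print(dropped)

# Without that object, the same line fails because SalePrice cannot be found
rm(SalePrice)
result <- tryCatch(toy[, -SalePrice], error = function(e) "error")
print(result)
```

If this is what is happening, it would also explain why a different set of columns goes missing after the first four are deleted: the positions stay the same, but different columns now occupy them.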

SalePrice is the name of a column in PBdata, and PBdata is the name I gave to the dataset.
So should I try renaming the SalePrice column, or try to find the object named SalePrice and delete it?

If you type

SalePrice

into the console and it returns something, then you know that's probably the problem here, and

rm(SalePrice)

would remove it.
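A slightly more defensive version of the same check, together with the model.matrix call without the extra subsetting, might look like the sketch below. It uses a hypothetical toy data frame in place of PBdata. The point is only that the formula `SalePrice ~ .` already keeps the response out of the design matrix, so `data` can be the full data frame and the `[, -SalePrice]` part is not needed.

```r
# Hypothetical toy data standing in for PBdata
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, 4, 3, 6),
  SalePrice = c(100, 200, 150, 300, 250)
)

# Remove a stray workspace object named SalePrice, if one exists
if (exists("SalePrice", envir = globalenv(), inherits = FALSE)) {
  rm(SalePrice, envir = globalenv())
}

# The formula already excludes the response, so pass the whole data frame
train_x <- model.matrix(SalePrice ~ ., data = toy)
train_y <- toy$SalePrice

# Columns are "(Intercept)" plus every predictor
print(colnames(train_x))

# The only name missing from the matrix should be the response itself
print(setdiff(names(toy), colnames(train_x)))
```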

Thank you so much, the problem is solved!
