Cannot assign predicted values to a new column due to missing values

I am working on a dataset that contains several missing values. I fitted a linear mixed model and predicted BLUE values, then tried to insert these values into a new column at their corresponding rows, but I get the following error:

“Error in `$<-.data.frame`(`*tmp*`, Predicted_BLUE, value = c(0.0162056671626986,  : 
  replacement has 49 rows, data has 65”

This happens because 16 of the 65 rows in my dataset have missing values. Any idea how to overcome this problem without eliminating the missing values from the model?

At the simplest level all you want to do is add a vector to a data.frame as a column. Since your prediction vector has fewer elements than your data.frame has rows, you need to ensure you are pairing the right prediction to the right data...

Here is a simple example which will illustrate what is happening:

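df <- data.frame(y = 1:3, x = c(1, NA, 3)) # x is missing in row 2
z <- c(1, 3) # stand-in for predictions: one value per complete row
df$z <- z # error: replacement has 2 rows, data has 3
df[!is.na(df$x), "z"] <- z # no error; z is left as NA in row 2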

So, what we've done is simply insert into a subset of the data.frame. By selecting only those rows for which x was not NA, we made sure the number of rows selected on the left matched the number of elements on the right. If you have multiple predictors, a row can be assigned a predicted value only if it has no NA values in any of those columns. Thankfully R has a simple function to help with that. You can do,

df[complete.cases(df), "z"] <- z
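Since the original model isn't shown, here is a minimal sketch using `lm()` as a stand-in for the mixed model (the data and column names are made up). It shows two other ways to line predictions up with rows: matching on the row names that `fitted()` carries, or fitting with `na.action = na.exclude` so the result is already padded with NA. Whether your mixed-model function honours `na.exclude` the same way is an assumption you would need to check in its documentation.

```r
# Toy data; values and column names are hypothetical.
df <- data.frame(y = c(2.1, 3.9, 6.2, 8.1, 9.8),
                 x = c(1, 2, NA, 4, 5))

# Default NA handling (na.omit): the row with missing x is dropped,
# so fitted() returns 4 values named by the rows that were used.
fit <- lm(y ~ x, data = df)
df$pred <- NA_real_
df[names(fitted(fit)), "pred"] <- fitted(fit)

# na.exclude: same fit, but fitted() pads the dropped row with NA,
# so the result has one element per row of df.
fit_ex <- lm(y ~ x, data = df, na.action = na.exclude)
df$pred_ex <- fitted(fit_ex)
```

Both columns end up with NA in the row where x is missing. Note also that `complete.cases(df)` looks at every column, so if the data.frame has columns that are not in the model, it may be safer to pass only the model's columns, e.g. `complete.cases(df[, c("y", "x")])`.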

Missing values are often an issue. Ideally one goes back to the owner of the data and tries to get the values filled in, but this is often not possible. So the remaining choices are to eliminate observations having missing values, make up values, or possibly treat missing values as their own category.

Sometimes missing values are made up using the mean or mode, or even random values from a distribution having the mean or mode of the remaining values.
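As a rough sketch of what that can look like in base R (the vectors are made-up examples, and drawing from a normal distribution is just one assumed choice of distribution):

```r
x <- c(4, NA, 6, 8, NA)          # numeric variable with gaps
g <- c("a", "b", NA, "b", "a", "b")  # categorical variable with a gap

# Mean imputation: replace each NA with the mean of the observed values.
x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Mode imputation: replace each NA with the most frequent observed value.
g_mode <- ifelse(is.na(g), names(which.max(table(g))), g)

# Random imputation: draw from a normal distribution with the
# mean and standard deviation of the observed values.
set.seed(1)
x_rand <- x
x_rand[is.na(x)] <- rnorm(sum(is.na(x)),
                          mean = mean(x, na.rm = TRUE),
                          sd = sd(x, na.rm = TRUE))
```

Keep in mind that any of these changes the data the model sees, so it is worth recording which values were imputed.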

I once saw missing values treated as their own category. The variable was gender. The analyst decided there might be something important about users who chose not to reveal their gender (or chose not to use a binary classification of gender), and so the analyst simply converted missing gender values to "Other" and treated gender as a factor with three levels.
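A minimal sketch of that recoding, with a made-up vector and assumed level labels "F" and "M":

```r
gender <- c("F", "M", NA, "F", NA)

# Recode NA as "Other", then build a three-level factor.
gender_chr <- ifelse(is.na(gender), "Other", gender)
gender_f <- factor(gender_chr, levels = c("F", "M", "Other"))

table(gender_f)
```

With the missing values recoded, no rows are dropped for this variable when the factor is used in a model.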
