Symbol deletion and formatting

Hello; I'm facing some issues with the following:

  1. One variable in my data set has several rows with the symbol "?". I want to delete all the rows where this variable = "?". I have tried: data[data$customer_id != "?", ] and data2 <- subset (data, customer_id !="?") but it didn't work.
  2. The same variable, customer_id, contains purely numeric values in some rows and a mix of letters and digits in others, for example 254879 and B32598. When I run str(data), it is reported as a Factor and I want to change it to character. I have used: data$customer_id_adj <- as.character(data$customer_id.factor) but it didn't work. Any suggestions?

Thank you!

It will be easiest for folks to help you if you create a reproducible example (reprex).

A few thoughts:

Regarding #1
If comparing against "?" doesn't seem to work, my guess is that there is other data in the field that keeps the condition always true. This could be whitespace such as spaces or tabs, or non-printable characters. If you find a row with a "?" in it, you can check how many characters the value contains with the nchar function (convert a factor with as.character first, since nchar expects a character vector).
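
To make the whitespace idea concrete, here is a minimal sketch. The data frame is made up for illustration (the real data was not posted), and it assumes the stray characters are ordinary spaces or tabs, which is what trimws strips by default.

```r
# Hypothetical data: one clean "?", one with a trailing space, one with a leading tab
test <- data.frame(
  rowId = 1:5,
  customer_id = c("B32598", "?", "? ", "\t?", "254879"),
  stringsAsFactors = FALSE
)

# A plain comparison only drops the row that is exactly "?"
naive <- test[test$customer_id != "?", ]

# nchar() needs a character vector, so convert in case the column is a factor
ids <- as.character(test$customer_id)
id_lengths <- nchar(ids)  # a clean "?" has length 1; padded ones are longer

# trimws() removes leading and trailing spaces, tabs and newlines before comparing
cleaned <- test[trimws(ids) != "?", ]
```

If id_lengths shows values greater than 1 for rows that look like "?", padding is the likely culprit, and comparing against the trimmed value should remove them. Other non-printable characters would need a different cleanup step.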

Regarding #2
It depends on exactly what you want, but you could ensure the column is a character column to begin with by setting stringsAsFactors = FALSE when you read the data into your data frame.
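
As a rough sketch of both routes, assuming the data comes from a CSV file: the csv_text string below is a made-up stand-in for the real file, and read.csv is only a guess at how the data was loaded. Note that the conversion is applied to the customer_id column itself. The data$customer_id.factor in the question refers to a column that may not exist under that name, which could be why that attempt failed.

```r
# Hypothetical CSV content standing in for the real file
csv_text <- "rowId,customer_id
1,B32598
2,254879
3,?
4,B12768"

# Route 1: keep text columns as character when reading
# (stringsAsFactors = FALSE is already the default from R 4.0.0 onwards)
data <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is_char <- is.character(data$customer_id)

# Route 2: the column was read as a factor, so convert it afterwards
data_f <- read.csv(text = csv_text, stringsAsFactors = TRUE)
data_f$customer_id <- as.character(data_f$customer_id)
```

Running str(data) after either route should report customer_id as chr rather than Factor.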

Can you give an example of your current code?

This seems to work for me:

# Option 1: create the data frame with stringsAsFactors = FALSE
test <- data.frame(rowId = 1:5, customer_id = c("B32598", "254879", "?", "B12768", "?"), stringsAsFactors = FALSE)
# Option 2: create it with the defaults, then convert the column to character afterwards
test <- data.frame(rowId = 1:5, customer_id = c("B32598", "254879", "?", "B12768", "?"))
test$customer_id <- as.character(test$customer_id)
# Subset: keep only the rows where customer_id is not "?"
test <- test[test$customer_id != "?", ]
