Problems with NA

I have many variables that, for some reason, were imported incorrectly from the database: the missing values came in as the literal word "NA", for both numeric and character variables, so all of those variables ended up as character.

I tried to change it with:

df %>% 
mutate(across(where(is.character),
                  ~str_replace(., "NA",
                              NA)))

But I get this error:

Caused by error in `str_replace()`: `replacement` must be a character vector, not `NA`

And if I do:

df %>% 
  mutate(across(where(is.character),
                ~str_replace(., "NA",
                             NA_character_)))

it works, but the numeric variables remain character, which makes it impossible for me to analyze them.

Any solution?

EDIT:
I am adding a reprex to show my problem. I made a simple tibble, but in real life more than 200 columns are affected.

# A tibble: 15 × 4
   a     b     c     d    
   <chr> <chr> <chr> <chr>
 1 1     car   NA    NA   
 2 3     car   7     NA   
 3 2     bike  NA    NA   
 4 2     NA    NA    NA   
 5 2     NA    7     NA   
 6 2     NA    7     NA   
 7 3     bike  7     NA   
 8 1     NA    NA    blue 
 9 2     NA    7     NA   
10 NA    bike  NA    NA   
11 NA    bike  NA    red  
12 2     bike  6     red  
13 1     bike  6     NA   
14 NA    NA    6     red  
15 1     car   6     NA  
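
A note on the error above: a bare `NA` is logical in R, while `str_replace()` expects a character replacement, which is why `NA_character_` is accepted. As a minimal sketch of an alternative (the small `df` below is a hypothetical stand-in, not the real data), `dplyr::na_if()` replaces only values that are exactly equal to "NA", so strings that merely contain those letters are left alone:

```r
library(dplyr)

# Hypothetical stand-in for the real data
df <- tibble(a = c("1", "NA", "2"),
             b = c("car", "NA", "BANANA"))

# Turn exact "NA" strings into real missing values in every character column
fixed <- df |>
  mutate(across(where(is.character), ~ na_if(.x, "NA")))

fixed
```

This only fixes the missing values; the columns are still character afterwards, so the type conversion has to be handled separately.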

If you call as.numeric() over a vector like e.g. c("1", "2", "3", "NA"), then you'll get c(1, 2, 3, NA), so perhaps call that in your mutate()?
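
A small sketch of what that suggestion does, using made-up vectors. Note that `as.numeric()` emits a coercion warning for anything it cannot parse (suppressed here), and that applying it to a genuinely categorical column discards every value:

```r
# A numeric column stored as text: the string "NA" becomes a real NA
num_as_text <- c("1", "2", "3", "NA")
as_num <- suppressWarnings(as.numeric(num_as_text))
as_num

# A genuinely categorical column: nothing can be parsed, so every value is lost
labels <- c("car", "bike", "NA")
lost <- suppressWarnings(as.numeric(labels))
lost
```

So the conversion should probably be applied only to columns that really hold numbers.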

The problem with that solution is that the variables that really are character end up with NA in places where there are valid answers.

You cannot have mixed data types in a column, so I'm assuming that some columns contain values which should be treated as numeric and others as character?

To provide further assistance, can you give an example of your data? You can use the dput() function for this purpose.

To convert to numeric values I use:

dados$coluna <- as.numeric(dados$coluna)

But keep in mind that numeric values must use "." as the decimal separator, not ",", so replace the commas with periods first:

dados$coluna <- gsub(",", ".", dados$coluna)

After that you can use as.numeric() on the column.
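
The two steps can be sketched together as follows. The vector `coluna` is a hypothetical example, and this assumes commas appear only as decimal separators (no thousands separators):

```r
# Hypothetical column that uses "," as the decimal separator
coluna <- c("1,5", "2,25", "NA")

# Swap commas for periods, then convert; "NA" becomes a real NA
# (as.numeric() warns about that coercion, suppressed here)
coluna_num <- suppressWarnings(as.numeric(gsub(",", ".", coluna, fixed = TRUE)))
coluna_num
```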

Hi, the problem is that by doing that I lose the values of the categorical variables that were categorized correctly.

Hi, I edited the post adding a reprex at the end so it is better understood.

# Removing the NA values
dados_selecionados <- dados_selecionados %>% na.omit()

Remove the NA values first.

That's not a reprex; I cannot copy/paste that dataset into my session. If your data is called e.g. my_data, then run e.g. dput(head(my_data, 30)) and copy the output into your question.

Having said that, I suggest you solve the problem earlier by defining your NAs when you read the data, e.g.:

readr::read_csv(file = "~/path/to/my/file.csv", na = c("", "NA", "-99", "_", "other_NAs..."))

See ?readr::read_csv for more info.
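
A minimal sketch of that idea, using a made-up inline CSV instead of a real file (the contents of `csv_text` and the particular `na` strings are assumptions for illustration). With the `na` strings declared at read time, readr can guess numeric types for the columns that only contain numbers and missing values:

```r
library(readr)

# Hypothetical inline CSV standing in for the real file
csv_text <- "a,b,c\n1,car,NA\n3,NA,7\n-99,bike,_\n"

# I() tells readr the string is literal data, not a file path
dat <- read_csv(I(csv_text),
                na = c("", "NA", "-99", "_"),
                show_col_types = FALSE)

# Inspect the guessed column types
sapply(dat, class)
```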

Hi, sorry for the delay, here is the reprex:

structure(list(a = c("1", "3", "2", "2", "2", "2", "3", "1", 
"2", "NA", "NA", "2", "1", "NA", "1"), b = c("car", "car", "bike", 
"NA", "NA", "NA", "bike", "NA", "NA", "bike", "bike", "bike", 
"bike", "NA", "car"), c = c("NA", "7", "NA", "NA", "7", "7", 
"7", "NA", "7", "NA", "NA", "6", "6", "6", "6"), d = c("NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "blue", "NA", "NA", "red", 
"red", "NA", "red", "NA")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -15L))

Defining the NAs when reading the data is a good suggestion. I tried it, and although "NA" is now recognized as NA, the numeric variables are still character. That did not change at all, and it is what I need to solve.

It's unclear what goes wrong when reading the data, but perhaps try this approach then:

clean <- function(x){
  x[which(x == "NA")] <- NA
  if( any(grepl("\\d+", x)) ){ x <- as.numeric(x) }
  return(x)
}
my_data |>
  dplyr::mutate(a = clean(a),
                b = clean(b))

Where my_data is your dput() output.

Hello, thanks for taking the time to help me.
Your solution works perfectly on the reprex I published; the problem comes when I apply it to my real database. It works fine for many variables, but for some of the character variables it returns all NA.
Here is a reprex with 3 variables where this happens:

structure(list(a4_tag = c("Santa Rosa de Lima", "Alto Comedero", 
"Malvinas", "Malvinas", "Malvinas", "Vº Jardín de Reyes", "Vº Jardín de Reyes", 
"Santa Clara", "Loteo Don Emilio", "Asentamiento Alcobedo", "Alto Comedero", 
"Alto Comedero", "Santa Rosa", "Ejército del norte", "Vº Jardín de Reyes"
), barrio_tag = c("Santa Rosa de Lima", "18 Viviendas, Alto Comedero", 
"Malvinas", "Malvinas", "Malvinas", "Vº Jardín de Reyes", "Vº Jardín de Reyes", 
"Santa Clara", "Loteo Don Emilio", NA, "Alto Comedero", "Alto Comedero", 
"Santa Rosa", "Ejército del Norte", "Vº Jardín de Reyes"), 
    calle_tag = c("Guatemala", "Mzna. AP3 Lote 13", "Giochino", 
    "Ilegible", "Ilegible", "J. Cafrune", "J. Cafrune", "11 de Agosto", 
    "Congreso", "Sin dato", NA, NA, "137 Viviendas. Block \"L\"", 
    "Superí", "José María Ruiz S/N")), row.names = c(NA, -15L
), class = c("tbl_df", "tbl", "data.frame"))

You will see that it doesn't work.

Try this then:

clean <- function(x){
  x[which(x == "NA")] <- NA
  if( any(grepl("^\\d+$", x)) ){ x <- as.numeric(x) }
  return(x)
}
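
To illustrate why the anchored pattern helps, and how the function might be applied to every column at once rather than one by one, here is a sketch. The small `my_data` tibble is a stand-in built from values in the thread, and the use of `across(everything(), ...)` is an assumption about how you would want to scale this to 200+ columns:

```r
library(dplyr)

clean <- function(x){
  x[which(x == "NA")] <- NA
  if( any(grepl("^\\d+$", x)) ){ x <- as.numeric(x) }
  return(x)
}

# The unanchored pattern matches any string that merely contains a digit,
# while the anchored one requires the whole string to be digits
grepl("\\d+",   "18 Viviendas, Alto Comedero")  # TRUE
grepl("^\\d+$", "18 Viviendas, Alto Comedero")  # FALSE

# Small stand-in built from values in the thread
my_data <- tibble(
  a          = c("1", "3", "NA"),
  b          = c("car", "NA", "bike"),
  barrio_tag = c("Malvinas", "18 Viviendas, Alto Comedero", "NA")
)

# Apply clean() to every column in one go
cleaned <- my_data |>
  mutate(across(everything(), clean))

sapply(cleaned, class)
```

Two caveats: the pattern only recognizes non-negative whole numbers, so columns with decimals or negative values would stay character, and a text column in which even one value consists solely of digits would still be converted. If that matters for your data, readr::type_convert() is worth a look as an alternative, since it re-guesses the type of each character column as a whole.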
