replacement argument in str_replace has different length than string

Hi, I want to write code that automatically cleans the names in one tibble (clean_db) using the correct names from another tibble (Provincia). The code is the following:

str_replace(clean_db$provincia,
            paste(str_sub(clean_db$provincia, 1, 2),
                  ".+",
                  str_sub(clean_db$provincia, -2),
                  sep = ""),
            as.character(as_tibble(str_extract(Provincia$descripcion,
                                               paste(str_sub(clean_db$provincia, 1, 2),
                                                     ".+",
                                                     str_sub(clean_db$provincia, -2),
                                                     sep = ""))) %>% drop_na()))

The clean_db tibble has 532 observations and the Provincia tibble has 25. The problem is that the replacement argument picks up all the strings in Provincia at once. What I need is, for each row of clean_db, to look up the matching string in Provincia and use only that one string as the replacement, then move on to the next row and do the same, until the whole clean_db tibble has been processed.

Thanks a lot,
KM

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

Hi, sorry about that. Here are minimal versions of the two datasets (clean_db and Provincia):

clean_db <- tibble(provincia = c("AZUY", "BOLI$BAR", "CAN_AR", "GUY$AS", "PICHI.CHA",
                                 "COTPAXI", "MORON/A SANTIAGO"),
                   ciudad = c("QUITO", "CUENCA", "GUAYAQUIL", "MANTA", "PORTOVIEJO",
                              "AZOGUES", "SALINAS"))

Provincia <- tibble(codigo = c(1:17),
                    descripcion = c("AZUAY",
                                    "BOLIVAR",
                                    "CAÑAR",
                                    "CARCHI",
                                    "CHIMBORAZO",
                                    "COTOPAXI",
                                    "EL ORO",
                                    "ESMERALDAS",
                                    "GALAPAGOS",
                                    "GUAYAS",
                                    "IMBABURA",
                                    "LOJA",
                                    "LOS RIOS",
                                    "MANABI",
                                    "MORONA SANTIAGO",
                                    "NAPO",
                                    "SANTO DOMINGO DE LOS TSACHILAS"))

The code is:

str_replace(clean_db$provincia,
            paste(str_sub(clean_db$provincia, 1, 2),
                  ".+",
                  str_sub(clean_db$provincia, -2),
                  sep = ""),
            as.character(as_tibble(str_extract(Provincia$descripcion,
                                               paste(str_sub(clean_db$provincia, 1, 2),
                                                     ".+",
                                                     str_sub(clean_db$provincia, -2),
                                                     sep = ""))) %>% drop_na()))

Regards,
KM

Replacing words based on their first two and last two characters doesn't seem like a reliable method. I think you should consider using string distance metrics instead, as in this example:

library(tidyverse)
library(fuzzyjoin)

clean_db <- tibble(provincia = c("AZUY", "BOLI$BAR", "CAN_AR", "GUY$AS", "PICHI.CHA",
                                 "COTPAXI", "MORON/A SANTIAGO"),
                   ciudad = c("QUITO", "CUENCA", "GUAYAQUIL", "MANTA", "PORTOVIEJO",
                              "AZOGUES", "SALINAS"))

Provincia <- tibble(codigo = c(1:17),
                    descripcion = c("AZUAY",
                                    "BOLIVAR",
                                    "CAÑAR",
                                    "CARCHI",
                                    "CHIMBORAZO",
                                    "COTOPAXI",
                                    "EL ORO",
                                    "ESMERALDAS",
                                    "GALAPAGOS",
                                    "GUAYAS",
                                    "IMBABURA",
                                    "LOJA",
                                    "LOS RIOS",
                                    "MANABI",
                                    "MORONA SANTIAGO",
                                    "NAPO",
                                    "SANTO DOMINGO DE LOS TSACHILAS"))

clean_db %>% 
    stringdist_left_join(Provincia %>% select(descripcion),
                         by = c(provincia = "descripcion"),
                         method = "osa") %>% 
    mutate(provincia = coalesce(descripcion, provincia)) %>% 
    select(-descripcion)
#> # A tibble: 7 × 2
#>   provincia       ciudad    
#>   <chr>           <chr>     
#> 1 AZUAY           QUITO     
#> 2 BOLIVAR         CUENCA    
#> 3 CAÑAR           GUAYAQUIL 
#> 4 GUAYAS          MANTA     
#> 5 PICHI.CHA       PORTOVIEJO
#> 6 COTOPAXI        AZOGUES   
#> 7 MORONA SANTIAGO SALINAS

Created on 2022-03-30 by the reprex package (v2.0.1)
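
To make the "one match per row" idea more concrete, here is a minimal sketch of the same approach using only base R. It is an illustration under assumptions, not what fuzzyjoin does internally: `adist()` computes generalized Levenshtein distance rather than the "osa" method used in the join, and the `dirty`, `reference`, `distances`, `best` and `nearest` names are hypothetical, built from a small subset of the data in this thread.

```r
# Hypothetical subset of the thread's data, kept small for illustration.
dirty <- c("AZUY", "BOLI$BAR", "COTPAXI")
reference <- c("AZUAY", "BOLIVAR", "COTOPAXI", "GUAYAS")

# adist() (utils package, attached by default) returns a matrix of
# generalized Levenshtein distances: one row per dirty name,
# one column per reference name.
distances <- adist(dirty, reference)

# For each dirty name, the column index of the closest reference name.
# which.min() returns the first index on ties.
best <- apply(distances, 1, which.min)

nearest <- data.frame(
  dirty = dirty,
  match = reference[best],
  distance = distances[cbind(seq_along(dirty), best)]
)
nearest
```

Note that this sketch always returns some match, even a distant one. A name with no real counterpart in the reference list (like "PICHI.CHA" in the output above) would need a distance cutoff, which is roughly the role the `max_dist` argument plays in `stringdist_left_join()`.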

Or, if possible, manually define a vector with equivalences e.g. c('misspelling' = 'correct'), which would have the most accurate results.
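
A minimal sketch of that manual approach, assuming dplyr and tibble are installed. The `equivalences` vector and the `fixed` name are hypothetical and cover only two of the misspellings, and the tibble is a shortened version of clean_db. Values without an entry in the lookup are left unchanged.

```r
library(dplyr)
library(tibble)

# Shortened, hypothetical version of clean_db from this thread.
clean_db <- tibble(provincia = c("AZUY", "BOLI$BAR", "PICHI.CHA"),
                   ciudad = c("QUITO", "CUENCA", "PORTOVIEJO"))

# Hypothetical lookup: names are the misspellings, values are the corrections.
equivalences <- c("AZUY" = "AZUAY", "BOLI$BAR" = "BOLIVAR")

fixed <- clean_db %>%
    # Indexing by name gives NA for misspellings missing from the lookup;
    # coalesce() then falls back to the original value.
    mutate(provincia = coalesce(unname(equivalences[provincia]), provincia))

fixed
```

Because the lookup is by exact name rather than by regular expression, characters such as `$` or `.` in the misspellings need no escaping.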

Note: Next time please provide a proper REPRoducible EXample (reprex) illustrating your issue.
