Recoding using str_detect

GregRousell · February 12, 2018, 8:51pm

Hi all,

I'm trying to write a function that will recode a column using str_detect. I have 59 elementary schools that each have a unique ID number; however, some of the extracts have only the school name. The names are often inconsistent in capitalization and in short forms (PS/P.S./Public School), so I'd like to recode using str_detect. Example:

x <- c("Fulton Elementary", "Warner Community School", "Bashford PS")
x <- tolower(x)

dplyr::recode(x,
              "fulton elementary" = "1234",
              "warner community school" = "5678",
              "bashford ps" = "91011")

Returns:

[1] "1234"  "5678"  "91011"

What I would like is something like:

x <- dplyr::recode (x, str_detect (x, "fult") = "1234")

dracodoc · February 12, 2018, 9:21pm

Instead of mapping names to IDs directly, I'd suggest first normalizing the names into a canonical form; after that, it's trivial to link the canonical form to the ID number.

The advantage of this method is that you can see each input next to your interpretation of it, which makes errors easy to spot.
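A minimal sketch of that idea, assuming the canonical form is simply the lower-cased, trimmed name. The `ids` table and the fourth input are made up for illustration; the IDs are the ones from the question.

```r
# Raw names as they might arrive in an extract (the last one is a made-up variant)
raw <- c("Fulton Elementary", " WARNER Community School", "Bashford PS", "Bashford P.S.")

# Hypothetical lookup table: canonical name -> school ID
ids <- c("fulton elementary"       = "1234",
         "warner community school" = "5678",
         "bashford ps"             = "91011")

# Step 1: normalize (here only case and surrounding whitespace)
canonical <- tolower(trimws(raw))

# Step 2: link the canonical form to the ID; names not in the table become NA
check <- data.frame(raw       = raw,
                    canonical = canonical,
                    id        = unname(ids[canonical]))
print(check)
```

Rows where `id` is NA are the names the normalization has not handled yet, so they show which rule to add next.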

It's also possible to break the work into smaller transformation steps, which is more flexible and can make the job much easier. For example, you can replace every PS with Public School (any rule may have exceptions, so it's better to visually inspect the changes).
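One way such a step could look, as a sketch only: the regular expression below is an assumption about how the short forms appear (as the last word of the name), and the sample names are invented.

```r
library(stringr)

raw <- c("Bashford PS", "Bashford P.S.", "Fulton Elementary", "Epsom Public School")

# Step 1: lower-case and collapse stray whitespace
step1 <- str_to_lower(str_squish(raw))

# Step 2: expand "ps" / "p.s." only when it is the final word of the name
step2 <- str_replace(step1, "\\bp\\.?s\\.?$", "public school")

# Inspect only the rows this step actually changed
changed <- step1 != step2
print(data.frame(before = step1[changed], after = step2[changed]))
```

Keeping each step in its own variable makes it possible to compare before and after for a single rule, rather than debugging one large transformation.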

Normalizing input is a common task, so you should be able to find some suggestions in this topic.

floresf · February 13, 2018, 1:29am

Hi @GregRousell,

Good advice from @dracodoc. In addition to that, take a look at case_when:

dplyr::case_when(str_detect(x, "fult") ~ "1234",
                 str_detect(x, "warn") ~ "5678",
                 TRUE ~ x)  # fallback for values that match none of the conditions above; any other default works too
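With 59 schools, writing one case_when line per school gets long. A possible alternative is to keep the patterns in a named vector and loop over it. This is only a sketch: `recode_by_pattern` is a hypothetical helper, not part of dplyr or stringr, and the "bash" pattern and the unknown school are assumptions added for illustration.

```r
library(stringr)

x <- tolower(c("Fulton Elementary", "Warner Community School",
               "Bashford PS", "Some Unknown School"))

# Hypothetical pattern table: substring to detect -> school ID
patterns <- c("fult" = "1234", "warn" = "5678", "bash" = "91011")

# Hypothetical helper: returns the ID of the first matching pattern, or `default`
recode_by_pattern <- function(x, patterns, default = NA_character_) {
  out <- rep(default, length(x))
  # Go through the patterns in reverse so earlier entries win, as in case_when
  for (i in rev(seq_along(patterns))) {
    hit <- str_detect(x, fixed(names(patterns)[i]))
    out[which(hit)] <- patterns[[i]]  # which() skips NA inputs
  }
  out
}

result <- recode_by_pattern(x, patterns)
print(result)
```

Returning NA for unmatched names, instead of the original value, makes it easy to list the names that still need a pattern.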

GregRousell · February 13, 2018, 2:40pm

Thanks. case_when is what I was looking for. Normalizing would be ideal, but right now there are too many possible variations.