I’m working with a public dataset that has state and county names separated by “-”
library(tidyverse)
test <- tribble(
~location,
"Nebraska-Bray",
"Nebraska-Box Butte",
"Florida-Miami-Dade"
)
I want to separate this column into state and county.
separate(test, location, into = c("state","county"), remove = FALSE)
#> Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [2, 3].
#> # A tibble: 3 x 3
#> location state county
#> <chr> <chr> <chr>
#> 1 Nebraska-Bray Nebraska Bray
#> 2 Nebraska-Box Butte Nebraska Box
#> 3 Florida-Miami-Dade Florida Miami
The default regex throws away counties with spaces in the name.
But matching on ‘-’ still discards the Dade part of Miami-Dade.
separate(test, location, into = c("state","county"), sep = "-", remove = FALSE)
#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
#> # A tibble: 3 x 3
#> location state county
#> <chr> <chr> <chr>
#> 1 Nebraska-Bray Nebraska Bray
#> 2 Nebraska-Box Butte Nebraska Box Butte
#> 3 Florida-Miami-Dade Florida Miami
Playing around on regex101.com, I discovered that if I removed the
global flag, I got exactly what I want. But I can’t figure out how
to do that in R. I see that I can control some regex options using
regex()
, but global doesn’t seem to be one of them.
Turns out, Miami-Dade is the only county in the country with this problem,
so I can work around it. I’ve also figured out that I could replace
the first ‘-’, and then separate on that.
test %>%
mutate(location = str_replace(location, pattern = "(^[^-]*)-", replacement = "\\1_")) %>%
separate(location, into = c("state","county"), sep = "_", remove = FALSE )
#> # A tibble: 3 x 3
#> location state county
#> <chr> <chr> <chr>
#> 1 Nebraska_Bray Nebraska Bray
#> 2 Nebraska_Box Butte Nebraska Box Butte
#> 3 Florida_Miami-Dade Florida Miami-Dade
But surely there is a better way? Or a way to control the global option?
Created on 2021-06-30 by the reprex package (v2.0.0)