Turn off "global" matching in a regex

atiretoo · June 30, 2021, 4:29pm

I’m working with a public dataset that has state and county names separated by “-”

library(tidyverse)
test <- tribble(
  ~location,
  "Nebraska-Bray",
  "Nebraska-Box Butte",
  "Florida-Miami-Dade"
)

I want to separate this column into state and county.

separate(test, location, into = c("state","county"), remove = FALSE)
#> Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [2, 3].
#> # A tibble: 3 x 3
#>   location           state    county
#>   <chr>              <chr>    <chr> 
#> 1 Nebraska-Bray      Nebraska Bray  
#> 2 Nebraska-Box Butte Nebraska Box   
#> 3 Florida-Miami-Dade Florida  Miami

The default regex throws away counties with spaces in the name.
But matching on ‘-’ still discards the Dade part of Miami-Dade.

separate(test, location, into = c("state","county"),  sep = "-", remove = FALSE)
#> Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
#> # A tibble: 3 x 3
#>   location           state    county   
#>   <chr>              <chr>    <chr>    
#> 1 Nebraska-Bray      Nebraska Bray     
#> 2 Nebraska-Box Butte Nebraska Box Butte
#> 3 Florida-Miami-Dade Florida  Miami

Playing around on regex101.com, I discovered that if I removed the
global flag, I got exactly what I want. But I can’t figure out how
to do that in R. I see that I can control some regex options using
regex(), but global doesn’t seem to be one of them.

Turns out, Miami-Dade is the only county in the country with this problem,
so I can work around it. I’ve also figured out that I could replace
the first ‘-’, and then separate on that.

test %>% 
  mutate(location = str_replace(location, pattern = "(^[^-]*)-", replacement = "\\1_")) %>% 
  separate(location, into = c("state","county"),  sep = "_", remove = FALSE )
#> # A tibble: 3 x 3
#>   location           state    county    
#>   <chr>              <chr>    <chr>     
#> 1 Nebraska_Bray      Nebraska Bray      
#> 2 Nebraska_Box Butte Nebraska Box Butte 
#> 3 Florida_Miami-Dade Florida  Miami-Dade

But surely there is a better way? Or a way to control the global option?

^{Created on 2021-06-30 by the reprex package (v2.0.0)}

FJCC · June 30, 2021, 5:24pm

I think you can control this adequately with the extra argument of separate.

library(tibble)
library(tidyr)
test <- tribble(
  ~location,
  "Nebraska-Bray",
  "Nebraska-Box Butte",
  "Florida-Miami-Dade"
)
test
separate(test, location, into = c("State", "County"), sep = "-", extra = "merge")
# A tibble: 3 x 2
  State    County    
  <chr>    <chr>     
1 Nebraska Bray      
2 Nebraska Box Butte 
3 Florida  Miami-Dade

system · July 7, 2021, 5:24pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.