Tidily removing last piece of string after a space

AmeliaMN · December 2, 2019, 10:51pm

I have an aphorism that almost every time I reach for stringr what I really want is tidyr::separate. However, today I came up with what I thought was a simple example of string manipulation in class and ended up with a much more complicated solution than I would have liked. The idea is a variable with street names, and we want to pull off the "suffixes" (St, Dr, Ave, etc).

Here was my first (non-functional) attempt:

library(Stat2Data)
library(dplyr)
library(tidyr)
data("RailsTrails")
RailsTrails <- RailsTrails %>%
  separate(StreetName, into = c("name", "kind"), sep = " ", extra = "merge")

Created on 2019-12-02 by the reprex package (v0.2.1)

This doesn't work for multi-word street names: the split happens at the first space, so the first word goes into the first variable and the remaining words are merged into the second. Is there a way to modify this separate() call so it does what I want?
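
To make the failure concrete, here is a minimal sketch on a made-up data frame. The `streets` data and its values are assumptions for illustration, not rows from RailsTrails.

```r
library(tidyr)

# Hypothetical street names, not taken from the RailsTrails data
streets <- data.frame(
  StreetName = c("Main St", "Pretty Tree Dr"),
  stringsAsFactors = FALSE
)

# sep = " " splits at the first space; extra = "merge" keeps
# everything after that first split together in the last column
split_first <- separate(
  streets, StreetName,
  into = c("name", "kind"), sep = " ", extra = "merge"
)
split_first
```

For "Pretty Tree Dr", `kind` ends up holding "Tree Dr" rather than just "Dr", which is the behavior described above.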

Since I couldn't figure that out, I tried using str_split() and ended up with this:

library(Stat2Data)
library(dplyr)
library(stringr)
library(purrr)
data("RailsTrails")
RailsTrails <- RailsTrails %>%
  mutate(pieces = str_split(StreetName, " ")) %>%
  mutate(last_piece = map_chr(pieces, ~.x[length(.x)]))

Created on 2019-12-02 by the reprex package (v0.2.1)

This works, but it uses map_chr(), which I didn't intend to teach in this class! Any more elegant solutions out there?

martj42 · December 2, 2019, 11:06pm

I think stringr::word() should do the job:

RailsTrails <- RailsTrails %>%
  mutate(last_piece = word(StreetName, start = -1))
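
As a quick check of how negative positions behave in word(), here is a small sketch on a made-up character vector (the `streets` values are assumptions, not the RailsTrails data). It also shows one way the remaining name could be pulled out, assuming every string has at least two words.

```r
library(stringr)

# Hypothetical street names for illustration
streets <- c("Main St", "Pretty Tree Dr", "Three Word Name Ave")

# Negative positions count from the end, so start = -1 picks the last word
suffix <- word(streets, start = -1)

# From the first word up to the second-to-last word: the name without the suffix
name <- word(streets, start = 1, end = -2)

suffix
name
```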

FJCC · December 2, 2019, 11:13pm

If you really want to use separate, this seems to work, though I think it is evil.

library(tidyr)
DF <- data.frame(Street = c("Main St", "Pretty Tree Dr", "Three Word Name Ave"))
DF2 <- DF %>% separate(col = "Street", into = c("Name", "Type"),
                       sep = " (?=[^ ]+$)",
                       remove = FALSE)
DF2
#>                Street            Name Type
#> 1             Main St            Main   St
#> 2      Pretty Tree Dr     Pretty Tree   Dr
#> 3 Three Word Name Ave Three Word Name  Ave

Created on 2019-12-02 by the reprex package (v0.2.1)
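
For anyone puzzling over the pattern: " (?=[^ ]+$)" matches a space only when it is followed by a run of non-space characters that reaches the end of the string, which means it matches the last space. A rough sketch with base R's strsplit() (used here as a stand-in for illustration, with perl = TRUE to enable the lookahead):

```r
# Same toy street names as in the example above
streets <- c("Main St", "Pretty Tree Dr", "Three Word Name Ave")

# (?=...) is a lookahead: it checks what follows the space without consuming it,
# so only the final space of each string is treated as the separator
pieces <- strsplit(streets, " (?=[^ ]+$)", perl = TRUE)
pieces
```

Each element of `pieces` has exactly two parts, the name and the suffix, no matter how many words the name contains.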

sushmita · December 2, 2019, 11:15pm

library(Stat2Data)
library(dplyr)
library(tidyr)
library(stringr)  # needed for str_split()
data("RailsTrails")
RailsTrails %>%
  select(StreetName) %>%
  mutate(street_words = str_split(StreetName, " ")) %>%
  unnest(street_words) %>%      # one row per word
  group_by(StreetName) %>%
  filter(row_number() == n())   # keep the last word in each group

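I guess this is circuitous, but the steps are:

str_split() to split each street name into words at each " "
unnest() to get long data, one row per word
group_by() the original string, so the words are indexed within each street name
filter() to keep the last word in each group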

skaltman · December 3, 2019, 1:28am

What about extract()?

library(Stat2Data)
library(dplyr)
library(tidyr)
data("RailsTrails")

v <-
  RailsTrails %>% 
  extract(
    col = StreetName, 
    into = c("name", "kind"), 
    regex = "(.*) (.*)",
    remove = FALSE
  )

Created on 2019-12-02 by the reprex package (v0.2.1)

The regex groups are "greedy", so the first one extracts everything it can, leaving just enough (one word after a space) for the second.
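
To see the effect of greediness in isolation, here is a small sketch using base R's sub() as a stand-in for extract(). The example string is made up, and the lazy variant `(.*?)` is added only for contrast; it is not part of the solution above.

```r
street <- "Three Word Name Ave"

# Greedy first group: takes as much as it can, so the split lands on the last space
greedy_name <- sub("^(.*) (.*)$", "\\1", street, perl = TRUE)
greedy_kind <- sub("^(.*) (.*)$", "\\2", street, perl = TRUE)

# Lazy first group (.*?): takes as little as it can, so the split lands on the first space
lazy_name <- sub("^(.*?) (.*)$", "\\1", street, perl = TRUE)
lazy_kind <- sub("^(.*?) (.*)$", "\\2", street, perl = TRUE)

c(greedy_name, greedy_kind)
c(lazy_name, lazy_kind)
```

The greedy version gives "Three Word Name" and "Ave", while the lazy one gives "Three" and "Word Name Ave", which is the same result as the original separate() attempt.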

AmeliaMN · December 3, 2019, 4:46pm

Ooh, I didn't know about word()! I think that is probably the right solution for this class.

AmeliaMN · December 3, 2019, 4:47pm

Ooh, that's interesting! Thanks.
