Getting nth match of string with regular expression and stringr. General string manipulation.

Hi, amazing RStudio Community!

Please find below a reprex of my "issue". But first, let me explain:

  1. In my_tbl I have a text column whose important parts are the numbers.
  2. I want to transform that column into a factor containing just the numbers, as shown in target_tbl.

I was able to do this, but it took a lot of steps, and my guess is that there's a much better and faster way.

One important question I have: how do I extract the nth occurrence of a match with regex and stringr?

library(tidyverse)
library(reprex)

# --- My data ---
my_tbl <- tribble(
  ~ my_text,
  "1. Up to 15 USD",
  "2. More than 15 USD and up to 50 USD",
  "3. More than 50 USD and up to 100 USD",
  "4. More than 100 USD and up to 250 USD",
  "5. More than 250 USD and up to 500 USD",
  "6. More than 500 USD"
)

# --- Desired output ---
target_tbl <- tribble(
  ~ my_text,
  "<=15",
  "+15-50",
  "+50-100",
  "+100-250",
  "+250-500",
  "> 500"
) %>% mutate(my_text = as_factor(my_text))


# --- my attempt ---

my_tbl %>% 
  mutate(
    # Extract two or more digits. Concatenate a "+" at the beginning.
    first_digits = str_extract(my_text, "\\d{2,}") %>% str_c("+", .),
    
    # I want to extract the second group of digits, but have to remove the first one first
    second_digits = str_remove(my_text, "\\d{2,}") %>% str_extract("\\d{2,}")
  ) %>% 
  
  # Then join the digits and separate by "-"
  unite(col = "my_factor", first_digits, second_digits, sep = "-") %>% 
  
  # Change "NA" cases
  mutate(my_factor = case_when(my_factor == "+15-NA" ~ "<=15",
                               my_factor == "+500-NA" ~ "> 500",
                               TRUE ~ my_factor),
         
         # Convert to factor
         my_factor = as_factor(my_factor)
         )
#> # A tibble: 6 x 2
#>   my_text                                my_factor
#>   <chr>                                  <fct>    
#> 1 1. Up to 15 USD                        <=15     
#> 2 2. More than 15 USD and up to 50 USD   +15-50   
#> 3 3. More than 50 USD and up to 100 USD  +50-100  
#> 4 4. More than 100 USD and up to 250 USD +100-250 
#> 5 5. More than 250 USD and up to 500 USD +250-500 
#> 6 6. More than 500 USD                   > 500

Created on 2020-10-18 by the reprex package (v0.3.0)

If anyone had any input on this... I'd appreciate it very much.

Thank you in advance!
Alexis

Something like this, maybe? I'm not sure about the "better and faster" part.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(forcats)
library(stringr)

my_tbl <- tribble(
    ~ my_text,
    "1. Up to 15 USD",
    "2. More than 15 USD and up to 50 USD",
    "3. More than 50 USD and up to 100 USD",
    "4. More than 100 USD and up to 250 USD",
    "5. More than 250 USD and up to 500 USD",
    "6. More than 500 USD"
)

target_tbl <- tribble(
    ~ my_text,
    "<=15",
    "+15-50",
    "+50-100",
    "+100-250",
    "+250-500",
    "> 500"
) %>% mutate(my_text = as_factor(my_text))

my_tbl %>%
    mutate(lower_bound = str_extract(string = my_text,
                                     pattern = "(?<=More than )\\d{2,}"),
           upper_bound = str_extract(string = my_text,
                                     pattern = "(?<=[uU]p to )\\d{2,}"),
           intervals = factor(x = case_when(is.na(x = lower_bound) ~ str_c("<=", upper_bound),
                                            is.na(x = upper_bound) ~ str_c("> ", lower_bound),
                                            TRUE ~ str_c("+", lower_bound, "-", upper_bound)),
                              levels = c("<=15", "+15-50", "+50-100", "+100-250", "+250-500", "> 500")))
#> # A tibble: 6 x 4
#>   my_text                                lower_bound upper_bound intervals
#>   <chr>                                  <chr>       <chr>       <fct>    
#> 1 1. Up to 15 USD                        <NA>        15          <=15     
#> 2 2. More than 15 USD and up to 50 USD   15          50          +15-50   
#> 3 3. More than 50 USD and up to 100 USD  50          100         +50-100  
#> 4 4. More than 100 USD and up to 250 USD 100         250         +100-250 
#> 5 5. More than 250 USD and up to 500 USD 250         500         +250-500 
#> 6 6. More than 500 USD                   500         <NA>        > 500

Created on 2020-10-18 by the reprex package (v0.3.0)
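
For comparison, the same two bounds could probably also be pulled out with capture groups instead of lookbehinds. This is only a minimal sketch, assuming stringr's str_match(), which returns a character matrix with the full match in column 1 and each capture group in the columns after it. The vector x and the shortened data are just for illustration.

```r
library(stringr)

# Illustrative subset of the original strings
x <- c(
  "1. Up to 15 USD",
  "2. More than 15 USD and up to 50 USD",
  "6. More than 500 USD"
)

# Column 2 of the str_match() result holds the first capture group;
# strings without a match come back as NA.
lower_bound <- str_match(x, "More than (\\d+)")[, 2]
upper_bound <- str_match(x, "[uU]p to (\\d+)")[, 2]

# lower_bound is NA, "15", "500"
# upper_bound is "15", "50", NA
```

The case_when() step from the answer above should work unchanged on these two vectors.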

Here's one option using str_extract_all:

my_tbl %>% rowwise() %>%
  mutate(
    # Extract every run of two or three digits into a list column.
    diglist = str_extract_all(my_text, "\\d{2,3}"),
    # First match, and the last match only when there is more than one.
    first = head(diglist, 1),
    second = ifelse(length(diglist) > 1, tail(diglist, 1), NA))

# A tibble: 6 x 4
# Rowwise: 
  my_text                                diglist   first second
  <chr>                                  <list>    <chr> <chr> 
1 1. Up to 15 USD                        <chr [1]> 15    NA    
2 2. More than 15 USD and up to 50 USD   <chr [2]> 15    50    
3 3. More than 50 USD and up to 100 USD  <chr [2]> 50    100   
4 4. More than 100 USD and up to 250 USD <chr [2]> 100   250   
5 5. More than 250 USD and up to 500 USD <chr [2]> 250   500   
6 6. More than 500 USD                   <chr [1]> 500   NA
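
On the general "nth match" question, the same idea can be wrapped in a small vectorised helper that avoids rowwise(). This is a rough sketch rather than anything built into stringr: the name str_extract_nth is made up for this example, and it relies on the fact that indexing a character vector past its end returns NA.

```r
library(stringr)

# Hypothetical helper (not part of stringr): return the nth regex match
# from each string, or NA when a string has fewer than n matches.
str_extract_nth <- function(string, pattern, n) {
  all_matches <- str_extract_all(string, pattern)
  # m[n] is NA_character_ whenever n is larger than length(m)
  vapply(all_matches, function(m) m[n], character(1))
}

x <- c("1. Up to 15 USD", "2. More than 15 USD and up to 50 USD")

first_num  <- str_extract_nth(x, "\\d{2,}", 1)  # "15", "15"
second_num <- str_extract_nth(x, "\\d{2,}", 2)  # NA, "50"
```

Inside mutate() this would be called as str_extract_nth(my_text, "\\d{2,}", 2), with no row-by-row grouping needed.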

You're right. Position alone gives no principled way to distinguish the 1st row from the 6th: each contains a single number, but in the 1st row it is an upper bound and in the 6th a lower bound.

Is stringr a requirement for you? I haven't put as much effort into learning stringr as I probably should, but I think this can be done more simply using the base R functions gregexpr and regmatches.

Edit
I went ahead and put together an answer anyway, and ended up writing basically the same thing andresrcs wrote.

But if the regex is essential, i.e. you want the string indices of the matches, then gregexpr / regexec will give you these. In the output below (for the first element), 10 is the start position of the match ("15") and 2 is the length of the match.

The actual match values can then be extracted with regmatches.

gregexpr('\\d{2,}', my_tbl$my_text)[[1]]

[1] 10
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
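
To make the regmatches step concrete, here is a minimal base R sketch on two of the original strings. The variable names (positions, matches, second) are only for illustration.

```r
my_text <- c("1. Up to 15 USD", "2. More than 15 USD and up to 50 USD")

# gregexpr() records the start position and length of every match per string
positions <- gregexpr("\\d{2,}", my_text)

# regmatches() turns those positions back into the matched substrings,
# giving one character vector per input string
matches <- regmatches(my_text, positions)

# Pick the 2nd match from each string; indexing past the end yields NA
second <- vapply(matches, function(m) m[2], character(1))
# second is NA for the first string and "50" for the second
```

Changing the 2 to any other n gives the nth match in the same way.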

Also, I don't think this kind of task usually requires regular expressions: a factor variable rarely has enough levels to justify the effort. Why not simply make a one-to-one replacement?

library(tidyverse)

my_tbl <- tribble(
    ~ my_text,
    "1. Up to 15 USD",
    "2. More than 15 USD and up to 50 USD",
    "3. More than 50 USD and up to 100 USD",
    "4. More than 100 USD and up to 250 USD",
    "5. More than 250 USD and up to 500 USD",
    "6. More than 500 USD"
)

replace <- c(
    "1. Up to 15 USD" = "<=15",
    "2. More than 15 USD and up to 50 USD" = "+15-50",
    "3. More than 50 USD and up to 100 USD" = "+50-100",
    "4. More than 100 USD and up to 250 USD" = "+100-250",
    "5. More than 250 USD and up to 500 USD" = "+250-500",
    "6. More than 500 USD" = "> 500"
)

my_tbl %>% 
    mutate(my_text = str_replace_all(my_text, replace))
#> # A tibble: 6 x 1
#>   my_text 
#>   <chr>   
#> 1 <=15    
#> 2 +15-50  
#> 3 +50-100 
#> 4 +100-250
#> 5 +250-500
#> 6 > 500

Created on 2020-10-18 by the reprex package (v0.3.0)
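
One caveat with the replacement above: str_replace_all() treats the names of the vector as regular expressions, so the "." after each number matches any character. That is harmless for these six strings, but an exact lookup sidesteps it and can return a factor, as in target_tbl. A possible sketch, assuming plain named-vector indexing; the names lookup, result and my_factor are invented for this example.

```r
library(dplyr)

my_tbl <- tibble(my_text = c(
  "1. Up to 15 USD",
  "2. More than 15 USD and up to 50 USD",
  "3. More than 50 USD and up to 100 USD",
  "4. More than 100 USD and up to 250 USD",
  "5. More than 250 USD and up to 500 USD",
  "6. More than 500 USD"
))

# Named vector: names are the original labels, values the short labels
lookup <- c(
  "1. Up to 15 USD" = "<=15",
  "2. More than 15 USD and up to 50 USD" = "+15-50",
  "3. More than 50 USD and up to 100 USD" = "+50-100",
  "4. More than 100 USD and up to 250 USD" = "+100-250",
  "5. More than 250 USD and up to 500 USD" = "+250-500",
  "6. More than 500 USD" = "> 500"
)

# lookup[my_text] matches names exactly (no regex involved); the levels
# argument keeps the factor levels in the order the lookup was written.
result <- my_tbl %>%
  mutate(my_factor = factor(unname(lookup[my_text]), levels = unname(lookup)))
```

Any label missing from lookup would come out as NA here instead of being left untouched, which may or may not be what you want.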

Thank you all so much for the input! This has been amazing, I've learned a lot from you people and this will certainly help others that have similar questions to mine.

Best,
Alexis