Pattern matching: matching a string so that it ends right before the first comma

I have this data frame with one column containing the names of articles and their specifications...

# A tibble: 9,168 x 1
   ARTICLE                                                              
   <chr>                                                                     
 1 Article A, size 12, color yellow                 
 2 Article B, size 14, color red                   
 3 Article C, size 16, color yellow                 
 4 Article D, size 08, color yellow              
 5 Article E, size 08, color yellow            
 6 Article F, size 12, color green              
 7 Article G, size 10, color green               
 8 Article H, size 10, color yellow              
 9 Article I, size 14, color red                   
10 Article J, size 14, color blue                
# ... with 9,158 more rows

...but I want a data frame with an additional column containing just the names of the articles, without the specifications (see below). My question is: how do I match the string so that the match ends right before the first comma?

# A tibble: 9,168 x 2
   ARTICLE                                          NEW                          
   <chr>                                            <chr>                            
 1 Article A, size 12, color yellow                 Article A
 2 Article B, size 14, color red                    Article B
 3 Article C, size 16, color yellow                 Article C
 4 Article D, size 08, color yellow                 Article D
 5 Article E, size 08, color yellow                 Article E
 6 Article F, size 12, color green                  Article F
 7 Article G, size 10, color green                  Article G
 8 Article H, size 10, color yellow                 Article H
 9 Article I, size 14, color red                    Article I
10 Article J, size 14, color blue                   Article J
# ... with 9,158 more rows

You can try a positive lookahead, like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
sample_data <- tibble(ARTICLE = c("Article A, size 12, color yellow",
                                  "Article B, size 14, color red",
                                  "Article C, size 16, color yellow",
                                  "Article D, size 08, color yellow",
                                  "Article E, size 08, color yellow",
                                  "Article F, size 12, color green",
                                  "Article G, size 10, color green",
                                  "Article H, size 10, color yellow",
                                  "Article I, size 14, color red",
                                  "Article J, size 14, color blue"))
sample_data %>%
    mutate(NEW = str_extract(string = ARTICLE,
                             pattern = "^.+?(?=,)"))
#> # A tibble: 10 x 2
#>    ARTICLE                          NEW      
#>    <chr>                            <chr>    
#>  1 Article A, size 12, color yellow Article A
#>  2 Article B, size 14, color red    Article B
#>  3 Article C, size 16, color yellow Article C
#>  4 Article D, size 08, color yellow Article D
#>  5 Article E, size 08, color yellow Article E
#>  6 Article F, size 12, color green  Article F
#>  7 Article G, size 10, color green  Article G
#>  8 Article H, size 10, color yellow Article H
#>  9 Article I, size 14, color red    Article I
#> 10 Article J, size 14, color blue   Article J

Created on 2020-02-25 by the reprex package (v0.3.0)

Hope this helps.

Edit

@anon73295571, here is some explanation:

  1. ^ - beginning of the string.
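  2. .+? - any character, one or more times, matched lazily (as few characters as possible), so the match stops at the first comma rather than the last.
  3. (?=,) - a positive lookahead: it checks that the next character is a comma, but that comma is not part of the match.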
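To make the three pieces above concrete, here is a small sketch that runs the same pattern on one of the sample strings and compares it with two variations. The greedy and negated-class patterns are not from the answer above; they are added only for contrast, and the variable names are made up for illustration.

```r
library(stringr)

x <- "Article A, size 12, color yellow"

# Pattern from the answer: lazy .+? stops before the FIRST comma
lazy <- str_extract(x, "^.+?(?=,)")

# Same pattern without the ?: greedy .+ runs up to the LAST comma
greedy <- str_extract(x, "^.+(?=,)")

# Alternative without a lookahead: one or more characters that are not a comma
negated <- str_extract(x, "^[^,]+")

print(lazy)     # "Article A"
print(greedy)   # "Article A, size 12"
print(negated)  # "Article A"
```

The lazy quantifier is what keeps the match from running on to the second comma; the negated character class gets the same result and may be easier to read if lookaheads are unfamiliar.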

You can explore RegExr and the Regex Tutorial page on Lookahead and Lookbehind.


This definitely solves my problem, although this pattern is not traceable for me yet, but I will investigate.

You can try the following, where table stands for your data frame:

table %>%
  mutate(new_col = str_split(ARTICLE, ",", simplify = TRUE)[, 1])

Thanks!
Heramb
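One difference between the two suggestions in this thread may matter on the full data: what happens to a row that contains no comma at all. The sketch below is only an illustration; the value "Article K" is made up, and the base R sub() line is an extra alternative that nobody in the thread proposed.

```r
library(stringr)

# Second element is a hypothetical article name with no comma
x <- c("Article A, size 12, color yellow", "Article K")

# Lookahead pattern: requires a comma, so the second element becomes NA
via_lookahead <- str_extract(x, "^.+?(?=,)")

# Split approach: with no comma, the first piece is the whole string
via_split <- str_split(x, ",", simplify = TRUE)[, 1]

# Base R alternative: delete everything from the first comma onward
via_sub <- sub(",.*$", "", x)

print(via_lookahead)  # "Article A" NA
print(via_split)      # "Article A" "Article K"
print(via_sub)        # "Article A" "Article K"
```

If every row is guaranteed to have a comma, all three give the same result; otherwise the choice depends on whether a missing comma should yield NA or the unchanged string.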
