Trouble extracting a number from a string

Matthias · March 19, 2021, 1:10pm

I have a column that usually holds a number but sometimes contains text wrapped around the number.
I managed to remove the text, but I am not sure whether other text might occur, so I thought it would be easier to just extract the numbers.
Strangely, this works as long as I don't allow multiple digits after the decimal point. Maybe I am missing something?

Thanks for your help!

library(stringr)
Area = c("saturated( 790887469.345 )",
         "saturated( 790887469.3 )",
         "saturated( 790887469 )", 
         "790887469.345", 
         "790887469.3", 
         "790887469") 
         
str_extract(Area, "\\d*")
#[1] ""          ""          ""          "790887469" "790887469" "790887469"
# misses the results with the additional text and brackets
# misses the digits after the . (as expected)

str_extract(Area, "\\d*\\.*\\d")
#[1] "790887469.3" "790887469.3" "790887469"   "790887469.3" "790887469.3" "790887469"
# correctly extracts all to the first digit after the "."

#okay so far so good, just allow more digits!
str_extract(Area, "\\d*\\.*\\d*")
# [1] ""              ""              ""              "790887469.345" "790887469.3"   "790887469"
# What? 
# correctly extracts all digits but misses the results in brackets.

# with grouping?
str_extract(Area, "\\d*(\\.\\d*)*")
# [1] ""              ""              ""              "790887469.345" "790887469.3"   "790887469"  
# nope!

andresrcs · March 19, 2021, 5:43pm

This regex gets the job done

library(stringr)

Area = c("saturated( 790887469.345 )",
         "saturated( 790887469.3 )",
         "saturated( 790887469 )", 
         "790887469.345", 
         "790887469.3", 
         "790887469")

str_extract(Area, "\\d+\\.?\\d*")
#> [1] "790887469.345" "790887469.3"   "790887469"     "790887469.345"
#> [5] "790887469.3"   "790887469"

Created on 2021-03-19 by the reprex package (v1.0.0.9002)
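
If the goal is a numeric column rather than a character one, the pattern above can be wrapped in a small helper. This is only a sketch: `extract_number` is a made-up name, not a stringr function, and it assumes plain unsigned decimals with `.` as the decimal mark (no signs, thousands separators, or exponents).

```r
library(stringr)

# Hypothetical helper (not part of stringr): pull the first unsigned decimal
# number out of each string and convert it to numeric.
extract_number <- function(x) {
  # \\d+ requires at least one digit, so the match can never be empty;
  # strings without any digit give NA.
  as.numeric(str_extract(x, "\\d+\\.?\\d*"))
}

values <- extract_number(c("saturated( 790887469.345 )",
                           "no number here",
                           "790887469"))
```

Strings that contain no digits come back as `NA`, which makes unexpected text easy to spot with `is.na()`.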

technocrat · March 19, 2021, 5:58pm

Here's another cut. The first result looks unexpected, but only because of display options: the decimals are still there, as the subtraction further down shows. It also returns the values as numeric.

suppressPackageStartupMessages({
  library(magrittr)
  library(stringr)
})

Area = c("saturated( 790887469.345 )",
         "saturated( 790887469.3 )",
         "saturated( 790887469 )", 
         "790887469.345", 
         "790887469.3", 
         "790887469") 

pattern <- "[0-9.]+"

str_extract(Area,pattern) %>% as.numeric()
#> [1] 790887469 790887469 790887469 790887469 790887469 790887469

# H/T to Roger Dalgaard https://stat.ethz.ch/pipermail/r-help/2002-February/018503.html

(str_extract(Area,pattern) %>% as.numeric()) - 790887469
#> [1] 0.345 0.300 0.000 0.345 0.300 0.000
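
The decimals in the first result are not lost; R prints 7 significant digits by default, so they are simply not shown. As a small sketch (the variable names are only for illustration), formatting the value explicitly makes them visible:

```r
x <- as.numeric("790887469.345")

# Ask for three decimal places explicitly instead of relying on the
# default print precision (getOption("digits"), 7 unless changed).
shown <- sprintf("%.3f", x)
print(shown)

# Alternatively, request more significant digits for a single print call.
print(x, digits = 12)
```

Setting `options(digits = ...)` would change the default for the whole session, which may or may not be what you want.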

Matthias · March 20, 2021, 12:30pm

Great, thanks! Actually, it seems to be the "+" at the first \d that makes the difference.
Just out of curiosity: any idea why "matches at least 1 time" works but "matches at least 0 times" does not?

andresrcs · March 20, 2021, 12:33pm

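Because every part of your pattern is optional, it can match zero characters, and the regex engine returns the first position where a match succeeds. For the strings that start with "saturated(", that is the very beginning of the string, where "0 times" already matches, so you get "" as the result.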
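
To make this visible, here is a minimal sketch comparing the two patterns on one of the strings from the question. The base R `regexpr()` call is only used to report where the first match starts and how long it is; the variable names are illustrative.

```r
library(stringr)

x <- "saturated( 790887469.345 )"

# Every piece is optional, so the pattern succeeds right at the first
# character ("s") by matching nothing at all.
all_optional <- str_extract(x, "\\d*\\.*\\d*")

# Requiring at least one digit forces the engine to move forward until
# it actually finds a digit.
digit_required <- str_extract(x, "\\d+\\.?\\d*")

# regexpr() returns the start of the first match and, as an attribute,
# its length; a length of 0 means an empty match.
m <- regexpr("\\d*\\.*\\d*", x, perl = TRUE)
match_start  <- as.integer(m)
match_length <- attr(m, "match.length")
```

Here `all_optional` is the empty string and `match_length` is 0, while `digit_required` holds the full number.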
