Extract different percentages/numbers from a paragraph/string in R

I'm a novice in R and am struggling with extracting percentages/numbers from strings in a data frame. For example,

df <- data.frame(
  Species =c("Bidens pilosa","Orobanche ramose"),
  Impact = c("Soyabean yield loss was 10%. A density of one plant resulted in a yield loss of 9.4%; two plants, 17.3%; and four to eight plants, 28%...In contrast, suppression of the weed by the crop was only 10%","Cypress was estimated to have a 28% loss annually. The annual increase of the disease in some stands in the Peloponnesus, with an initial attack of 20%, ranged from 5% to 20% ")
)

My questions are the following:

In this case, I only want to extract the yield loss for each crop, which is 10 and 28, and skip the percentages and numbers that describe other aspects (such as 9.4%, 17.3%, 5%, etc.). Can I achieve this in R, or does it require natural language processing skills?
If it is hard to distinguish the different types of percentages, how can I extract all percentages/numbers at once so that I can pick the right number manually? I have tried

df %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric

or

parse_number(df$Impact)

But I don't think either of them works, because they give me continuous lines of numbers.

Thanks for your help.

If you can assume that the first percentage in each text field is the yield loss and that the following percentages are of secondary value (I have no idea whether this is a ludicrous assumption or not), you can get the first number rather easily with parse_number() inside a mutate() call, because parse_number() returns only the first number it finds in each string.

library(tidyverse)
df <- tibble(
  Species =c("Bidens pilosa","Orobanche ramose"),
  Impact = c("Soyabean yield loss was 10%. A density of one plant resulted in a yield loss of 9.4%; two plants, 17.3%; and four to eight plants, 28%...In contrast, suppression of the weed by the crop was only 10%","Cypress was estimated to have a 28% loss annually. The annual increase of the disease in some stands in the Peloponnesus, with an initial attack of 20%, ranged from 5% to 20% "))

df %>% 
  mutate(yield_loss = parse_number(Impact))
#> # A tibble: 2 x 3
#>   Species       Impact                                                yield_loss
#>   <chr>         <chr>                                                      <dbl>
#> 1 Bidens pilosa "Soyabean yield loss was 10%. A density of one plant…         10
#> 2 Orobanche ra… "Cypress was estimated to have a 28% loss annually. …         28

Created on 2020-05-28 by the reprex package (v0.3.0)

I'm not sure if that helps. For anything more complex than that, it would help to know how much data there is and how much time you're willing to put in to avoid labelling these by hand. That is, if this is a dataset of 100-200 lines, it's probably much quicker to hand-label them than to implement an ML approach, but if you have 10 million rows that's obviously a different case altogether.


I don't know if this is what you want, but:

df %>% 
  mutate( 
    # the optional (?:\\.\\d+)? part keeps decimals such as 9.4% intact
    percents = str_extract_all(Impact, '\\d+(?:\\.\\d+)?%')
  ) %>% 
  unnest(percents) %>% 
  pull(percents)

The idea is to create a new column with mutate and then extract it with pull.

I included the percentage symbol in the pattern, so you get strings back.

But if you want just the numbers:

df %>% mutate( percents = str_extract_all(Impact, '\\d+(?:\\.\\d+)?')) %>% unnest(percents) %>% pull(percents) %>% as.numeric()
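
If the goal is to pick the right number by hand, it may help to keep the species next to each extracted value instead of pulling everything into one vector. The following is only a sketch under a few assumptions: the Impact text is shortened for illustration, the name percents_long is made up, and the lookahead (?=%) keeps only numbers that are directly followed by a percent sign.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Shortened stand-in for the data frame from the question (illustrative only).
df <- tibble(
  Species = c("Bidens pilosa", "Orobanche ramose"),
  Impact = c(
    "Soyabean yield loss was 10%. A density of one plant resulted in a yield loss of 9.4%; two plants, 17.3%",
    "Cypress was estimated to have a 28% loss annually, with an initial attack of 20%"
  )
)

percents_long <- df %>%
  # one list element per row, holding every number that is followed by "%"
  mutate(percent = str_extract_all(Impact, "\\d+(?:\\.\\d+)?(?=%)")) %>%
  # one output row per extracted percentage, Species is repeated alongside
  unnest(percent) %>%
  mutate(percent = as.numeric(percent)) %>%
  select(Species, percent)

print(percents_long)
```

Each percentage ends up on its own row with its species, so the values no longer run together and the right one can be picked with a filter or by eye.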

Yeah, I agree with you. I only have around 300 records, so handling them manually will probably be the better choice.
By the way, is there another way to extract only the numbers I want and drop the ones that are useless? Thank you very much.
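
One possible way to narrow things down without a full NLP approach is to keep only percentages that sit right next to a keyword such as "loss". This is a rough sketch, not a general solution: the phrasings "loss was", "loss of" and "% loss" are assumptions taken from the two example texts, the texts are shortened, and the names num, loss_pattern, loss_values and first_loss are made up for illustration.

```r
library(stringr)

# Shortened versions of the two Impact texts from the question.
impact <- c(
  "Soyabean yield loss was 10%. A density of one plant resulted in a yield loss of 9.4%; two plants, 17.3%",
  "Cypress was estimated to have a 28% loss annually, with an initial attack of 20%, ranged from 5% to 20%"
)

# A number with an optional decimal part.
num <- "\\d+(?:\\.\\d+)?"

# Assumed phrasings: "loss was 10%", "loss of 9.4%", or "28% loss".
loss_pattern <- str_c("loss (?:was|of) ", num, "%|", num, "% loss")

# All keyword-adjacent snippets per text, then the number inside each snippet.
loss_snippets <- str_extract_all(impact, loss_pattern)
loss_values <- lapply(loss_snippets, function(x) as.numeric(str_extract(x, num)))

# Keep the first loss value per text, or NA if no snippet matched.
first_loss <- vapply(
  loss_values,
  function(x) if (length(x) > 0) x[1] else NA_real_,
  numeric(1)
)

print(loss_values)
print(first_loss)
```

Note that the keyword rule still picks up 9.4 in the first text, because that sentence also talks about a yield loss. Taking the first match per row is what separates 10 from 9.4 here, so the results would still need a manual check on real data.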
