Unable to extract string using str_extract

lobshi · June 29, 2019, 8:43am

Hi,

I am using the stringr package to extract some strings from a .txt file. It doesn't work for long strings that contain \n.

For example:

txt = "STEERS - Medium and Large 1 (Per Cwt / Actual Wt)\n  Head            Wt Range          Avg Wt              Price Range      Avg Price\n   4                401              401                  185.00          185.00\n   3                452              452                  174.00          174.00\n   28             624-631            627               150.00-154.00      152.56\n   62             664-689            683               139.00-142.00      141.31\n   86             701-736            714               138.00-143.50      140.76\n   11               794              794                  128.50          128.50\n   8                825              825                  125.00          125.00\n   6                808              808                  122.00          122.00     Fleshy\n\n\nSTEERS"

txt is the string I want to subset. I wrote

txt %>% str_extract("STEERS.+?Fleshy\n\n\nSTEERS")

to extract the whole string, but it returned NA. I don't see any mistake in my regex.

Could anyone tell me what's wrong?

Thanks!

wilkox · June 29, 2019, 9:14am

In a stringr regex a period . will match any character except a newline. Because there are some newlines in between 'STEERS' and 'Fleshy', the .+ is failing to capture all the characters in between.

You can use parentheses to create a group that will match any character including a newline:

txt %>% str_extract("STEERS(.|\n)+Fleshy\\n\n\nSTEERS")
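As an alternative to the (.|\n) group, stringr can also be told to let the period match newlines, either with regex(dotall = TRUE) or with the inline (?s) flag. A minimal sketch, assuming a shortened stand-in for the table text (the txt below is not the full string from the question, and the result variable names are only for illustration):

```r
library(stringr)

# Shortened stand-in for the text in the question (assumption, not the full data)
txt <- "STEERS - Medium and Large 1\n  Head   Wt Range\n   6     808     Fleshy\n\n\nSTEERS"

# Default behaviour: `.` does not match a newline, so nothing is extracted
default_match <- str_extract(txt, "STEERS.+?Fleshy\n\n\nSTEERS")

# Option 1: dotall = TRUE lets `.` match newlines as well
dotall_match <- str_extract(txt, regex("STEERS.+?Fleshy\n\n\nSTEERS", dotall = TRUE))

# Option 2: the inline (?s) flag switches on the same behaviour inside the pattern
inline_match <- str_extract(txt, "(?s)STEERS.+?Fleshy\n\n\nSTEERS")

is.na(default_match)            # the default pattern finds no match
identical(dotall_match, txt)    # the dotall pattern returns the whole string
identical(inline_match, txt)    # so does the (?s) version
```

Either option keeps the lazy .+? from the original pattern, which may matter if the file contains more than one block ending in "Fleshy".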

lobshi · June 29, 2019, 6:23pm

Thank you so much! It works. May I ask a further question? Why did you add an extra backslash only to the first \n after Fleshy? There are three \n.

wilkox · June 30, 2019, 1:54am

Sorry, that was just a typo – the extra slash is not necessary.
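The pattern still worked with the extra slash because both spellings end up matching a newline. In R source, "\n" is an actual newline character, while "\\n" is a backslash followed by n, which the regex engine itself reads as a newline escape. A small sketch with made-up test strings (the variable names are only for illustration):

```r
library(stringr)

# "\n" reaches the regex engine as a literal newline character
single_slash <- str_detect("Fleshy\nSTEERS", "Fleshy\nSTEERS")

# "\\n" reaches the regex engine as the two characters \ and n,
# which it interprets as the newline escape
double_slash <- str_detect("Fleshy\nSTEERS", "Fleshy\\nSTEERS")

single_slash
double_slash
```

Both calls should return TRUE, so mixing the two forms in one pattern is harmless, if a little confusing to read.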
