How to account for NA's in speaker identification when text is not uniformly formatted

Hi all!

I am pulling text from a podcast transcript. Most lines are prefaced by the speaker's name, but some have no preface and are continuations of the previous speaker's line.

For example:

JON SMITH: How are you all doing today?
The weather is pretty cold I think
JANE DOE you are right about that Jon

library(tidyverse)

df <-
  tibble(quotes =
           c("JON SMITH: How are you all doing today?",
             "The weather is pretty cold I think",
             "JANE DOE you are right about that Jon"),
         line = seq_along(quotes))

# Create speaker column; lines matching neither pattern get NA
df %>%
  mutate(speaker = case_when(
    str_detect(quotes, "^JO") ~ "Jon",
    str_detect(quotes, "^JA") ~ "Jane"))

The resulting table would look like this:

line | quotes                                  | speaker
1    | JON SMITH: How are you all doing today? | Jon
2    | The weather is pretty cold I think      | NA
3    | JANE DOE you are right about that Jon   | Jane 

I am able to create a new column speaker_na with the following code:

mutate(speaker_na = ifelse(is.na(speaker), lag(speaker), NA))

Which results in:

line | quotes                                  | speaker | speaker_na 
1    | JON SMITH: How are you all doing today? | Jon     | NA
2    | The weather is pretty cold I think      | NA      | Jon
3    | JANE DOE you are right about that Jon   | Jane    | NA

I can't seem to figure out a) how to then collapse these two columns into one, and b) what to do when a speaker says three or four lines of text in a row, since lag() only looks back one row:

line | quotes                                   | speaker | speaker_na 
4    | JON SMITH: But you already knew that     | Jon     | NA
5    | the typical turn around waiting could be | NA      | Jon
6    | anywhere from 2 to 6 hours               | NA      | NA
7    | and that is being generous!              | NA      | NA

Thank you for any help! I tried to provide enough information, but if anything else is requested I will happily supply!

When you say "collapse" these lines, what exactly do you mean? And if a speaker says three or four lines, are you saying you want to propagate their name down until someone else starts speaking?

Exactly!

I would like the resultant column to look like:

line | quotes                                   | speaker | speaker_na  | speaker_combined
__________________________________________________________________________________________
1    | JON SMITH: How are you all doing today?  | Jon     | NA          |  Jon
2    | The weather is pretty cold I think       | NA      | Jon         |  Jon
3    | JANE DOE you are right about that Jon    | Jane    | NA          |  Jane
4    | JON SMITH: But you already knew that     | Jon     | NA          |  Jon
5    | the typical turn around waiting could be | NA      | Jon         |  Jon
6    | anywhere from 2 to 6 hours               | NA      | NA          |  Jon
7    | and that is being generous!              | NA      | NA          |  Jon
8    | JANE DOE  thanks devtsch75               | Jane    | NA          | Jane  

Then I guess tidyr::fill() will be your friend (see "Fill in missing values with previous or next value" in the tidyr documentation). It looks like it can be applied right after mutate(speaker ....
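As a minimal sketch of what fill() does on its own (the toy tibble below is made up for illustration and is not from the transcript), it carries the last non-missing value down the column until a new value appears:

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: NA marks a line with no speaker prefix
toy <- tibble(
  line = 1:6,
  speaker = c(NA, "Jon", NA, NA, "Jane", NA)
)

# Default .direction = "down": carry the last non-NA value forward
filled <- fill(toy, speaker)
filled$speaker
```

Note that the NA in the first row stays NA, because there is nothing above it to carry down. That should not matter as long as the transcript starts with a labelled line.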

Wow, thank you so much! I did not know about tidyr::fill() and it did exactly the right trick!

df_rs %>%
  mutate(
    speaker = case_when(
      str_detect(quotes, "^JO") ~ "Jon",
      str_detect(quotes, "^JA") ~ "Jane")) %>%
  fill(speaker)


  quotes                                    line speaker
  <chr>                                    <int> <chr>  
1 JON SMITH: How are you all doing today?      1 Jon    
2 The weather is pretty cold I think           2 Jon    
3 JANE DOE you are right about that Jon        3 Jane   
4 JON SMITH: But you already knew that         4 Jon    
5 the typical turn around waiting could be     5 Jon    
6 anywhere from 2 to 6 hours                   6 Jon    
7 and that is being generous!                  7 Jon 
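
For comparison, here is a rough sketch of why the earlier lag() attempt falls short, assuming the two columns were "collapsed" with dplyr::coalesce() (that choice is an assumption, it was not tried in the thread). The small tibble just reproduces the speaker labels for lines 4 to 7:

```r
library(dplyr)
library(tibble)

# Speaker labels as the case_when() step would give them for lines 4-7
turns <- tibble(
  line = 4:7,
  speaker = c("Jon", NA, NA, NA)
)

compared <- turns %>%
  mutate(
    # lag() only looks one row back, so only line 5 picks up "Jon"
    speaker_na = if_else(is.na(speaker), lag(speaker), NA_character_),
    # coalesce() takes the first non-NA value across the two columns
    speaker_combined = coalesce(speaker, speaker_na)
  )
compared$speaker_combined
```

So coalesce() handles the "collapse" part, but lines 6 and 7 are still NA, which is the gap fill() closes in one step.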