Splitting a bi-gram into two separate columns

#1

I got a document-feature matrix from the quanteda package. The features are bigrams in the form word1_word2:
feature
1 good_morning
2 right_now
3 years_ago
4 last_night
5 r_u
6 ou_know

I would like to separate these bigrams into word1 in one column and word2 in another column, like this:
word1 word2
good morning
right now
years ago
last night
r u
ou know

The part to the left of the underscore becomes word1, and the part to the right becomes word2. How do I do this in R?

#2

I'm not familiar with the output of the quanteda package, but if you can convert the output to a data frame, then you can use tidyr::separate():

library(tidyr)

df <- data.frame(stringsAsFactors = FALSE,
                 bigram = c("good_morning", "right_now", "years_ago",
                            "last_night", "r_u", "ou_know")) 
separate(df, bigram, c("word1", "word2"), sep = "_")
#>   word1   word2
#> 1  good morning
#> 2 right     now
#> 3 years     ago
#> 4  last   night
#> 5     r       u
#> 6    ou    know

Created on 2019-04-15 by the reprex package (v0.2.1.9000)
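
If you would rather not pull in tidyr, here is a minimal base R sketch of the same split. It assumes the feature names are already in a plain character vector; the `features` vector below is a hypothetical stand-in for whatever you extract from the dfm (quanteda's featnames() should give you that vector, but I have not run it against your object).

```r
# Hypothetical stand-in for the feature names taken from the dfm.
features <- c("good_morning", "right_now", "years_ago",
              "last_night", "r_u", "ou_know")

# Split each bigram at the underscore; fixed = TRUE treats "_" literally.
parts <- strsplit(features, "_", fixed = TRUE)

# Stack the pieces row-wise into a two-column data frame.
words <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(words) <- c("word1", "word2")
words
```

This only works cleanly when every feature contains exactly one underscore, which is the case for the bigrams shown here.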

#3

Thank you very much! It worked! Another question--how do I separate trigrams? My data looks like this:
feature
1 enjoying_case_presentations
2 case_presentations_students
3 presentations_students_w
4 students_w_good
5 w_good_luck
6 good_luck_students

I would like it to look like this:
words1:2 word3
1 enjoying_case presentations
2 case_presentations students
3 presentations_students w
4 students_w good
5 w_good luck
6 good_luck students

#4

I'm not at all good at regular expressions. However, based on this answer on SO, here's a way to do it:

library(tidyr)

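df <- data.frame(stringsAsFactors = FALSE,
                 trigram = c("enjoying_case_presentations",
                             "case_presentations_students",
                             "presentations_students_w", "students_w_good",
                             "w_good_luck", "good_luck_students"))

# Split only at the last underscore: the lookahead (?=[^_]+$) requires that
# no further underscore follows before the end of the string.
separate(df, trigram, c("word_12", "word_3"), sep = "_(?=[^_]+$)")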
#>                  word_12        word_3
#> 1          enjoying_case presentations
#> 2     case_presentations      students
#> 3 presentations_students             w
#> 4             students_w          good
#> 5                 w_good          luck
#> 6              good_luck      students

Created on 2019-04-16 by the reprex package (v0.2.1)
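
In case it helps to see what that pattern is doing, here is a rough base R sketch that finds the same split point by hand. The variable names (`trigrams`, `last_sep`) are made up for illustration, and this is only one way to do it, not necessarily how separate() works internally.

```r
trigrams <- c("enjoying_case_presentations", "students_w_good", "w_good_luck")

# "_(?=[^_]+$)" matches an underscore only when the rest of the string,
# up to its end, contains no other underscore, i.e. the last underscore.
# Lookaheads need perl = TRUE in base R's regex functions.
last_sep <- as.integer(regexpr("_(?=[^_]+$)", trigrams, perl = TRUE))

# Everything before the last underscore, and everything after it.
word_12 <- substr(trigrams, 1, last_sep - 1)
word_3  <- substr(trigrams, last_sep + 1, nchar(trigrams))

data.frame(word_12, word_3, stringsAsFactors = FALSE)
```

The same idea should carry over to longer n-grams, since the pattern always picks the final underscore regardless of how many come before it.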

#5

Thank you! It worked!
