Split rows and merge data (Faster Approach)

Hi. I have a data frame in which I want to split each row into sentences.

df <- data.frame(
  text = c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..",
           "I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."),
  id = c(1, 2),
  stringsAsFactors = FALSE
)

I want to split the sentences in the text column and come up with the following:

df <- data.frame(
  text = c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..",
           "I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment.",
           "Lately, I haven't been able to view my Online Payment Card.",
           "It's prompting me to have to upgrade my account whereas before it didn't.",
           "I have used the Card at various online stores before and have successfully used it.",
           "But now it's starting to get very frustrating that I have to said upgrade my account.",
           "Do fix this|",
           "**I noticed some users have the same issue|",
           "I've been using this app for almost 2 years without any problems.",
           "Until, their system just blocked my virtual paying card without any notice.",
           "So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs.",
           "This app has been a big disappointment."),
  id  = c(1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  tag = c("DONE", "DONE", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

I have done this with the code below; however, the for-loop is very slow, and I need to process 73,000 rows. So I need a faster approach.

library("qdap")
df$tag <- NA
for (review_num in 1:nrow(df)) {
  x = sent_detect(df$text[review_num])
  if (length(x) > 1) {
    for (sentence_num in 1:length(x)) {
      df <- rbind(df, df[review_num,])
      df$text[nrow(df)]   <- x[sentence_num]
    }
    df$tag[review_num] <- "DONE"
  }
}

Hi there,
Try this tidyverse approach, which may be quicker:

library(qdap)
library(dplyr)
library(tidyr)

df <- data.frame(
  text = c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..",
           "I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."),
  id = c(1, 2),
  stringsAsFactors = FALSE
)

df %>%
  group_by(text) %>% 
  mutate(sentences = list(sent_detect(df$text))) %>% 
  unnest(cols=sentences) -> out.df

out.df
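A per-row variant of the same idea, in case you want each review split against only its own text (a sketch; rowwise() makes sent_detect see just the current row, and ungroup() avoids carrying the rowwise grouping into unnest):

df %>%
  rowwise() %>%
  mutate(sentences = list(sent_detect(text))) %>%
  ungroup() %>%
  unnest(cols = sentences)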

Someone may be able to supply you with a {data.table} solution; a rough sketch is below.
HTH
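For what it's worth, a minimal {data.table} sketch of the same split might look like this (a guess, untested; it assumes sent_detect returns a plain character vector per review, which data.table then expands into one row per sentence):

library(data.table)

dt <- as.data.table(df)
out <- dt[, .(text = sent_detect(text)), by = id]  # one row per detected sentence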

Thank you @DavoWW for this. I checked its performance, but for 29,000 rows it took 170 minutes.

I also came up with another solution, but for 29,000 rows it still takes 28 minutes. Below is the code.

reviews_df1 <- data.frame(id = character(0), text = character(0))
for (review_num in 1:nrow(df)) {
  preprocess_sent <- sent_detect(df$text[review_num])
  if (length(preprocess_sent) > 0) {
    x <- data.frame(id = df$id[review_num],
                    text = preprocess_sent)
    reviews_df1 <- rbind(reviews_df1, x)
  }
}
colnames(reviews_df1) <- c("id", "text")

For 73,000 rows, it still took 252 minutes, or about 4 hours. I think it is because of the rbind: growing a data frame with rbind inside a loop copies all of the accumulated rows on every iteration, so the total work grows quadratically.
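For reference, the usual way around the rbind cost is to collect the per-review pieces in a list and bind them once at the end; a minimal sketch of that pattern:

pieces <- vector("list", nrow(df))
for (review_num in seq_len(nrow(df))) {
  preprocess_sent <- sent_detect(df$text[review_num])
  if (length(preprocess_sent) > 0) {
    pieces[[review_num]] <- data.frame(id = df$id[review_num],
                                       text = preprocess_sent,
                                       stringsAsFactors = FALSE)
  }
}
reviews_df1 <- do.call(rbind, pieces)  # single bind; rbind skips the NULL entries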

If all you care about are the results, and not so much which line was birthed from which other line, by far the simplest and fastest approach is to use sent_detect in a vectorised fashion and convert the result to a data frame only once.

sent_detect(df$text) %>% enframe()
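enframe() gives a two-column tibble (name = position in the vector, value = the sentence); the columns can be renamed inline if you prefer:

sent_detect(df$text) %>% enframe(name = "sentence_num", value = "text")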

Benchmarking results:

Unit: milliseconds
             expr      min       lq      mean    median        uq       max neval
   og1_r <- og1() 3.532601 3.579601  4.135687  3.648151  4.163600 13.325601    50
   dp2_r <- dp2() 9.171301 9.691102 11.231707 10.461751 12.298801 18.969201    50
   rv2_r <- rv2() 2.840801 2.877400  3.416939  3.007852  3.984101  6.159802    50
 vec3_r <- vec3() 1.092001 1.102000  1.285935  1.111101  1.329501  3.188600    50

Below is a full benchmarking script you can use to compare the different approaches seen here. Note that the row counts differ: some of the earlier approaches keep the original unsplit lines or duplicate sentences, so check the nrow() outputs at the end.

library(microbenchmark)
library(qdap)
library(tidyverse)
df <- data.frame(
  text = c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..",
           "I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."),
  id = c(1, 2),
  stringsAsFactors = FALSE
)

og1 <- function(){
  df$tag <- NA
  for (review_num in 1:nrow(df)) {
    x <- sent_detect(df$text[review_num])
    if (length(x) > 1) {
      for (sentence_num in 1:length(x)) {
        df <- rbind(df, df[review_num,])
        df$text[nrow(df)]   <- x[sentence_num]
      }
      df$tag[review_num] <- "DONE"
    }
  }
  return(tibble(df))
}

dp2 <- function(){
  
  df %>%
    group_by(text) %>% 
    mutate(sentences = list(sent_detect(df$text))) %>% 
    unnest(cols=sentences) -> out.df
  
  out.df
}

rv2 <- function(){
  reviews_df1 <- data.frame(id = character(0), text = character(0))
  for (review_num in 1:nrow(df)) {
    preprocess_sent <- sent_detect(df$text[review_num])
    if (length(preprocess_sent) > 0) {
      x <- data.frame(id = df$id[review_num],
                      text = preprocess_sent)
      reviews_df1 <- rbind(reviews_df1, x)
    }
  }
  colnames(reviews_df1) <- c("id", "text")
  return(tibble(reviews_df1))
}

vec3 <- function(){
  sent_detect(df$text) %>% enframe()
}

og1_rm <- microbenchmark(og1_r <- og1(), times = 50L)
dp2_rm <- microbenchmark(dp2_r <- dp2(), times = 50L)
rv2_rm <- microbenchmark(rv2_r <- rv2(), times = 50L)
vec3_rm <- microbenchmark(vec3_r <- vec3(), times = 50L)

og1_r
dp2_r
rv2_r
vec3_r


rbind(og1_rm, dp2_rm, rv2_rm, vec3_rm)
# Unit: milliseconds
# expr      min       lq      mean    median        uq       max neval
# og1_r <- og1() 3.532601 3.579601  4.135687  3.648151  4.163600 13.325601    50
# dp2_r <- dp2() 9.171301 9.691102 11.231707 10.461751 12.298801 18.969201    50
# rv2_r <- rv2() 2.840801 2.877400  3.416939  3.007852  3.984101  6.159802    50
# vec3_r <- vec3() 1.092001 1.102000  1.285935  1.111101  1.329501  3.188600    50

nrow(og1_r)
nrow(dp2_r)
nrow(rv2_r)
nrow(vec3_r)

Hi @nirgrahamuk. I will check how the vectorised approach can be used; however, I also need to know which line each sentence was birthed from, which is why I added the id in reviews_df.

So far, I have been able to reduce the 28 minutes to 3 minutes by using bind_rows instead of rbind, though I am getting warnings with this approach. For 73,000 rows it still takes 16 minutes, so I am still looking for ways to speed it up.
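A sketch of that bind_rows pattern, assuming the pieces are collected in a list first; my guess is the warnings come from bind_rows coercing factor columns to character, which stringsAsFactors = FALSE avoids:

pieces <- lapply(seq_len(nrow(df)), function(i) {
  s <- sent_detect(df$text[i])
  if (length(s) > 0) data.frame(id = df$id[i], text = s,
                                stringsAsFactors = FALSE)
})
reviews_df1 <- dplyr::bind_rows(pieces)  # NULL entries are skipped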

OK, check out the sentSplit function of qdap: it records the source row and sentence number in its tot field, and it's pretty fast too.

vec4 <- function(){
  sentSplit(dataframe = df, text.var = "text")
}
vec4_rm <- microbenchmark(vec4_r <- vec4(), times = 50L)
#----------------
Unit: milliseconds
             expr    min     lq    mean  median     uq    max neval
 vec4_r <- vec4() 1.9888 2.0079 2.30414 2.05195 2.3544 5.7475    50
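If I read the qdap output right, the tot column ties each sentence back to its source row (1.1, 1.2, ... for the sentences of row 1):

vec4_r <- sentSplit(dataframe = df, text.var = "text")
vec4_r[, c("id", "tot")]  # tot = source row . sentence number, e.g. 1.1, 1.2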

Hi @nirgrahamuk, this works faster. Though I still need some extra handling: if a sentence has no endmark, it returns NA. Anyway, thank you so much!

Maybe preprocessing all the original lines to place a final endmark where one is missing would solve that?

Yes, that is the approach, @nirgrahamuk.
Below is my code to solve that.

missing <- end_mark(df$text) == "_"                # "_" = no terminal endmark
df$text[missing] <- paste0(df$text[missing], "-")  # append a sentinel endmark
df <- sentSplit(dataframe = df, text.var = "text",
                endmarks = c("?", ".", "!", "|", "-"))  # keep the defaults and add "-"
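For the full 73,000-row run it may be convenient to wrap that step as a helper; a sketch (split_reviews is my name for it, not a qdap function):

split_reviews <- function(df) {
  missing <- end_mark(df$text) == "_"                # "_" = no terminal endmark
  df$text[missing] <- paste0(df$text[missing], "-")  # append a sentinel endmark
  sentSplit(dataframe = df, text.var = "text",
            endmarks = c("?", ".", "!", "|", "-"))   # qdap defaults plus "-"
}

sentences <- split_reviews(df)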

I have posted another question related to speeding up code: Speed Up Textual Data Preprocessing and POS Tagging.

Just wondering if you have ideas for that one as well. @nirgrahamuk :wink: