Performance issues working with large (n=768) NLP feature vectors

I have been using the spaCy library and BERT NLP model to generate feature vectors for natural language text. These feature vectors consist of 768 floating point numbers for each piece of text.

In Python, I store the vectors in a data frame column and then pass them as the exogenous variables to a linear regression model.

In R/tidyverse I am running into issues with how to pass the vectors to the lm function for modeling. I learned that you can pass all remaining columns of a data frame to lm using dot (.), and transforming the data into that format, with each vector element as its own column, works for small/toy cases. However, unnest_wider is much too slow in the real-world case.

Toy case

library(tidyverse)

VECTOR_SIZE <- 5
NUM_ROWS <- 21

# One row per "document": a response value plus a list-column of feature vectors
df <- tibble(endog_var = runif(NUM_ROWS)) %>%
    mutate(exog_vec = map(seq(NUM_ROWS), function(n) {runif(VECTOR_SIZE)}))

# Spread each vector element into its own column so lm can see them all via `.`
df_wider <- df %>% unnest_wider(exog_vec)

df_wider
lm(endog_var ~ ., data = df_wider)

Actual-size case

VECTOR_SIZE <- 768
NUM_ROWS <- 147957

# Same structure as the toy case, but now 147,957 rows of 768-element vectors
df <- tibble(endog_var = runif(NUM_ROWS)) %>%
    mutate(exog_vec = map(seq(NUM_ROWS), function(n) {runif(VECTOR_SIZE)}))

# This step is much too slow and produces tons of textual output
# df_wider <- df %>% unnest_wider(exog_vec)

# lm(endog_var ~ ., data=df_wider)

Is there a better approach for working with these vectors and passing them as input to models?

Try:

as.data.frame(do.call(rbind, purrr::map2(df$endog_var, df$exog_vec,
                                         ~ cbind(endog_var = .x, t(matrix(.y))))))
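
For context (this note is mine, not part of the original reply): the map2 call builds each observation as a one-row matrix of endog_var followed by the 768 vector elements, rbind stacks those rows, and as.data.frame converts the result into the wide data frame, so the slow unnest_wider step is skipped entirely. The result can be assigned to df_wider and passed to lm(endog_var ~ ., data = df_wider) exactly as in the toy case.

A further sketch, also not from the thread: because lm accepts a numeric matrix as a term on the right-hand side of a formula (fitting one coefficient per column), you can skip the wide data frame altogether and regress on the stacked matrix directly.

# Sketch (assumption: the goal is simply to regress endog_var on all 768 elements)
exog_mat <- do.call(rbind, df$exog_vec)   # NUM_ROWS x VECTOR_SIZE numeric matrix
fit <- lm(df$endog_var ~ exog_mat)        # one coefficient per matrix column
summary(fit)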
