I am using the new conText package in R to run a context embedding regression model. This model allows me to assess whether the context in which a focal word appears (the words before and after it) varies as a function of covariates. Below I provide the code I have written thus far:
    # load packages
    library(quanteda)
    library(ldatuning)
    library(topicmodels)
    library(tidytext)
    library(tidyverse)
    library(parallel)
    library(conText)
    library(data.table)
    library(text2vec)

    # load speeches
    speeches <- read_csv("speeches_final.csv")

    # create corpus
    # preparing speeches
    speeches$text <- as.character(speeches$text)
    speeches$docnames <- seq.int(nrow(speeches))
    speeches_corpus <- quanteda::corpus(speeches, text_field = "text")

    # tokenize corpus removing unnecessary (i.e. semantically uninformative) elements
    toks <- tokens(speeches_corpus,
                   remove_punct = TRUE,
                   remove_symbols = TRUE,
                   remove_numbers = TRUE,
                   remove_separators = TRUE)

    # clean out stopwords and words with 2 or fewer characters
    toks_nostop <- tokens_select(toks,
                                 pattern = stopwords("ru", source = "snowball"),
                                 selection = "remove",
                                 min_nchar = 3)

    # only use features that appear at least 5 times in the corpus
    feats <- dfm(toks_nostop, tolower = TRUE, verbose = TRUE) %>%
      dfm_trim(min_termfreq = 5) %>%
      featnames()

    # leave the pads so that non-adjacent words will not become adjacent
    toks <- tokens_select(toks_nostop, feats, padding = TRUE)

    # build a tokenized corpus of contexts surrounding the target term 'economy'
    economy_toks <- tokens_context(x = toks, pattern = "экономи*", window = 6L)

    # build document-feature matrix
    economy_dfm <- dfm(economy_toks)
    economy_dfm[1:3, 1:3]

    # construct the feature co-occurrence matrix for our toks object (see above)
    toks_fcm <- fcm(toks, context = "window", window = 6,
                    count = "frequency", tri = FALSE)

    # estimate glove model using text2vec
    glove <- GlobalVectors$new(rank = 300, x_max = 10, learning_rate = 0.05)
    wv_main <- glove$fit_transform(toks_fcm,
                                   n_iter = 10,
                                   convergence_tol = 0.001,
                                   n_threads = parallel::detectCores()) # set to 'parallel::detectCores()' to use all available cores
    wv_context <- glove$components
    local_glove <- wv_main + t(wv_context) # word vectors

    local_transform <- compute_transform(x = toks_fcm,
                                         pre_trained = local_glove,
                                         weighting = "log")
All of the above code executes without issue. The problem occurs when I try to run the next chunk of code, the actual conText model. In this case, my focal word is экономи* (Russian for economy) and my covariates are dummy variables for date and party affiliation.
    # run the context embedding regression model
    set.seed(2021L)
    model1 <- conText(formula = "экономи*" ~ Date_dummy + party_ur,
                      data = toks,
                      pre_trained = local_glove,
                      transform = TRUE,
                      transform_matrix = local_transform,
                      bootstrap = TRUE,
                      num_bootstraps = 10,
                      permute = TRUE,
                      num_permutations = 100,
                      window = 6L,
                      case_insensitive = TRUE,
                      verbose = TRUE)
When I run this code, I receive the following error message:
    Error in solve.default(t(X_mat) %*% X_mat) : system is computationally singular: reciprocal condition number = 0.

This suggests that the cross-product of the design matrix, t(X_mat) %*% X_mat, is not invertible. I have performed checks to make sure that my variables are not collinear, and I have tried debugging the code, to no avail. I am truly lost as to what is going on here. Note that when I run the above model with Date_dummy as the only covariate, I do get results. This leads me to believe that the problem lies with the party_ur variable. I am happy to provide my full code and data if that would help. Any feedback would be greatly appreciated.
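
For concreteness, here is a minimal sketch of the kind of rank check that seems relevant. It uses base R only and a made-up data frame: the values, the has_target column, and the is_full_rank helper are hypothetical and are not taken from my data or from conText. It assumes the regression is fit only on documents where the focal word actually occurs, so two dummies that are not collinear in the full corpus could still coincide within that subset.

```r
# Hypothetical toy covariates: column names mirror my model, values are made up.
# has_target marks documents in which the focal word would be matched.
covars <- data.frame(
  Date_dummy = c(0, 0, 1, 1, 0, 1, 0, 1),
  party_ur   = c(0, 1, 0, 1, 1, 1, 0, 1),
  has_target = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Hypothetical helper: TRUE if the design matrix (intercept + both dummies)
# has full column rank, i.e. t(X) %*% X is invertible.
is_full_rank <- function(df) {
  X <- model.matrix(~ Date_dummy + party_ur, data = df)
  qr(X)$rank == ncol(X)
}

# Check the full data and the subset containing the focal word separately.
full_ok   <- is_full_rank(covars)
subset_ok <- is_full_rank(covars[covars$has_target, ])

print(c(full = full_ok, subset = subset_ok))
```

In this toy example the two dummies are distinct across all rows but identical on the rows where has_target is TRUE, so the check passes on the full data and fails on the subset. Whether something similar is happening with party_ur in my data is exactly what I am unsure about.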