Nearest Neighbors Search With Cosine Similarity

I have a dataframe of word embeddings that I've created, and I want to perform a K-Nearest Neighbors search with cosine similarity. In other words, for a given word embedding (a row in my dataset), give me the nearest k word embeddings according to the cosine similarity metric.

In Python I would do this with sklearn.neighbors.NearestNeighbors and specify metric = 'cosine'. However, getting the output back into a dataframe keyed by an identifier column instead of row indices takes some data manipulation that I think would be cleaner in R with dplyr/tidyr/etc.
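For reference, here is a minimal sketch of the Python approach described above. The toy dataframe, the `word` identifier column, and the `d1`/`d2` embedding columns are hypothetical stand-ins for my real data, and the sketch assumes no two rows point in exactly the same direction (so each row's closest match is itself).

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy data: an identifier column plus embedding columns.
df = pd.DataFrame({
    "word": ["king", "queen", "apple", "pear"],
    "d1": [1.0, 0.9, 0.0, 0.1],
    "d2": [0.0, 0.1, 1.0, 0.9],
})

X = df[["d1", "d2"]].to_numpy()
k = 2

# metric="cosine" returns cosine distance, i.e. 1 - cosine similarity.
# Ask for k + 1 neighbors because each row matches itself first.
nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)
dist, idx = nn.kneighbors(X)

# Drop the self-match in column 0, then map row indices back to identifiers
# and reshape into a long dataframe with one row per (word, neighbor) pair.
words = df["word"].to_numpy()
neighbors = pd.DataFrame({
    "word": np.repeat(words, k),
    "neighbor": words[idx[:, 1:]].ravel(),
    "cosine_similarity": 1.0 - dist[:, 1:].ravel(),
})
print(neighbors)
```

The index-to-identifier mapping at the end is the manipulation step I would rather express with dplyr/tidyr.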

I did find the FNN package, which can do a nearest neighbors search with the 'get.knnx' function, but it doesn't look like you can specify a metric such as cosine similarity. Is there a package that has this capability in R? If not, would this potentially fall under the scope of one of the tidymodels or tidyverse packages in the future?

You could calculate the cosine similarities as a first step, then pass those results on to be clustered?

Check the tcR package (https://cran.r-project.org/web/packages/tcR/) for cosine.similarity.
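To illustrate the "similarities first" idea, here is a rough sketch in Python with NumPy (used only for illustration; it is not the tcR function). The array `X` is hypothetical toy data. The sketch builds the full n x n similarity matrix, so memory and time grow quadratically with the number of rows.

```python
import numpy as np

# Hypothetical toy embeddings: 4 vectors in 2 dimensions.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
k = 2

# Scale every row to unit length so a plain dot product is the cosine similarity.
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = unit @ unit.T  # full n x n cosine similarity matrix

# Exclude each row from its own neighbor list, then take the k most similar rows.
np.fill_diagonal(sim, -np.inf)
top_k = np.argsort(-sim, axis=1)[:, :k]
print(top_k)
```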

Thanks for the link. That's true; however, this approach would require calculating cosine similarity for all possible pairs of rows as an initial step. For smaller datasets this isn't too big of an issue, but for larger datasets I'd likely run into performance issues that would benefit from a search structure such as a kd-tree.
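One possible workaround, sketched here in Python under stated assumptions: if every vector is scaled to unit length, the squared Euclidean distance between two vectors equals 2 - 2 * cosine similarity, so a Euclidean kd-tree search returns neighbors in the same order as a cosine search. The random data below is a hypothetical stand-in for real embeddings, and tree-based search tends to lose its advantage as the number of dimensions grows, so this may or may not help with typical embedding sizes.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for real embeddings: 200 vectors in 8 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
k = 5

# Scale every row to unit length.
unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Euclidean kd-tree search on the unit vectors; k + 1 because each row finds itself first.
tree = NearestNeighbors(n_neighbors=k + 1, algorithm="kd_tree", metric="euclidean").fit(unit)
dist, idx = tree.kneighbors(unit)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b), so cos(a, b) = 1 - d^2 / 2.
neighbor_idx = idx[:, 1:]
cos_sim = 1.0 - dist[:, 1:] ** 2 / 2.0
```

The same trick should carry over to any Euclidean-only neighbor search, provided the rows are normalized before they are passed in.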
