Distance between two documents with Jaccard distance

#1

Split from Double loop to create a dataframe


I have a problem that I thought I could solve myself, but my attempt doesn't work.

In a folder I have several .txt files, and I want to analyse the distance between each pair of documents.

For example, in:
/desktop/folder/
I have "1.txt", "2.txt" and "3.txt".
Each document has one word per line, like "ok", "good", "funny".

What I want is to study the distance between each pair of documents with the Jaccard distance and get a result like:

doc 1       doc 2       result
1.txt       1.txt       1,0
1.txt       2.txt       0,3
1.txt       3.txt       0,2
2.txt       1.txt       1,0
2.txt       2.txt       0,3
2.txt       3.txt       0,2
3.txt       1.txt       1,0
3.txt       2.txt       0,3
3.txt       3.txt       0,2
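
For reference, the Jaccard similarity of two word sets is the size of their intersection divided by the size of their union, and the Jaccard distance is one minus that value. The sketch below is only an illustration in base R: the helper name `jaccard_similarity` and the two word lists are made up here and are not taken from any package.

```r
# Illustrative helper (not from a package): |A intersect B| / |A union B|
jaccard_similarity <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical word lists, one element per line of a document
doc1 <- c("ok", "good", "funny")
doc2 <- c("ok", "bad", "funny", "sad")

sim_same <- jaccard_similarity(doc1, doc1)  # identical sets give 1
sim_diff <- jaccard_similarity(doc1, doc2)  # 2 shared words out of 5 distinct words
dist_diff <- 1 - sim_diff                   # the corresponding Jaccard distance

c(sim_same, sim_diff, dist_diff)
```

Note that a document compared with itself has similarity 1 and distance 0, so a table with 1,0 on the diagonal is a table of similarities.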

To do that, here is the full code:

folder <- "/desktop/folder/"      # path to folder that holds multiple .txt files
files_names3 <- list.files(path="/Users/sylvain/desktop/folder/", pattern="*.txt")

Create a data frame for each document:

for (i in 1:length(files_names3)){
  assign(files_names3[i],
         read.delim(files_names3[i])
  )}

Then I tried this to build the whole table, but it doesn't work. Any ideas why?

all <- ''

for (i in 1:nrow(files_names3)){
  for(i in 1:nrow(files_names3)) {
    all[((i-1)*length(files_names3)+j),] <- c(files_names3[i], files_names3[j], textrank_jaccard(read.delim(paste(folder,top$cat1[i], sep='')),read.delim(paste(folder,top$cat2[i], sep=''))))
  }
}

#2

This is, once again, not a reprex.

You may ask why. Here are a few reasons:

  1. We don't have access to your local files. I understand that the allowable file formats are limited, but you could have written code that writes those files and then reads them back.

  2. You haven't included any library call. I (and possibly many others in this community) do not know what a Jaccard distance is. You used a function textrank_jaccard, but did not mention its package. I'm guessing textrank, but that may not be the case.

  3. What is top? I have no idea about this one.

Please go through the reprex guide. A minimal reproducible example helps others to figure out what problems you may have been facing, and consequently, to help you.
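
As a rough sketch of what point 1 suggests, the snippet below writes three small word lists (one word per line, as in the question) into a temporary directory and reads them back. The file contents and the names `docs` and `jaccard_reprex` are invented for illustration only.

```r
# Hypothetical example data: one word per line, as described in the question
docs <- list(
  "1.txt" = c("ok", "good", "funny"),
  "2.txt" = c("ok", "bad", "funny"),
  "3.txt" = c("sad", "bad")
)

# Write the files into a temporary directory so no local folder is needed
folder <- file.path(tempdir(), "jaccard_reprex")
dir.create(folder, showWarnings = FALSE)
for (name in names(docs)) {
  writeLines(docs[[name]], file.path(folder, name))
}

# Read them back as character vectors, one element per line
file_names <- list.files(folder, pattern = "\\.txt$")
file_contents <- lapply(file.path(folder, file_names), readLines)
names(file_contents) <- file_names

file_contents
```

Anyone can run this as-is, which is the point of a reprex: the data travels with the code.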

There are a few problems with your code.

  1. files_names3 is a vector. You can't use nrow with it.

  2. You used i in both the for loops.

  3. Why are you using all <- ''? It's a character vector, and you can't add rows to it later.

  4. I'm unable to figure out why you expect the output to be in the format shown in your post. The documentation says it returns a single number, so why a tuple? Also, as all the files are identical, why do you think different values will be produced? I have no idea about this particular distance measure, but I don't think this is how it is expected to behave.

  5. This is not a problem, but 1.txt, 2.txt, 3.txt as names of some objects is probably a bad idea. It's very confusing in my opinion.
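
A minimal sketch of how points 1 to 3 could be addressed, using in-memory word vectors instead of files. The word lists and the helper `jaccard()` are illustrative assumptions, not the asker's data and not the textrank implementation.

```r
# Hypothetical word lists standing in for the contents of the .txt files
words <- list(
  "1.txt" = c("ok", "good", "funny"),
  "2.txt" = c("ok", "bad", "funny"),
  "3.txt" = c("sad", "bad")
)
files <- names(words)

# Illustrative helper: size of the intersection over size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

n <- length(files)  # length(), not nrow(), because files is a vector

# Preallocate a data frame with one row per ordered pair of files
result <- data.frame(doc1 = character(n^2),
                     doc2 = character(n^2),
                     similarity = numeric(n^2),
                     stringsAsFactors = FALSE)

for (i in seq_along(files)) {      # outer index i
  for (j in seq_along(files)) {    # inner index j, distinct from i
    row <- (i - 1) * n + j
    result$doc1[row] <- files[i]
    result$doc2[row] <- files[j]
    result$similarity[row] <- jaccard(words[[i]], words[[j]])
  }
}

result
```

Because the result is preallocated as a data frame, each column keeps its own type and rows can be filled by index.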

Since it is your first post, I'm providing a reprex after modifying your code a little bit.

Working code:
# loading required library
library(textrank)

# creating files
write.table(x = "ok, good, funny",
            file = "1.txt",
            row.names = FALSE,
            col.names = FALSE)
write.table(x = "ok, good, funny",
            file = "2.txt",
            row.names = FALSE,
            col.names = FALSE)
write.table(x = "ok, good, funny",
            file = "3.txt",
            row.names = FALSE,
            col.names = FALSE)

# listing files
file_names <- list.files(pattern="*.txt")

# reading files
file_contents <- vector(mode = "list",
                        length = length(x = file_names))

for (i in seq_len(length.out = length(x = file_names)))
{
  file_contents[[i]] <- read.delim(file = file_names[i])
}

# calculation
all <- matrix(ncol = 3,
              nrow = ((length(x = file_names)) ^ 2))

for(i in seq_len(length.out = length(x = file_names)))
{
  for(j in seq_len(length.out = length(x = file_names)))
  {
    all[((i - 1) * length(x = file_names) + j), ] <- c(file_names[i], file_names[j], textrank_jaccard(termsa = file_contents[[i]],
                                                                                                      termsb = file_contents[[j]]))
  }
}

all <- as.data.frame(x = all)

all
#>      V1    V2 V3
#> 1 1.txt 1.txt  1
#> 2 1.txt 2.txt  1
#> 3 1.txt 3.txt  1
#> 4 2.txt 1.txt  1
#> 5 2.txt 2.txt  1
#> 6 2.txt 3.txt  1
#> 7 3.txt 1.txt  1
#> 8 3.txt 2.txt  1
#> 9 3.txt 3.txt  1

Created on 2019-03-25 by the reprex package (v0.2.1)
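
One detail of the reprex above: `c()` coerces the two file names and the numeric result to a common type, so the matrix is a character matrix and `V3` is not numeric after `as.data.frame()`. A possible alternative, sketched here with made-up word lists and an illustrative `jaccard()` helper rather than `textrank_jaccard()`, builds the pairs with `expand.grid()` and fills a numeric column with `mapply()`.

```r
# Hypothetical word lists, differing on purpose so the values are not all 1
words <- list(
  "1.txt" = c("ok", "good", "funny"),
  "2.txt" = c("ok", "bad", "funny"),
  "3.txt" = c("sad", "bad")
)

# Illustrative helper: size of the intersection over size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# All ordered pairs of file names; expand.grid() varies its first argument
# fastest, so doc2 goes first to reproduce the row order of the double loop
pairs <- expand.grid(doc2 = names(words), doc1 = names(words),
                     stringsAsFactors = FALSE)[, c("doc1", "doc2")]

# One similarity per row, kept as a numeric column
pairs$similarity <- mapply(function(a, b) jaccard(words[[a]], words[[b]]),
                           pairs$doc1, pairs$doc2,
                           USE.NAMES = FALSE)

pairs
```

This keeps the file names as character columns and the similarity as a number, which makes later filtering or sorting on the value straightforward.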
