Hello dear community.
I'm trying to clean a dataset that has misspelled words. I couldn't solve it with regex since there is no pattern to work from. Instead, I'm trying to use the stringdist package to find high-percentage matches in another dataset that I created. I don't know much about for loops and couldn't figure out how to apply one here. Here is an example of my data.
library(stringdist)
library(data.table)

data1 <- c("neeyork", "dalas", "houson", "new york")
data2 <- c("houston", "newyork", "dallas", "washington")
for (i in 1:length(data1)) {
  data99 <- data.table(stringsim(data1[i], data2, method = "cosine"))
}
# It gives me something like this:
#>1 0.3535534
#>2 0.9354143
#>3 0.0000000
#>4 0.4082483
Correct me if I'm wrong, but I think my for loop is failing because it only tries to match "neeyork" against the data2 values. How can I fix it? Also, even if I fix this, how am I supposed to know which data1 value matched which data2 value? I have over 10k values in data1 and 50k+ values in data2.
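One possible direction, as a minimal sketch rather than a definitive fix: in the loop above, data99 is reassigned on every pass, so only the result for the last element of data1 survives (the printed scores appear to be those of "new york", not "neeyork"). The sketch below keeps one result per data1 value by writing into preallocated vectors, and records the index of the best-scoring data2 value so each match can be traced back. The names best_idx, best_sim, and matches are made up for illustration, and taking the single highest cosine score as "the match" is an assumption; a minimum-similarity cutoff may be needed in practice.

```r
library(stringdist)

data1 <- c("neeyork", "dalas", "houson", "new york")
data2 <- c("houston", "newyork", "dallas", "washington")

# Preallocate one slot per data1 value so nothing is overwritten.
best_idx <- integer(length(data1))
best_sim <- numeric(length(data1))

for (i in seq_along(data1)) {
  # Similarity of the i-th data1 value against every data2 value.
  sims <- stringsim(data1[i], data2, method = "cosine")
  # Keep only the best hit; the full score vector is discarded,
  # so memory use stays proportional to length(data2).
  best_idx[i] <- which.max(sims)
  best_sim[i] <- sims[best_idx[i]]
}

# One row per data1 value, with the data2 value it matched best.
matches <- data.frame(
  data1 = data1,
  match = data2[best_idx],
  similarity = best_sim,
  stringsAsFactors = FALSE
)
print(matches)
```

Processing one data1 value at a time avoids building a 10k x 50k similarity matrix in memory. The stringdist package also provides amatch(), which returns the index of the closest table entry within a maxDist threshold; whether it fits this data better is worth checking against its documentation.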