How to match patterns between different datasets.

darkstormx1 · June 3, 2020, 12:04pm

Hi community, I have a task that I need to complete but I couldnt figure it out. Maybe you can help me. So I have a data set which contains places over 50k.

dt1 <- data.table (places <- c("place1","place2","place3","place49998","place50000"),
zipcodes <- 1:50000 )

I need to categorize this places city by city. Some zip codes are missing so they left as NA they are not that important focus point is places and city names. I didnt have any other datasets so I've created a dataset which includes cities and places like this

dt2<- data.table(
  city1 <- c("place1","place2"),
  city2 <- c("place3","place4"),
  city3 <- c("place49998","place50000")
)
setnames(dt2,"V1","NY")
setnames(dt2,"V2","LA")
setnames(dt2,"V3","DC")

So what I want to achieve here is, for every row in dt1 it should look dt2 if there is a match in place names it should paste the city value of that place to the new column in dt1$city. I just started the learn R (only 1.5 months) but I think I need lapply or for-loop for this. Also I should mention that I'm not allowed to use dyplr.

Yarnabrina · June 3, 2020, 12:33pm

Hi, and welcome to the community!

data.table instead of dplyr is not a problem, but I'm afraid I did not understand your question clearly. Can you please explain a bit more?

I understand you have two data tables. The first one, dt1, has two columns, place names and zip codes. The structure of second column I didn't follow. Is it like one column for each of the cities with lots of places in that city? The way you use <- instead of = inside data table creation step is unusual, and more so as you have to rename each column separately.

Do you want to replace dt2 with a data table with one column for each city with zip codes instead of place names? I really did not follow what have you meant by city value?

Can you prepare a small reproducible and copy-paste friendly example with one small sample (say 10-20 rows) of dt1 and dt2, and of your expected output?

darkstormx1 · June 3, 2020, 1:27pm

Here is first 5 row each datatable.

dt1 <-
  data.table(
    places <- c("Harlem","East Village","Hell's Kitchen","Capitol Hill","Dupont Circle","Harlem"),
    zipcodes <- c("10030", "11354", "10019", "98102", "20037","10043")
  )

setnames(dt1,old = c("V1","V2"),new= c("Neigbourhood","Zipcode"))

for dt2 You understand it correctly I have one column for each city and places (neighbourhoods) underthem like this (first 2 columns and 5 rows there are many places neighbourhoods ofc but i didnt type them all for now)

dt2 <- 
  data.table( 
    city1 <- c("NewYork", "Harlem","East Village","Hell's Kitchen","Polo Grounds","Upper Eastside"),
    city2 <- c("Washington DC","Capitol Hill","Dupont Circle","Navy Yard","Anacostia","Foggy Bottom"))
colnames(dt2) <- as.character(dt2[1, ])
dt2 <- dt2[-1,]

as you can see I have Harlem, Hell's Kitchen ... Capitol Hill in dt1 this values are repeating because they are huge areas and have different zipcodes i mean Hell's kitchen has 2 zipcodes harlem has over 10 zipcodes thats why I need to match Neighbourhood names and Cities.

I want to have something like this in the end.

dt3 <-
  data.table(
    places <- c("Harlem","East Village","Hell's Kitchen","Capitol Hill","Dupont Circle","Harlem"),
    zipcodes <- c("10030", "11354", "10019", "98102", "20037","10043"),
    city <- c("NY","NY","NY","DC","DC","NY")
  )

setnames(dt3,old = c("V1","V2","V3"),new= c("Neigbourhood","Zipcode","City"))

When it sees Harlem in dt1 should look up to dt2 and if there is a match it should paste that entry to dt1$City column.
English is not my native and not that familiar with the R sorry if i couldnt express myself. Have a nice day.

Yarnabrina · June 3, 2020, 2:14pm

It's not mine either, so no worries.

I've tried to provide two solutions, see if those help or not:

library(data.table)

dt1 <- data.table(Neighbourhood = c("Harlem", "East Village", "Hell's Kitchen", "Capitol Hill", "Dupont Circle", "Harlem"),
                  Zipcode = c("10030", "11354", "10019", "98102", "20037", "10043"))
dt1
#>     Neighbourhood Zipcode
#> 1:         Harlem   10030
#> 2:   East Village   11354
#> 3: Hell's Kitchen   10019
#> 4:   Capitol Hill   98102
#> 5:  Dupont Circle   20037
#> 6:         Harlem   10043

dt2 <- data.table(`New York` = c("Harlem", "East Village", "Hell's Kitchen", "Polo Grounds", "Upper Eastside"),
                  `Washington DC` = c("Capitol Hill", "Dupont Circle", "Navy Yard", "Anacostia", "Foggy Bottom"))
dt2
#>          New York Washington DC
#> 1:         Harlem  Capitol Hill
#> 2:   East Village Dupont Circle
#> 3: Hell's Kitchen     Navy Yard
#> 4:   Polo Grounds     Anacostia
#> 5: Upper Eastside  Foggy Bottom

dt2_long <- melt(data = dt2,
                 measure.vars = names(x = dt2),
                 variable.name = "City",
                 value.name = "Neighbourhood")
dt2_long
#>              City  Neighbourhood
#>  1:      New York         Harlem
#>  2:      New York   East Village
#>  3:      New York Hell's Kitchen
#>  4:      New York   Polo Grounds
#>  5:      New York Upper Eastside
#>  6: Washington DC   Capitol Hill
#>  7: Washington DC  Dupont Circle
#>  8: Washington DC      Navy Yard
#>  9: Washington DC      Anacostia
#> 10: Washington DC   Foggy Bottom

dt3 <- dt2_long[dt1, on="Neighbourhood"]
dt3
#>             City  Neighbourhood Zipcode
#> 1:      New York         Harlem   10030
#> 2:      New York   East Village   11354
#> 3:      New York Hell's Kitchen   10019
#> 4: Washington DC   Capitol Hill   98102
#> 5: Washington DC  Dupont Circle   20037
#> 6:      New York         Harlem   10043

dt4 <- merge(x = dt1,
             y = dt2_long,
             by = "Neighbourhood")
dt4
#>     Neighbourhood Zipcode          City
#> 1:   Capitol Hill   98102 Washington DC
#> 2:  Dupont Circle   20037 Washington DC
#> 3:   East Village   11354      New York
#> 4:         Harlem   10030      New York
#> 5:         Harlem   10043      New York
#> 6: Hell's Kitchen   10019      New York

I'm not good in data.table, so I'm sure more efficient solutions exist for joining in place. May be some one will post and I'll get one more motivation to learn it thoroughly.

darkstormx1 · June 3, 2020, 2:34pm

Actually I knew about melt function but forgot it not working for a while because of this Corona. I'll try if it's work or not for me. Thank you.

martin.R · June 3, 2020, 3:17pm

Yarnabrina's dt3 and dt4 solutions are both fine.

For very large data sets (millions of rows) you could use setkey() on each data table first and then join via e.g. dt2_long[dt1] in this example.

It's worth stating though that dt2 is not a good way to store data as there is no relationship between the entries of city1 and city2.

darkstormx1 · June 3, 2020, 3:29pm

I applied this solution it worked for most part but there are some spelling mistakes in Neighbourhood column of dt1 do you have any suggestion for a partial match? I mean is it possible to match dt1'neigbourhood values partially with dt2'neigbourhood?

For example
East Village is correct spelling of the place R doesnt realize if it's written as, EastVillage or in some cases there are typo mistakes like: EastVilage missing a "L" letter. Maybe it can work if I can match 6-7 letters. But its a long shot I guess.

I'm looking stackoverflow etc. couldnt find any solutions, but probably requires grepl usage?

martin.R · June 3, 2020, 3:47pm

This is one solution:
https://cran.r-project.org/web/packages/fuzzyjoin/index.html

system · June 10, 2020, 3:47pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.