How to choose column IDs based on another dataframe in R

Ellenz · August 14, 2019, 4:33am

Hi all,

I have a question about data manipulation.
There is a dataframe which has coordinates (latitude, longitude).

df.d = data.frame(V1=c(35.25,35.25,35.25,35.25,35.25,35.25,35.25,35.25,35.75,35.75, 35.75,35.75,35.75,35.75,35.75,35.75), V2=c(120.25,120.75,121.25,121.75,122.25,122.75,123.25,123.75,120.25,120.75,121.25,121.75,122.25,122.75,123.25,123.75))

Another dataframe df.all has more columns, where the first column is the ID column (3 columns: ID, latitude, longitude). I want to check if the rows with two column values in df.all exist in df.d, and get the ID column from df.all if so. If one row does not exist, it moves on to the next row to get the ID. How to do this? My tentative code is below. Thanks for your help.
Btw: df.all has a larger dimension than df.d, but I showed here as a small sample.

df.all = data.frame(V1=c(10,11,12,13,14,15,16,17,18,19,20),
V2=c(35.25,35.25,35.25,35.25,35.25,35.25,35.75,35.75,35.75,35.75,35.75),
V3=c(119.75,120.25,121.25,122.25,123.25,123.75,120.75,121.25,121.75,122.75,123.75))

ids = c()
for(i in 1:length(df.d[,1])){
ifelse(sum(df.all[,2]==df.d[i,1]&df.all[,3]==df.d[i,2])>0,
ids[i] = filter(df.all,df.all[,2]==df.d[i,1]&df.all[,3]==df.d[i,2])[,2],i+1)
}

joels · August 14, 2019, 5:47am

It looks like a join will give you the matching IDs. Below are some examples. I've changed the column names so that they're more intuitive and correspond between the two data frames. I've also added an extra column to from.df.d to make it easier to see which rows have matches (or not) across data frames.

library(tidyverse)

df.d = data.frame(V1=c(35.25,35.25,35.25,35.25,35.25,35.25,35.25,35.25,35.75,35.75, 35.75,35.75,35.75,35.75,35.75,35.75), V2=c(120.25,120.75,121.25,121.75,122.25,122.75,123.25,123.75,120.25,120.75,121.25,121.75,122.25,122.75,123.25,123.75))

df.all = data.frame(V1=c(10,11,12,13,14,15,16,17,18,19,20),
                    V2=c(35.25,35.25,35.25,35.25,35.25,35.25,35.75,35.75,35.75,35.75,35.75),
                    V3=c(119.75,120.25,121.25,122.25,123.25,123.75,120.75,121.25,121.75,122.75,123.75))

df.d = df.d %>% rename(lat=V1, lon=V2) %>% mutate(from.df.d = 1)
df.all = df.all %>% rename(ID=V1, lat=V2, lon=V3)

# Return only rows that match
inner_join(df.all, df.d, by=c("lat", "lon"))
#>    ID   lat    lon from.df.d
#> 1  11 35.25 120.25         1
#> 2  12 35.25 121.25         1
#> 3  13 35.25 122.25         1
#> 4  14 35.25 123.25         1
#> 5  15 35.25 123.75         1
#> 6  16 35.75 120.75         1
#> 7  17 35.75 121.25         1
#> 8  18 35.75 121.75         1
#> 9  19 35.75 122.75         1
#> 10 20 35.75 123.75         1

# Get only the IDs from rows that match
inner_join(df.all, df.d, by=c("lat", "lon")) %>% 
  pull(ID)
#>  [1] 11 12 13 14 15 16 17 18 19 20

# Return all rows from both data frames
full_join(df.all, df.d, by=c("lat", "lon"))
#>    ID   lat    lon from.df.d
#> 1  10 35.25 119.75        NA
#> 2  11 35.25 120.25         1
#> 3  12 35.25 121.25         1
#> 4  13 35.25 122.25         1
#> 5  14 35.25 123.25         1
#> 6  15 35.25 123.75         1
#> 7  16 35.75 120.75         1
#> 8  17 35.75 121.25         1
#> 9  18 35.75 121.75         1
#> 10 19 35.75 122.75         1
#> 11 20 35.75 123.75         1
#> 12 NA 35.25 120.75         1
#> 13 NA 35.25 121.75         1
#> 14 NA 35.25 122.75         1
#> 15 NA 35.75 120.25         1
#> 16 NA 35.75 122.25         1
#> 17 NA 35.75 123.25         1

# Return all rows from df.all and matching rows from df.d
left_join(df.all, df.d, by=c("lat", "lon"))
#>    ID   lat    lon from.df.d
#> 1  10 35.25 119.75        NA
#> 2  11 35.25 120.25         1
#> 3  12 35.25 121.25         1
#> 4  13 35.25 122.25         1
#> 5  14 35.25 123.25         1
#> 6  15 35.25 123.75         1
#> 7  16 35.75 120.75         1
#> 8  17 35.75 121.25         1
#> 9  18 35.75 121.75         1
#> 10 19 35.75 122.75         1
#> 11 20 35.75 123.75         1

# Return all rows from df.d and matching rows from df.all
right_join(df.all, df.d, by=c("lat", "lon"))
#>    ID   lat    lon from.df.d
#> 1  11 35.25 120.25         1
#> 2  NA 35.25 120.75         1
#> 3  12 35.25 121.25         1
#> 4  NA 35.25 121.75         1
#> 5  13 35.25 122.25         1
#> 6  NA 35.25 122.75         1
#> 7  14 35.25 123.25         1
#> 8  15 35.25 123.75         1
#> 9  NA 35.75 120.25         1
#> 10 16 35.75 120.75         1
#> 11 17 35.75 121.25         1
#> 12 18 35.75 121.75         1
#> 13 NA 35.75 122.25         1
#> 14 19 35.75 122.75         1
#> 15 NA 35.75 123.25         1
#> 16 20 35.75 123.75         1

^{Created on 2019-08-13 by the reprex package (v0.3.0)}

In your example data, each row has no more than one matching row in the other data frame. Bear in mind, however, that if a row in one data frame matches multiple rows in another data frame, all combinations of matches are returned. For example:

d1 = data.frame(x=c(1,1,2,10), y=11:14)
d2 = data.frame(x=c(1,1,1,2,2,4,5,6), z=21:28)

left_join(d1, d2, by="x")
#>    x  y  z
#> 1  1 11 21
#> 2  1 11 22
#> 3  1 11 23
#> 4  1 12 21
#> 5  1 12 22
#> 6  1 12 23
#> 7  2 13 24
#> 8  2 13 25
#> 9 10 14 NA

For future reference, there are some shortcuts you can use for creating data vectors. For example:

c(10,11,12,13,14,15,16,17,18,19,20)
10:20

c(35.25,35.25,35.25,35.25,35.25,35.25,35.25,35.25,35.75,35.75, 35.75,35.75,35.75,35.75,35.75,35.75)
rep(c(35.25, 35.75), each=8)

If you want different numbers of repetitions of the two numbers, you can do:

rep(c(35.25, 35.75), c(2,4))

Ellenz · August 14, 2019, 6:38am

Thanks for your explicit answers.

system · August 21, 2019, 6:38am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.