Filtering matching pattern (ignoring extra two digits)

library(tidyverse)
# toy data
df <- tibble(v = c(123456, 12045604, 45678, 89079, 20345, 2034512, 12345604))
df
#> # A tibble: 7 x 1
#>          v
#>      <dbl>
#> 1   123456
#> 2 12045604
#> 3    45678
#> 4    89079
#> 5    20345
#> 6  2034512
#> 7 12345604

How do I keep those observations for whom only the last two digits do not match? For example, Row 1 is 12345 and Row 7 is 12345604: everything matches except the last two digits. So, I want to keep them. Similary, row 5 and 6.

Here is my desired output:

#> # A tibble: 4 x 1
#>          v
#>      <dbl>
#> 1   123456
#> 2 12345604
#> 3    20345
#> 4  2034512

This works with your toy data. I hope it is not to specialized to work with you actual data. Note that the final values are characters but they could be changed to number with as.numeric().

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
# toy data
df <- tibble(v = c(123456, 12045604, 45678, 89079, 20345, 2034512, 12345604))
df
#> # A tibble: 7 x 1
#>          v
#>      <dbl>
#> 1   123456
#> 2 12045604
#> 3    45678
#> 4    89079
#> 5    20345
#> 6  2034512
#> 7 12345604
df <- df |> mutate(v = as.character(v))
NewDF <- df |> mutate(TRUNC = str_sub(v,1,-3))
df |> inner_join(NewDF, by = c(v = "TRUNC")) |> 
  pivot_longer(cols = everything()) |> 
  select(value)
#> # A tibble: 4 x 1
#>   value   
#>   <chr>   
#> 1 123456  
#> 2 12345604
#> 3 20345   
#> 4 2034512

Created on 2022-02-17 by the reprex package (v2.0.1)

1 Like

@FJCC Many thanks for solving my problem! Could please explain this part
inner_join(NewDF, by = c(v = "TRUNC"))? In particular, I would like to know what v = "TRUNC" is doing. Thanks again!

NewDF is derived from df. It has the same v column as df and a new column named TRUNC that has the values of v with the last two digits removed.
NewDF

v        TRUNC 
  <chr>    <chr> 
1 123456   1234  
2 12045604 120456
3 45678    456   
4 89079    890   
5 20345    203   
6 2034512  20345 
7 12345604 123456

The code

df |> inner_join(NewDF, by = c(v = "TRUNC"))

takes each row of df and looks for a row in NewDF where the v value of df matches the TRUNC value of NewDF. That condition is expressed as by = c(v = "TRUNC"). That is, do the matching by the condition that v = TRUNC. The result of that is this

df |> inner_join(NewDF, by = c(v = "TRUNC"))
# A tibble: 2 x 2
  v      v.y     
  <chr>  <chr>   
1 123456 12345604
2 20345  2034512 

Two possibly surprising things about that result are the column named v.y and the lack of a TRUNC column. The v.y column is just the v column from NewDF. The inner_join function avoids returning a data frame with identical column names by appending .y to the columns from the second data frame in the function call. The TRUNC column has disappeared because its values are identical to the v values of df. We know that because that was the matching condition used in the join. We could keep the TRUNC column by setting the keep argument in inner_join.

df |> inner_join(NewDF, by = c(v = "TRUNC"), keep = TRUE)
# A tibble: 2 x 3
  v.x    v.y      TRUNC 
  <chr>  <chr>    <chr> 
1 123456 12345604 123456
2 20345  2034512  20345 

You can see there that the v value from df (now named v.x) matches TRUNC. I didn't do that in the original code because TRUNC gets in the way of reshaping the data with pivot_longer().
The inner_join has the property that it only keeps rows where the matching condition is met. There are other kinds of joins. left_join, right_join, semi_join, etc., with different behavior. Joins are very useful for preparing data and are well worth studying.

1 Like

Thanks a lot for the detailed explanation. I understand it now :). Thank you again!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.