Join two tables by index range but table length increased

I'm trying to follow this answer that join two tables with range: https://stackoverflow.com/a/46341899/6636572

I want to join two tables where one has some ranges and another are numbers and I want to annotate the numbers by matching which range those numbers are in from another dataframe.

However the length of the table surprisingly increased. I expect it to stay the same as 2832. What's happening? How to trouble shoot this? The result doesn't change if I use fuzzy_left_join

> head(gene_list_selected)
  chr start_pos  end_pos  gene_name
1   1  55013806 55100417     ACOT11
2   1  55074849 55089200    FAM151A
3   1  55107412 55175940      MROH7
4   1  55107412 55208328 MROH7-TTC4
5   1  55181494 55208328       TTC4
6   1  55222570 55230226      PARS2
> head(df)
            rsid      pos
1 1:55013860:C:T 55013860
2 1:55013957:G:A 55013957
3 1:55014013:C:T 55014013
4 1:55014095:C:T 55014095
5 1:55014099:C:T 55014099
6 1:55014100:G:A 55014100
> nrow(gene_list_selected)
[1] 21
> nrow(df)
[1] 2832
> df_with_gene_name<-df %>% 
+     fuzzy_inner_join(gene_list_selected, by = c("pos"="start_pos","pos"="end_pos"), match_fun = list(`>=`, `<=`)) 
> 
> nrow(df_with_gene_name)
[1] 3298

I want the end result to keep rsid, pos, from, start while having additional columns of one-hot coded binary table of gene_name.

> result
rsid, chr, pos, from, start, ACOT11, FAM151A, MROH7, MROH7-TTC4, TTC4, PARS2
1:55013860:C:T 1 55013860 55013806 55100417 1 0 0 0 0 0

Looking at the start_pos and end_pos of ACOT11, it runs from 5501xxxx to 5510xxxx. FAM151A covers 5507xxx to 5508xxxx. So something that matches FAM151A will also match ACOT11. Or am I looking at that wrong?

1 Like

It's right. I added a result that I'd like to get. There will be overlaps.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.