Extract values falling within multiple ranges of numbers

Dallak · February 10, 2022, 5:52am

I would like to extract values that occur between two numbers. However, I have multiple ranges of numbers, and between() does not seem to handle more than one range at a time.

ex <- data.frame('id'= seq(1:26), 'day'= c(105:115, 1:12,28:30), 'letter' = LETTERS[1:26], s = rep(1:26, each = 3, len = 26) )

structure(list(id = 1:26, day = c(105L, 106L, 107L, 108L, 109L, 
110L, 111L, 112L, 113L, 114L, 115L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 28L, 29L, 30L), letter = c("A", "B", "C", 
"D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", 
"Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"), s = c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 
6L, 7L, 7L, 7L, 8L, 8L, 8L, 9L, 9L)), row.names = c(NA, 26L), class = "data.frame")

Specifically, I want to extract values based on id. For example, I am interested in the data that falls within 1:3, 7:9, 12:15, 17:19, etc.

The following ways work but on a manual basis, which is time-consuming as I have a big dataset.

1- filter(ex, between(id, 1,3))

2- vec <- c(1:3)
ex%>% filter(id%in% vec)

The values I'm interested in are already stored in two vectors, so working from vectors would be preferable, but not necessary. The vectors hold the start and end of each range, something like this (not sure if this is practical):
v1 <- c(1,7,12,17)
v2 <- c(3,9,15,19)

The output I'm looking for is similar to this.

   id day letter s
1   1 105      A 1
2   2 106      B 1
3   3 107      C 1
4   7 111      G 3
5   8 112      H 3
6   9 113      I 3
7  12   1      L 4
8  13   2      M 5
9  14   3      N 5
10 15   4      O 5
11 17   6      Q 6
12 18   7      R 6
13 19   8      S 7

Thank you in advance!

technocrat · February 10, 2022, 6:14am

Sometimes tidyverse makes things more complicated than they need to be.

ex <- data.frame('id'= seq(1:26), 'day'= c(105:115, 1:12,28:30), 'letter' = LETTERS[1:26], s = rep(1:26, each = 3, len = 26))

picks <- c(1:3, 7:9, 12:15, 17:19)
ex[picks,]
#>    id day letter s
#> 1   1 105      A 1
#> 2   2 106      B 1
#> 3   3 107      C 1
#> 7   7 111      G 3
#> 8   8 112      H 3
#> 9   9 113      I 3
#> 12 12   1      L 4
#> 13 13   2      M 5
#> 14 14   3      N 5
#> 15 15   4      O 5
#> 17 17   6      Q 6
#> 18 18   7      R 6
#> 19 19   8      S 7

# constructing picks from two vectors with starts and ends of sequences
start <- c(1,7,12,17)
ends <- c(3,9,15,19)

make_picks <- function(x) start[x]:ends[x]

picks <- unlist(sapply(1:4,make_picks))

ex[picks,]                             
#>    id day letter s
#> 1   1 105      A 1
#> 2   2 106      B 1
#> 3   3 107      C 1
#> 7   7 111      G 3
#> 8   8 112      H 3
#> 9   9 113      I 3
#> 12 12   1      L 4
#> 13 13   2      M 5
#> 14 14   3      N 5
#> 15 15   4      O 5
#> 17 17   6      Q 6
#> 18 18   7      R 6
#> 19 19   8      S 7
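
The hard-coded 1:4 can be avoided by pairing the two vectors directly. The following is a minimal sketch, not the approach used above: it assumes the start and end vectors have the same length and that each start is no greater than its end. The helper name expand_ranges is hypothetical, and matching on the id values with %in% (rather than on row positions) is a swapped-in choice so that the code still works when id is not the row number.

```r
ex <- data.frame(id = 1:26,
                 day = c(105:115, 1:12, 28:30),
                 letter = LETTERS[1:26],
                 s = rep(1:26, each = 3, length.out = 26))

start <- c(1, 7, 12, 17)
ends  <- c(3, 9, 15, 19)

# hypothetical helper: expand paired starts/ends into one vector of values
expand_ranges <- function(from, to) {
  stopifnot(length(from) == length(to))
  unlist(Map(seq, from, to))
}

picks <- expand_ranges(start, ends)

# match on the id values rather than on row positions
res_id <- ex[ex$id %in% picks, ]
```

Because the helper pairs the vectors element by element, adding a fifth range only means appending one value to each vector.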

Dallak · February 10, 2022, 7:39am

Thank you, @technocrat for your prompt reply.
This works perfectly on the 'id' column. What if I want to extend it to another column such as 'day', and extract days between 107:109, 115:4, and 10:28? All of these are stored in vectors:

start <- c(107,115,10)
ends <- c(109,4,28)

Thank you again

technocrat · February 10, 2022, 7:58am

It works differently from using id, which was identical to the row number, so it could be used directly to subset ex like

# first row (or id) of ex, all columns
ex[1,]

I introduce a more elaborate subset operation. First, find the rows in ex whose day is in picks, using %in% and the which() function. (picks is built the same way as before; the hard-coded 1:3 for the number of start:end pairs could be abstracted if that number were likely to vary.) The setdiff function is then applied to that result, together with the original ex.

ex <- data.frame('id'= seq(1:26), 'day'= c(105:115, 1:12,28:30), 'letter' = LETTERS[1:26], s = rep(1:26, each = 3, len = 26))
day_starts <- c(107,115,10)
# 4 comes before 115
#day_ends <- c(109,4,28)
day_ends <- c(109,116,28)

day_picks <- function(x) day_starts[x]:day_ends[x]
picks <- unlist(sapply(1:3,day_picks))
picks
#>  [1] 107 108 109 115 116  10  11  12  13  14  15  16  17  18  19  20  21  22  23
#> [20]  24  25  26  27  28
# no longer works
# ex[picks,]
# because in original post id was identical
# row number, but in this example day is not

# find the portion that should be excluded
# with `which` and use it to subset ex
# then "subtract" it from ex
setdiff(ex[which(ex[,"day"] %in% picks),],ex)
#>    id day letter s
#> 3   3 107      C 1
#> 4   4 108      D 2
#> 5   5 109      E 2
#> 11 11 115      K 4
#> 21 21  10      U 7
#> 22 22  11      V 8
#> 23 23  12      W 8
#> 24 24  28      X 8
#
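
The same rows can be selected without setdiff(). The following is a minimal base R sketch, offered as an alternative rather than as the method of the post. One observation that is an assumption on my part and not something stated in the thread: when dplyr is attached, setdiff() resolves to dplyr's version, which compares data frames row by row, so the call above would likely return zero rows because every selected row also occurs in ex.

```r
ex <- data.frame(id = 1:26,
                 day = c(105:115, 1:12, 28:30),
                 letter = LETTERS[1:26],
                 s = rep(1:26, each = 3, length.out = 26))

day_starts <- c(107, 115, 10)
day_ends   <- c(109, 116, 28)

# expand each start/end pair into the full run of day values
picks <- unlist(Map(seq, day_starts, day_ends))

# logical subsetting: keep the rows whose day occurs in picks
res_day <- ex[ex$day %in% picks, ]
```

Logical subsetting with %in% does not depend on which packages are loaded, so it should behave the same in a fresh session and in one with the tidyverse attached.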

Editorial comment:

The nested call above is a style of expression that most of us flee from. Perhaps it stirs painful memories of being corrected for punctuation errors in school, both in writing and in algebra. At some level of nesting, our eyeballs roll back and we zone out. Part of the success enjoyed by the tidyverse comes from relieving that anxiety.

It comes at a subtle cost, however, which is due to replacing the nested style of punctuation with bite-size "verbs". So, using {dplyr} we would find the days in picks with

ex %>% filter(day %in% picks) ...

The "cost" is that user focus shifts from what am I trying to do? to how do I do what I think I want to do?

The power of R as it is presented to the user is its functional orientation, the focus on what. f(x) = y. Given an object, x, and a desired object, y, what function (which may be composite, like f(g(x))) can transform x to y?

Unpacking the snippet reveals the question: Which are the elements of ex that should be excluded because they occur in picks? We see that %in% does the occurrence part, which() identifies the rows of ex affected and setdiff does the exclusion. Thinking of it that way not only makes the question aptly put, but using the subset operator [ ] only requires remembering a few rules: ex[1] refers to the first column, ex[1,] refers to the first row, and ex[1,1] refers to the entry in the first column of the first row. The 1 can be replaced with a range such as 1:3 (for rows or columns), with a vector such as c(1,4,5), or with negative indices, as in ex[-2,-4].
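
The subsetting rules listed above can be tried directly on ex. This is a small illustrative sketch; the variable names on the left are hypothetical and exist only so the results can be inspected.

```r
ex <- data.frame(id = 1:26,
                 day = c(105:115, 1:12, 28:30),
                 letter = LETTERS[1:26],
                 s = rep(1:26, each = 3, length.out = 26))

first_col   <- ex[1]             # first column, still a data frame
first_row   <- ex[1, ]           # first row, all columns
first_entry <- ex[1, 1]          # entry in row 1, column 1
row_range   <- ex[1:3, ]         # rows 1 to 3
row_vector  <- ex[c(1, 4, 5), ]  # rows 1, 4 and 5
negated     <- ex[-2, -4]        # drop row 2 and column 4
```

Note that ex[1] keeps the data frame structure, whereas ex[1, 1] returns a bare value.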

That lightens the programmer's syntax burden. Even after 15 years, I still mix up filter() and select() in {dplyr}. But I never have trouble with subsetting in {base}.

Dallak · February 10, 2022, 8:11am

Thank you, @technocrat!

I got this output:

[1] id day letter s
<0 rows> (or 0-length row.names)

technocrat · February 10, 2022, 8:25am

Cutting and pasting the reprex is all but guaranteed to return identical results. So, either something was introduced by hand that is different or the data objects, including the data frame and two vectors differ.

What were you using that produced the zero row result?

nirgrahamuk · February 10, 2022, 8:37am

ex <- data.frame('id'= seq(1:26),
                 'day'= c(105:115, 1:12,28:30),
                 'letter' = LETTERS[1:26],
                 s = rep(1:26, each = 3, len = 26) )

id1 <- c(1,7,12,17)
id2 <- c(3,9,15,19)

day1 <- c(107,115,10)
day2 <- c(109,4,28)




library(tidyverse)
(id_entries <- map2(id1,id2,~.x:.y) %>% unlist() %>% unique)
(day_entries <- map2(day1,day2,~.x:.y) %>% unlist()%>% unique)

ex %>% filter(id %in% id_entries)
ex %>% filter(day %in% day_entries)
#both
ex %>% filter(id %in% id_entries,
              day %in% day_entries)
#either
ex %>% filter(id %in% id_entries | 
              day %in% day_entries)
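
One caveat with the pair 115 and 4: in R, 115:4 is a descending sequence from 115 down to 4, so day_entries ends up covering every day from 4 to 115. If the intent is instead to follow the row order of ex (from the row where day is 115 through the row where day is 4), a possible base R sketch follows. It assumes each boundary value occurs exactly once in day and that the start row comes before the end row; neither assumption is confirmed in the thread.

```r
ex <- data.frame(id = 1:26,
                 day = c(105:115, 1:12, 28:30),
                 letter = LETTERS[1:26],
                 s = rep(1:26, each = 3, length.out = 26))

day1 <- c(107, 115, 10)
day2 <- c(109, 4, 28)

# 115:4 counts down from 115 to 4; it is not a wrap-around range
n_descending <- length(115:4)

# translate each boundary value into the row position where it occurs
from_row <- match(day1, ex$day)
to_row   <- match(day2, ex$day)

# expand each pair of row positions and subset by position
rows <- unlist(Map(seq, from_row, to_row))
res_wrap <- ex[rows, ]
```

With this approach the second range selects days 115, 1, 2, 3 and 4, in the order they appear in the data.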

Dallak · February 11, 2022, 3:32am

Thank you both for your time and help, @nirgrahamuk's solution works perfectly.
@technocrat, I copied and pasted the reprex as is but am still getting the same result. Not sure why, but don't worry, as @nirgrahamuk's answer works.

Thank you again both!
