Measure of duration. Amount of time (# of columns) that elapses from first cohort to the start of next cohort

I have the following data frame with monthly observations of n individuals.

df <- structure(list(
  ID = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"	),
  Gender = c("M","M","F","M","F","F","F","M","F","F"),
  y_0101 = c(0,2,0,0,0,3,0,0,0,3),
  y_0102 = c(6,2,0,0,2,0,2,0,6,0),
  y_0103 = c(0,0,1,0,0,0,0,4,0,0),
  y_0104 = c(0,0,0,2,0,3,0,2,0,2),
  y_0105 = c(2,2,1,4,5,3,4,0,5,2),
  y_0106 = c(2,2,1,4,5,5,6,4,3,0),
  y_0107 = c(2,2,1,4,0,0,6,0,0,4),
  y_0108 = c(0,0,0,1,0,0,7,0,2,8),
  y_0109 = c(2,8,0,0,0,0,0,3,0,0),
  y_0110 = c(2,0,0,2,2,0,0,0,0,4),
  y_0111 = c(0,0,0,0,0,0,0,0,0,0),
  y_0112 = c(0,0,0,0,0,2,0,0,0,0)),
  class = "data.frame", row.names = c(NA, -10L))
df

I want for each row to create a measure that calculates the amount of time (# of columns) that elapses from the first cohort of zeros (could be another value like 'NA' or "sick"), to the start of the next cohort-period depending on some conditions.
This must be done within a specific interval of columns (y_0101:y_0112).

These sums of periods should apply, when two following two conditions holds:

First condition: First interval of 0's should be >= 2 before counted as starting point.

Second condition: next cohort of 0's must be >=2.

The sums of columns between cohorts will be added as a new column (name: "Diff").

Third condition: If more than one intervals, then this should be added as a new column (name: "Diff_2") and so on. If no more than one interval, then NA.

So, it ends up looking like the following:

ID Gender y_0101 y_0102 y_0103 y_0104 y_0105 y_0106 y_0107 y_0108 y_0109 y_0110 y_0111 y_0112 Diff Diff_2
A M 0 6 0 0 2 2 2 0 2 2 0 0 6 NA
B M 2 2 0 0 2 2 2 0 8 0 0 0 5 NA
C F 0 0 1 0 1 1 1 0 0 0 0 0 5 NA
D M 0 0 0 2 4 4 4 1 0 2 0 0 7 NA
E F 0 2 0 0 5 5 0 0 0 2 0 0 2 2
F F 3 0 0 3 3 5 0 0 0 0 0 2 3 NA
G F 0 2 0 0 4 6 6 7 0 0 0 0 4 NA
H M 0 0 4 2 0 4 0 0 3 0 0 0 4 1
I F 0 6 0 0 5 3 0 2 0 0 0 0 4 NA
J F 3 0 0 2 2 0 4 8 0 4 0 0 7 NA

This is a problem involving run-length encoding—what is the length of occurrence of a value in a vector?

suppressPackageStartupMessages({
  library(dplyr)
})

get_diff_dist <- function(x) make_runs(x)[2,3] - (make_runs(x)[1,3] + make_runs(x)[1,1])

get_dist <- function(x) make_runs(x)[2,3] - make_runs(x)[2,1] - make_runs(x)[1,3]

make_runs <- function(x) { tibble(lengths = rle(x)$lengths, values = rle(x)$values) %>% 
  mutate(indices = cumsum(lengths)) %>%
  filter(lengths > 1 & values == 0)
}

# avoid df, data and other names that are also names of functions;
# some operations will give precedence to the built-in resulting
# in treating it as a closure and throwing an error

DF <- data.frame(
  ID = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"   ),
  Gender = c("M","M","F","M","F","F","F","M","F","F"),
  y_0101 = c(0,2,0,0,0,3,0,0,0,3),
  y_0102 = c(6,2,0,0,2,0,2,0,6,0),
  y_0103 = c(0,0,1,0,0,0,0,4,0,0),
  y_0104 = c(0,0,0,2,0,3,0,2,0,2),
  y_0105 = c(2,2,1,4,5,3,4,0,5,2),
  y_0106 = c(2,2,1,4,5,5,6,4,3,0),
  y_0107 = c(2,2,1,4,0,0,6,0,0,4),
  y_0108 = c(0,0,0,1,0,0,7,0,2,8),
  y_0109 = c(2,8,0,0,0,0,0,3,0,0),
  y_0110 = c(2,0,0,2,2,0,0,0,0,4),
  y_0111 = c(0,0,0,0,0,0,0,0,0,0),
  y_0112 = c(0,0,0,0,0,2,0,0,0,0))

m <- DF %>% select(where(is.numeric)) %>% as.matrix()

DIFF <- list()
for (i in 1:10) DIFF[i] = get_dist(m[i,])
DIFF <- unlist(DIFF)

tibble::add_column(DF,DIFF)
#>    ID Gender y_0101 y_0102 y_0103 y_0104 y_0105 y_0106 y_0107 y_0108 y_0109
#> 1   A      M      0      6      0      0      2      2      2      0      2
#> 2   B      M      2      2      0      0      2      2      2      0      8
#> 3   C      F      0      0      1      0      1      1      1      0      0
#> 4   D      M      0      0      0      2      4      4      4      1      0
#> 5   E      F      0      2      0      0      5      5      0      0      0
#> 6   F      F      3      0      0      3      3      5      0      0      0
#> 7   G      F      0      2      0      0      4      6      6      7      0
#> 8   H      M      0      0      4      2      0      4      0      0      3
#> 9   I      F      0      6      0      0      5      3      0      2      0
#> 10  J      F      3      0      0      2      2      0      4      8      0
#>    y_0110 y_0111 y_0112 DIFF
#> 1       2      0      0    6
#> 2       0      0      0    5
#> 3       0      0      0    5
#> 4       2      0      0    7
#> 5       2      0      0    2
#> 6       0      0      2    3
#> 7       0      0      0    4
#> 8       0      0      0    4
#> 9       0      0      0    4
#> 10      4      0      0    7

OP <- c(6,5,5,7,2,3,4,4,4,7)

OP == DIFF
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

The last line compares the DIFF column in the image posted with the results of the preceding code, and shows no difference.

Every R problem can be thought of with advantage as the interaction of three objects— an existing object, x , a desired object,y , and a function, f, that will return a value of y given x as an argument. In other words, school algebra— f(x) = y. Any of the objects can be composites.

In this case, x is withinDF and y is OP. f is composite. Given OP, it is possible add it as a column to DF, but that is trivial, as shown in the code.

x has first to be extracted from DF, which is the office of

m <- DF %>% select(where(is.numeric)) %>% as.matrix()

m is of dimension 10, 12 and we will need some function that will treat each row separately, which is

for (i in 1:10) DIFF[i] = get_dist(m[i,])

DIFF here begins as an empty list and operates as a receiver object to accumulate the results of the application of the get_dist function.

get_dist is a composite function that extracts specific elements from the return value of make_runs, which returns an object in the form

> make_runs(m[1,])
# A tibble: 2 x 3
  lengths values indices
    <int>  <dbl>   <int>
1       2      0       4
2       2      0      12

through construction initially of a tibble containing return values of the rle function, which looks like

> m[1,] -> a
> attributes(a) <- NULL
> a
 [1] 0 6 0 0 2 2 2 0 2 2 0 0
> rle(a)
Run Length Encoding
  lengths: int [1:7] 1 1 2 3 1 2 2
  values : num [1:7] 0 6 0 2 0 2 0

lengths is how many times the corresponding values is repeated. From there,

mutate(indices = cumsum(lengths))

provides the index of the end of each run, and

filter(lengths > 1 & values == 0)

narrows the results to runs of 2 or more values of 0.

Returning to get_dist, it was necessary to examine the image infer that the quantity desired was the number of columns desired was exclusive of the ending column of the first cohort and the beginning column of the last. The logic to derived DIFF2 was not similarly apparent.

1 Like

Thank you technocrat. This was what I was looking for!

If I after this want to calculate the sum of columns that are included from the first point of the first cohort of zeros, to the last point of the second cohort of zeros, with the some conditions. (see person B),
(Condition 1): after the second cohort, there must be a course of at least three columns of " " in order to be counted, else NA.
(Condition 2): between cohorts there must not be more than two "2" and
(Condition 3): between cohorts, there must not be more than three "44"'s.

Like so

ID Gender y_0101 y_0102 y_0103 y_0104 y_0105 y_0106 y_0107 y_0108 y_0109 y_0110 y_0111 y_0112 Course
A M 0 6 0 0 2 2 3 0 0 7
B F 0 0 2 2 0 44 44 0 0 9

Does it make sense?

Yes. It's a matter of composition of functions to 1) identify distinct patterns of sequences 2) apply boolean operators and 3) using indices to identify the vector positions from the rle object

> rle(a)
Run Length Encoding
  lengths: int [1:7] 1 1 2 3 1 2 2
  values : num [1:7] 0 6 0 2 0 2 0

We already have the form of the function to perform boolean operations

filter(lengths > 1 & values == 0)

and similarly constructing indices

mutate(indices = cumsum(lengths))

as well as calculating "distances" between sequences

get_diff_dist <- function(x) make_runs(x)[2,3] - (make_runs(x)[1,3] + make_runs(x)[1,1])

Note that the indices so calculated reflect the end position indices—and the arithmetic performed with the row,column subsets in make_runs(x)[2,3] would need to be adjusted to come up with beginning positions.

What can make this kind of problem challenging is that it requires close analysis, which is something requiring patient effort. When we see an equation (and, after all, functions in R are just that), the natural reaction is for our eyes to roll to the back of our heads. Breaking the problem, like breaking the equation, into its atomic components is the only answer.

Thank you for the quick response.

I'm having trouble with understanding the meaning of make_runs(x), and how to adjust it in order to get the beginning position in get_diff_dist

make_runs constructs indices through cumsum, which means that the index to the row in which a value appears will always be its end.

For example in the first row of m (the numeric columns of DF) the first cohort of 0 appears at position 3 and ends at position 4. To obtain the beginning position requires only subtracting the length (2 in this case) from the index to obtain 2, which is where, indeed, the run begins.

It is in get_dist that the work of cutting out the start and end positions of groups is done through the arithmetic on subsets of make_runs

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.