Need an algorithm to take data from 5 columns in a specific order to create a new variable

I know, confusing title, not quite sure how to better articulate it so here is a specific example, I am asking for the best code to use, so if someone could point me in the right direction that would be awesome. Even the right package would be appreciated.

I have five variables, they are yes/no and "yes" means the patient had a specific class of drug (a headache drug) and "no" means they did not get that class of drug. I am interested in if people required a second dose of a headache drug. The five columns answer whether or not they got a headache drug, and they are in sequential order, so if the first column is yes and the second column is yes, that patient got a second medication after the first medication and that second medication was a headache drug.

So, I need to write a code that would go from column 1 through column 5, in order, and if there is at any point a yes followed by a yes, the new variable (let's call it rescue) would be "yes". If there was never a "yes" followed by a "yes", then the new column would be "no".

1 Like

Do you want the rescue column to be yes if there was more than one dose in the row, as I have done in the following code, or do you want rescue to be yes only if the two yes columns are adjacent?
I put in several unnecessary print statements to make the logic of the calculation clearer.

library(tidyr)
library(dplyr)

DF <- data.frame(Patient = 1:3, 
                 dose1 = c("yes", "yes", "no"),
                 dose2 = c("yes", "no", "yes"),
                 dose3 = c("no", "yes", "no"),
                 dose4 = c("no", "yes", "no"),
                 dose5 = c("no", "no", "no"), stringsAsFactors = FALSE)
DF
#>   Patient dose1 dose2 dose3 dose4 dose5
#> 1       1   yes   yes    no    no    no
#> 2       2   yes    no   yes   yes    no
#> 3       3    no   yes    no    no    no
RescueDF <- DF %>% gather(key = Dose, value = Value, dose1:dose5) %>% 
  group_by(Patient) %>% 
  summarize(CountDoses = sum(Value == "yes"))
RescueDF
#> # A tibble: 3 x 2
#>   Patient CountDoses
#>     <int>      <int>
#> 1       1          2
#> 2       2          3
#> 3       3          1
RescueDF <- RescueDF %>% mutate(rescue = ifelse(CountDoses > 1, "yes", "no"))
RescueDF
#> # A tibble: 3 x 3
#>   Patient CountDoses rescue
#>     <int>      <int> <chr> 
#> 1       1          2 yes   
#> 2       2          3 yes   
#> 3       3          1 no
#If you want to attach the result to the original data
DF <- inner_join(DF, RescueDF, by = "Patient")
DF
#>   Patient dose1 dose2 dose3 dose4 dose5 CountDoses rescue
#> 1       1   yes   yes    no    no    no          2    yes
#> 2       2   yes    no   yes   yes    no          3    yes
#> 3       3    no   yes    no    no    no          1     no

Created on 2019-12-25 by the reprex package (v0.2.1)

2 Likes

@FJCC It needs to be yes if a "yes" is at any point followed by a "yes", does not need to be adjacent. Could be in any column between column 1 and column 5. So they can be anywhere from adjacent to 5 columns apart.

1 Like

If my answer above works for you, please mark it as the solution.

When in doubt, it always helps to start with some data and torture it into a usable form to be able to create a reproducible example, called a reprex to illuminate the problem.

There's no better place to start than mtcars. Granted it deals with automobiles, but you can use it to create a dummy data set for your problem:

suppressPackageStartupMessages(library(dplyr))
exemplar <- mtcars %>% select(1:5)
rownames(exemplar) <- seq(1:nrow(exemplar))
exemplar <- exemplar %>% 
  mutate(mpg = ifelse(mpg < 20, 1, 0)) %>% 
  mutate(cyl = ifelse(cyl < 6, 1, 0)) %>% 
  mutate(disp = ifelse(disp < 200, 1, 0)) %>% 
  mutate(hp = ifelse(hp < 150, 1, 0)) %>% 
  mutate(drat = ifelse(drat < 3, 1, 0))
colnames(exemplar) <- c("Drug1","Drug2","Drug3","Drug4","Drug5")
exemplar
#>    Drug1 Drug2 Drug3 Drug4 Drug5
#> 1      0     0     1     1     0
#> 2      0     0     1     1     0
#> 3      0     1     1     1     0
#> 4      0     0     0     1     0
#> 5      1     0     0     0     0
#> 6      1     0     0     1     1
#> 7      1     0     0     0     0
#> 8      0     1     1     1     0
#> 9      0     1     1     1     0
#> 10     1     0     1     1     0
#> 11     1     0     1     1     0
#> 12     1     0     0     0     0
#> 13     1     0     0     0     0
#> 14     1     0     0     0     0
#> 15     1     0     0     0     1
#> 16     1     0     0     0     0
#> 17     1     0     0     0     0
#> 18     0     1     1     1     0
#> 19     0     1     1     1     0
#> 20     0     1     1     1     0
#> 21     0     1     1     1     0
#> 22     1     0     0     0     1
#> 23     1     0     0     0     0
#> 24     1     0     0     0     0
#> 25     1     0     0     0     0
#> 26     0     1     1     1     0
#> 27     0     1     1     1     0
#> 28     0     1     1     1     0
#> 29     1     0     0     0     0
#> 30     1     0     1     0     0
#> 31     1     0     0     0     0
#> 32     0     1     1     1     0

Created on 2019-12-25 by the reprex package (v0.3.0)

I've chosen 1/0 in place of yes/no in @FJCC's example, because there are advantages for when you need to use linear algebra, which isn't as scary as it might sound.

Given the faked exemplar data set ,

rescue <- exemplar %>% mutate(rescue = ifelse(
                        Drug1 == 1 & Drug2 == 1 |
                        Drug2 == 1 & Drug3 == 1 |
                        Drug3 == 1 & Drug4 == 1 |
                        Drug4 == 1 & Drug5 == 1,
                        1,0))
rescue
#>    Drug1 Drug2 Drug3 Drug4 Drug5 rescue
#> 1      0     0     1     1     0      1
#> 2      0     0     1     1     0      1
#> 3      0     1     1     1     0      1
#> 4      0     0     0     1     0      0
#> 5      1     0     0     0     0      0
#> 6      1     0     0     1     1      1
#> 7      1     0     0     0     0      0
#> 8      0     1     1     1     0      1
#> 9      0     1     1     1     0      1
#> 10     1     0     1     1     0      1
#> 11     1     0     1     1     0      1
#> 12     1     0     0     0     0      0
#> 13     1     0     0     0     0      0
#> 14     1     0     0     0     0      0
#> 15     1     0     0     0     1      0
#> 16     1     0     0     0     0      0
#> 17     1     0     0     0     0      0
#> 18     0     1     1     1     0      1
#> 19     0     1     1     1     0      1
#> 20     0     1     1     1     0      1
#> 21     0     1     1     1     0      1
#> 22     1     0     0     0     1      0
#> 23     1     0     0     0     0      0
#> 24     1     0     0     0     0      0
#> 25     1     0     0     0     0      0
#> 26     0     1     1     1     0      1
#> 27     0     1     1     1     0      1
#> 28     0     1     1     1     0      1
#> 29     1     0     0     0     0      0
#> 30     1     0     1     0     0      0
#> 31     1     0     0     0     0      0
#> 32     0     1     1     1     0      1

Created on 2019-12-25 by the reprex package (v0.3.0)

ifelse is a logical test in a series of | OR operators that looks at successive pairs of the Drug_i columns to see if there is a 1 followed by a 1 in any of the pairs: 1:2, 2:3, 3:4, or 4:5.

Which variable should I pick to replace Patient?

I saw @drchinn's non-adjacency condition, which simplifies matters considerably.

suppressPackageStartupMessages(library(dplyr))
exemplar <- mtcars %>% select(1:5)
rownames(exemplar) <- seq(1:nrow(exemplar))
exemplar <- exemplar %>% 
  mutate(mpg = ifelse(mpg < 20, 1, 0)) %>% 
  mutate(cyl = ifelse(cyl < 6, 1, 0)) %>% 
  mutate(disp = ifelse(disp < 200, 1, 0)) %>% 
  mutate(hp = ifelse(hp < 150, 1, 0)) %>% 
  mutate(drat = ifelse(drat < 3, 1, 0))
colnames(exemplar) <- c("Drug1","Drug2","Drug3","Drug4","Drug5")
m <- as.matrix(exemplar)
rescues <- rowSums(m) > 1
exemplar <- cbind(exemplar, rescues*1)
colnames(exemplar) <- c("Drug1","Drug2","Drug3","Drug4","Drug5", "rescue")
exemplar
#>    Drug1 Drug2 Drug3 Drug4 Drug5 rescue
#> 1      0     0     1     1     0      1
#> 2      0     0     1     1     0      1
#> 3      0     1     1     1     0      1
#> 4      0     0     0     1     0      0
#> 5      1     0     0     0     0      0
#> 6      1     0     0     1     1      1
#> 7      1     0     0     0     0      0
#> 8      0     1     1     1     0      1
#> 9      0     1     1     1     0      1
#> 10     1     0     1     1     0      1
#> 11     1     0     1     1     0      1
#> 12     1     0     0     0     0      0
#> 13     1     0     0     0     0      0
#> 14     1     0     0     0     0      0
#> 15     1     0     0     0     1      1
#> 16     1     0     0     0     0      0
#> 17     1     0     0     0     0      0
#> 18     0     1     1     1     0      1
#> 19     0     1     1     1     0      1
#> 20     0     1     1     1     0      1
#> 21     0     1     1     1     0      1
#> 22     1     0     0     0     1      1
#> 23     1     0     0     0     0      0
#> 24     1     0     0     0     0      0
#> 25     1     0     0     0     0      0
#> 26     0     1     1     1     0      1
#> 27     0     1     1     1     0      1
#> 28     0     1     1     1     0      1
#> 29     1     0     0     0     0      0
#> 30     1     0     1     0     0      1
#> 31     1     0     0     0     0      0
#> 32     0     1     1     1     0      1

Created on 2019-12-25 by the reprex package (v0.3.0)

Recall my remark about linear algebra. Row-by-row operations in tidy settings are clunky; in matrix form they are fast. Since your test was simply are there two yes's (1s) in this patient record we can just take the sum of each row of a matrix derived from exemplar and test if it adds up to 2 or more. Then bind back to exemplar and rename the columns. This is necessary because the test of the matrix sum being greater than one returns a logical TRUE/FALSE vector.

1 Like

I would have solved this along the same lines as @technocrat, however, I would probably skip converting the data frame to a matrix. rowSums can do that by itself as long as it knows which columns you are working with.

Starting with @technocrat's exemplar data frame

drug_col <- c("Drug1","Drug2","Drug3","Drug4","Drug5")
exemplar$rescue <- rowSums(exemplar[drug_col]) > 1

If your data frame has character columns (yes/no instead of 1/0), this can be modified to use apply.

drug_col <- c("Drug1","Drug2","Drug3","Drug4","Drug5")
exemplar$rescue <- apply(exemplar[drug_col], 
                         MARGIN = 1, 
                         function(x) sum(x %in% "yes") > 1)
3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.