combn() function to generate combinations

Hi,

I'm trying to generate every combination of 3 conditions out of 8, columns G to N (i.e. 8!/(3!(8-3)!) = 56 combinations). The plan is to generate all combinations with the R combn() function, loop over each combination, then calculate the prevalence of each by age_group. Is this possible, and in what ways does conditions_df need to be changed to achieve this?

https://docs.google.com/spreadsheets/d/1LWpVqR2yHLQeZH6pHVbrCl-rHIw6PwSGgahDVzmkzEs/edit#gid=0

Thanks

Google Sheets are generally a poor way to share content on the forum.
Please review the recommended ways:

Hi, thanks for the advice. Is this what is meant by a reprex?

Thanks

conditions_df <- tibble::tribble(
~ID, ~Age, ~Sex, ~CarsQuintile, ~age_group, ~CarsQuintile_group, ~Diabetes, ~Asthma, ~Stroke, ~Heart.attack, ~COPD, ~Hypertension, ~Eczema, ~Depression,
1L, 18L, 1L, 2L, "18 - 24", 3L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L,
2L, 77L, 1L, 1L, "65 - 74", 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L,
3L, 25L, 1L, 3L, "25 - 34", 4L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
4L, 30L, 1L, 1L, "25 - 34", 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
5L, 54L, 1L, 1L, "55 - 64", 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L,
6L, 78L, 1L, 5L, "75 - 84", 5L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L,
7L, 69L, 1L, 1L, "65 - 74", 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L,
8L, 62L, 1L, 1L, "55 - 64", 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
9L, 68L, 1L, 5L, "55 - 64", 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L,
10L, 63L, 1L, 1L, "55 - 64", 3L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L
)

How to get the combinations depends on what is to be done with them next. The simplest approach:

dat <- data.frame(
  ID =
    c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Age =
    c(18, 77, 25, 30, 54, 78, 69, 62, 68, 63),
  Sex =
    c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  CarsQuintie =
    c(2, 1, 3, 1, 1, 5, 1, 1, 5, 1),
  age_group =
    c("18 - 24", "65 - 74", "25 - 34", "25 - 34", "55 - 64", "75 - 84", "65 - 74", "55 - 64", "55 - 64", "55 - 64"),
  CarsQuintie_group =
    c(3, 1, 4, 3, 1, 5, 1, 2, 1, 3),
  Diabetes =
    c(1, 0, 0, 0, 0, 1, 1, 0, 1, 1),
  Asthma =
    c(1, 1, 0, 0, 0, 1, 1, 0, 1, 0),
  Stroke =
    c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  Heart.attack =
    c(1, 1, 0, 0, 0, 1, 1, 0, 1, 1),
  COPD =
    c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  Hypertension =
    c(0, 0, 1, 0, 1, 0, 1, 0, 0, 0),
  Eczema =
    c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0),
  Depression =
    c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0))

# Sex and COPD are constants
dat <- dat[,-c(3,11)]
# cross-tabulate all remaining columns (the result is very large and mostly zeros)
table(dat)
#> , , CarsQuintie = 1, age_group = 18 - 24, CarsQuintie_group = 1, Diabetes = 0, Asthma = 0, Stroke = 0, Heart.attack = 0, Hypertension = 0, Eczema = 0, Depression = 0
#> 
#>     Age
#> ID   18 25 30 54 62 63 68 69 77 78
#>   1   0  0  0  0  0  0  0  0  0  0
#>   2   0  0  0  0  0  0  0  0  0  0
#>   3   0  0  0  0  0  0  0  0  0  0
#>   4   0  0  0  0  0  0  0  0  0  0
#>   5   0  0  0  0  0  0  0  0  0  0
#>   6   0  0  0  0  0  0  0  0  0  0
#>   7   0  0  0  0  0  0  0  0  0  0
#>   8   0  0  0  0  0  0  0  0  0  0
#>   9   0  0  0  0  0  0  0  0  0  0
#>   10  0  0  0  0  0  0  0  0  0  0
#> 
#> , , CarsQuintie = 2, age_group = 18 - 24, CarsQuintie_group = 1, Diabetes = 0, Asthma = 0, Stroke = 0, Heart.attack = 0, Hypertension = 0, Eczema = 0, Depression = 0
#> 
#>     Age
#> ID   18 25 30 54 62 63 68 69 77 78
#>   1   0  0  0  0  0  0  0  0  0  0
#>   2   0  0  0  0  0  0  0  0  0  0
#>   3   0  0  0  0  0  0  0  0  0  0
#>   4   0  0  0  0  0  0  0  0  0  0
#>   5   0  0  0  0  0  0  0  0  0  0
#>   6   0  0  0  0  0  0  0  0  0  0
#>   7   0  0  0  0  0  0  0  0  0  0
#>   8   0  0  0  0  0  0  0  0  0  0
#>   9   0  0  0  0  0  0  0  0  0  0
#>   10  0  0  0  0  0  0  0  0  0  0
#> 
#> , , CarsQuintie = 3, age_group = 18 - 24, CarsQuintie_group = 1, Diabetes = 0, Asthma = 0, Stroke = 0, Heart.attack = 0, Hypertension = 0, Eczema = 0, Depression = 0
#> 
#>     Age
#> ID   18 25 30 54 62 63 68 69 77 78
#>   1   0  0  0  0  0  0  0  0  0  0
#>   2   0  0  0  0  0  0  0  0  0  0
#>   3   0  0  0  0  0  0  0  0  0  0
#>   4   0  0  0  0  0  0  0  0  0  0
#>   5   0  0  0  0  0  0  0  0  0  0
#>   6   0  0  0  0  0  0  0  0  0  0
#>   7   0  0  0  0  0  0  0  0  0  0
#>   8   0  0  0  0  0  0  0  0  0  0
#>   9   0  0  0  0  0  0  0  0  0  0
#>   10  0  0  0  0  0  0  0  0  0  0
#> 
#> , , CarsQuintie = 5, age_group = 18 - 24, CarsQuintie_group = 1, Diabetes = 0, Asthma = 0, Stroke = 0, Heart.attack = 0, Hypertension = 0, Eczema = 0, Depression = 0
#> 
#>     Age
#> ID   18 25 30 54 62 63 68 69 77 78
#>   1   0  0  0  0  0  0  0  0  0  0
#>   2   0  0  0  0  0  0  0  0  0  0
#>   3   0  0  0  0  0  0  0  0  0  0
#>   4   0  0  0  0  0  0  0  0  0  0
#>   5   0  0  0  0  0  0  0  0  0  0
#>   6   0  0  0  0  0  0  0  0  0  0
#>   7   0  0  0  0  0  0  0  0  0  0
#>   8   0  0  0  0  0  0  0  0  0  0
#>   9   0  0  0  0  0  0  0  0  0  0
#>   10  0  0  0  0  0  0  0  0  0  0
#> ... MUCH more

Thanks, how did you make this transformation?

Cut and pasted to text file, did some light editing to convert to csv, imported with readr::read_csv, cut and pasted the output of dput on the imported object, changed the first function from structure to data.frame and deleted all the trailing metadata.
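
A rough sketch of that workflow, in case it helps. The three-row CSV text is a made-up excerpt, and base read.csv() stands in for readr::read_csv() so that nothing needs installing:

```r
# Hypothetical three-row excerpt of the sheet, saved as CSV text
csv_text <- "ID,Age,age_group,Diabetes,Asthma
1,18,18 - 24,1,1
2,77,65 - 74,0,1
3,25,25 - 34,0,0"

# read.csv() is used here in place of readr::read_csv()
small <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# dput() prints R code that recreates the object; that text is what gets
# pasted into a forum post
dput(small)

# Round trip: evaluating the dput() text should give back the same data
dput_text <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = dput_text))
all.equal(small, rebuilt)
```

The dput() output starts with structure(list(...)); trimming it down to a plain data.frame() call is optional tidying.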

Thanks very much.

Each row is a person (the actual df has 1.3 million) where 1/0 tells us whether the person has or doesn't have a single condition. I want to examine measures of "multimorbidity", which is where a person has multiple conditions (a binary outcome; for example, someone who has asthma, stroke, and COPD meets the criterion if the definition is having any 3 conditions).

There are different ways of defining multimorbidity, so it is useful to see how prevalence changes depending on how many diseases are required to meet the definition. If we define multimorbidity as having 3 or more conditions out of the 8 (columns 7:14), I want to calculate the prevalence of people who meet this criterion for all the different combinations, for all people within each age_group bin (then plot the distribution as a mean and SD), and then repeat with a definition of 4 or more conditions. Is that possible?

Thanks, I'll try and share in this format from now on, appreciate the time you spent on this

To make it simpler, I have removed variables not needed at this stage.

Thanks

Every R problem can be thought of with advantage as the interaction of three objects: an existing object, x, a desired object, y, and a function, f, that will return a value of y given x as an argument. In other words, school algebra: f(x) = y. Any of the objects can be composites.

The object x is the data frame; it's composite, consisting of vectors (columns). y is an indicator of comorbidity, which exists if three or more of the condition columns in a row have a value of 1, that is, if the row sum over those columns is 3 or greater. A composite function f will get us there. Its parts are the subset operator [, the which function, which returns the positions where a logical test is TRUE, and the rowSums function, which adds matrix elements by row. (Just make sure the columns are all numeric.)

dat <- data.frame(
  ID =
    c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Age =
    c(18, 77, 25, 30, 54, 78, 69, 62, 68, 63),
  Sex =
    c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  CarsQuintie =
    c(2, 1, 3, 1, 1, 5, 1, 1, 5, 1),
  age_group =
    c("18 - 24", "65 - 74", "25 - 34", "25 - 34", "55 - 64", "75 - 84", "65 - 74", "55 - 64", "55 - 64", "55 - 64"),
  CarsQuintie_group =
    c(3, 1, 4, 3, 1, 5, 1, 2, 1, 3),
  Diabetes =
    c(1, 0, 0, 0, 0, 1, 1, 0, 1, 1),
  Asthma =
    c(1, 1, 0, 0, 0, 1, 1, 0, 1, 0),
  Stroke =
    c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  Heart.attack =
    c(1, 1, 0, 0, 0, 1, 1, 0, 1, 1),
  COPD =
    c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  Hypertension =
    c(0, 0, 1, 0, 1, 0, 1, 0, 0, 0),
  Eczema =
    c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0),
  Depression =
    c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0))

# Sex and COPD are constants
dat <- dat[,-c(3,11)]

# create a placeholder variable for result
dat$comorbid <- FALSE

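# Exclude non-morbidities from subset
# find rows in which the sum of diseases is greater than 2
# (column 1 holds the ID, which is reused as a row index below; that only
# works because ID equals the row number in this data)
comorbids <- dat[which(rowSums(dat[,6:12]) > 2),1]
dat[comorbids,"comorbid"] <- TRUE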
pander::pander(dat)
ID  Age  CarsQuintie  age_group  CarsQuintie_group  Diabetes  Asthma
1   18   2            18 - 24    3                  1         1
2   77   1            65 - 74    1                  0         1
3   25   3            25 - 34    4                  0         0
4   30   1            25 - 34    3                  0         0
5   54   1            55 - 64    1                  0         0
6   78   5            75 - 84    5                  1         1
7   69   1            65 - 74    1                  1         1
8   62   1            55 - 64    2                  0         0
9   68   5            55 - 64    1                  1         1
10  63   1            55 - 64    3                  1         0

Table continues below

Stroke  Heart.attack  Hypertension  Eczema  Depression  comorbid
0       1             0             0       0           TRUE
1       1             0             1       0           TRUE
0       0             1             0       0           FALSE
0       0             0             0       1           FALSE
0       0             1             1       0           FALSE
0       1             0             0       0           TRUE
0       1             1             0       0           TRUE
0       0             0             0       1           FALSE
0       1             0             1       0           TRUE
0       1             0             0       0           FALSE
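
On the question of how prevalence changes with the definition (3 or more versus 4 or more conditions), a minimal sketch along these lines may be a starting point. It rebuilds the toy data with only age_group and the seven non-constant conditions; prevalence_by_age is a hypothetical helper name, not a function from any package.

```r
# Toy data from this thread, reduced to age_group and the condition columns
dat <- data.frame(
  age_group = c("18 - 24", "65 - 74", "25 - 34", "25 - 34", "55 - 64",
                "75 - 84", "65 - 74", "55 - 64", "55 - 64", "55 - 64"),
  Diabetes     = c(1, 0, 0, 0, 0, 1, 1, 0, 1, 1),
  Asthma       = c(1, 1, 0, 0, 0, 1, 1, 0, 1, 0),
  Stroke       = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  Heart.attack = c(1, 1, 0, 0, 0, 1, 1, 0, 1, 1),
  Hypertension = c(0, 0, 1, 0, 1, 0, 1, 0, 0, 0),
  Eczema       = c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0),
  Depression   = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
)

# number of conditions per person (every column except age_group)
n_conditions <- rowSums(dat[, -1])

# hypothetical helper: share of people in each age_group with >= k conditions
prevalence_by_age <- function(k) tapply(n_conditions >= k, dat$age_group, mean)

prev3 <- prevalence_by_age(3)  # definition: 3 or more conditions
prev4 <- prevalence_by_age(4)  # definition: 4 or more conditions
prev3
prev4
```

With ten rows the proportions are only illustrative; the same two lines should carry over to the full data as long as the condition columns are numeric 0/1.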

Thanks for this very helpful explanation and code.

I would like to work out the prevalence of each specific combination of 3 diseases, for example:

Calculate prevalence in each age_group bin for:

Combination 1: Diabetes, Asthma, Stroke
Combination 2: Diabetes, Asthma, Heart.attack
Combination 3: Diabetes, Asthma, COPD
Combination 4: Diabetes, Asthma, Hypertension
Combination 5: Diabetes, Asthma, Eczema
... and so on (I think there might be a total of 56 potential combinations, but I might be wrong)

Then combine the prevalence of each combination to create a distribution with mean and SD, stratified by age_group bin, and plot it?
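
The count can be checked directly. This is a small sketch using the eight condition names from the data above; choose() and combn() are base R, and the object names are only illustrative.

```r
conditions <- c("Diabetes", "Asthma", "Stroke", "Heart.attack",
                "COPD", "Hypertension", "Eczema", "Depression")

# number of ways to pick 3 of 8: 8! / (3! * 5!)
n_combos <- choose(length(conditions), 3)

# one column per combination, three condition names per column
combos <- combn(conditions, 3)

n_combos
dim(combos)
combos[, 1:5]  # the first five match the list above
```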

See if this gets you to the next step.

dat <- data.frame(
  ID =
    c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Age =
    c(18, 77, 25, 30, 54, 78, 69, 62, 68, 63),
  Sex =
    c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  CarsQuintie =
    c(2, 1, 3, 1, 1, 5, 1, 1, 5, 1),
  age_group =
    c("18 - 24", "65 - 74", "25 - 34", "25 - 34", "55 - 64", "75 - 84", "65 - 74", "55 - 64", "55 - 64", "55 - 64"),
  CarsQuintie_group =
    c(3, 1, 4, 3, 1, 5, 1, 2, 1, 3),
  Diabetes =
    c(1, 0, 0, 0, 0, 1, 1, 0, 1, 1),
  Asthma =
    c(1, 1, 0, 0, 0, 1, 1, 0, 1, 0),
  Stroke =
    c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  Heart.attack =
    c(1, 1, 0, 0, 0, 1, 1, 0, 1, 1),
  COPD =
    c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  Hypertension =
    c(0, 0, 1, 0, 1, 0, 1, 0, 0, 0),
  Eczema =
    c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0),
  Depression =
    c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0))

# Sex and COPD are constants
dat <- dat[,-c(3,11)]

# create a placeholder variable for result
dat$comorbid <- FALSE

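# Exclude non-morbidities from subset
# find rows in which the sum of diseases is greater than 2
# (column 1 holds the ID, which is reused as a row index below; that only
# works because ID equals the row number in this data)
comorbids <- dat[which(rowSums(dat[,6:12]) > 2),1]
dat[comorbids,"comorbid"] <- TRUE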

# find the combinations of 7 items (corresponding to the column positions
# of the data frame that indicate a disease type presence) 
cases <- combn(6:12,3)
# example, the combination of Diabetes, Asthma and Stroke
dat[,cases[,1]]
#>    Diabetes Asthma Stroke
#> 1         1      1      0
#> 2         0      1      1
#> 3         0      0      0
#> 4         0      0      0
#> 5         0      0      0
#> 6         1      1      0
#> 7         1      1      0
#> 8         0      0      0
#> 9         1      1      0
#> 10        1      0      0

# function to find the IDs of rows where conditions > 2 for any given combination
make_comb <- function(x) dat[which(rowSums(dat[,cases[,x]]) > 2),1]

# function to create a subset by combination with rows that satisfy > 2
# (match on ID rather than reusing the IDs as column and row positions)
show_result <- function(x) dat[dat$ID %in% make_comb(x),]
# example for first combination, which has no more than 2 per row
show_result(1)
#>  [1] ID                Age               CarsQuintie       age_group        
#>  [5] CarsQuintie_group Diabetes          Asthma            Stroke           
#>  [9] Heart.attack      Hypertension      Eczema            Depression       
#> [13] comorbid         
#> <0 rows> (or 0-length row.names)
# example for second combination, which has rows with more than 2 per row
show_result(2)
#>   ID Age CarsQuintie age_group CarsQuintie_group Diabetes Asthma Stroke
#> 1  1  18           2   18 - 24                 3        1      1      0
#> 6  6  78           5   75 - 84                 5        1      1      0
#> 7  7  69           1   65 - 74                 1        1      1      0
#> 9  9  68           5   55 - 64                 1        1      1      0
#>   Heart.attack Hypertension Eczema Depression comorbid
#> 1            1            0      0          0     TRUE
#> 6            1            0      0          0     TRUE
#> 7            1            1      0          0     TRUE
#> 9            1            0      1          0     TRUE

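# do all at one pass
# caution: apply() passes the contents of each column of cases (three column
# positions) to show_result, not a combination number, so the list elements
# below are not per-combination results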
apply(cases, 2, show_result)
#> [[1]]
#>    ID Age CarsQuintie age_group CarsQuintie_group Diabetes Asthma Stroke
#> 1   1  18           2   18 - 24                 3        1      1      0
#> 2   2  77           1   65 - 74                 1        0      1      1
#> 6   6  78           5   75 - 84                 5        1      1      0
#> 7   7  69           1   65 - 74                 1        1      1      0
#> 9   9  68           5   55 - 64                 1        1      1      0
#> 10 10  63           1   55 - 64                 3        1      0      0
#>    Heart.attack Hypertension Eczema Depression comorbid
#> 1             1            0      0          0     TRUE
#> 2             1            0      1          0     TRUE
#> 6             1            0      0          0     TRUE
#> 7             1            1      0          0     TRUE
#> 9             1            0      1          0     TRUE
#> 10            1            0      0          0    FALSE
#> 
#> [[2]]
#>    ID Age CarsQuintie age_group CarsQuintie_group Diabetes Asthma Stroke
#> 1   1  18           2   18 - 24                 3        1      1      0
#> 2   2  77           1   65 - 74                 1        0      1      1
#> 6   6  78           5   75 - 84                 5        1      1      0
#> 7   7  69           1   65 - 74                 1        1      1      0
#> 9   9  68           5   55 - 64                 1        1      1      0
#> 10 10  63           1   55 - 64                 3        1      0      0
#>    Heart.attack Hypertension Eczema Depression comorbid
#> 1             1            0      0          0     TRUE
#> 2             1            0      1          0     TRUE
#> 6             1            0      0          0     TRUE
#> 7             1            1      0          0     TRUE
#> 9             1            0      1          0     TRUE
#> 10            1            0      0          0    FALSE
#> 
#> [[3]]
#>    ID Age CarsQuintie age_group CarsQuintie_group Diabetes Asthma Stroke
#> 1   1  18           2   18 - 24                 3        1      1      0
#> 2   2  77           1   65 - 74                 1        0      1      1
#> 6   6  78           5   75 - 84                 5        1      1      0
#> 7   7  69           1   65 - 74                 1        1      1      0
#> 9   9  68           5   55 - 64                 1        1      1      0
#> 10 10  63           1   55 - 64                 3        1      0      0
#>    Heart.attack Hypertension Eczema Depression comorbid
#> 1             1            0      0          0     TRUE
#> 2             1            0      1          0     TRUE
#> 6             1            0      0          0     TRUE
#> 7             1            1      0          0     TRUE
#> 9             1            0      1          0     TRUE
#> 10            1            0      0          0    FALSE
#> 
#### remaining output omitted

#> [[4]]
#>    ID Age CarsQuintie age_group CarsQuintie_group Diabetes Asthma Stroke
#> 1   1  18           2   18 - 24                 3        1      1      0
#> 2   2  77           1   65 - 74                 1        0      1      1
#> 6   6  78           5   75 - 84                 5        1      1      0
#> 7   7  69           1   65 - 74                 1        1      1      0
#> 9   9  68           5   55 - 64                 1        1      1      0
#> 10 10  63           1   55 - 64                 3        1      0      0
#>    Heart.attack Hypertension Eczema Depression comorbid
#> 1             1            0      0          0     TRUE
#> 2             1            0      1          0     TRUE
#> 6             1            0      0          0     TRUE
#> 7             1            1      0          0     TRUE
#> 9             1            0      1          0     TRUE
#> 10            1            0      0          0    FALSE
#> 
#> [[5]]
#>    ID Age CarsQuintie age_group CarsQuintie_group Diabetes Asthma Stroke
#> 1   1  18           2   18 - 24                 3        1      1      0
#> 2   2  77           1   65 - 74                 1        0      1      1
#> 6   6  78           5   75 - 84                 5        1      1      0
#> 7   7  69           1   65 - 74                 1        1      1      0
#> 9   9  68           5   55 - 64                 1        1      1      0
#> 10 10  63           1   55 - 64                 3        1      0      0
#### remaining output omitted
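
For the prevalence of each specific combination by age_group, a sketch along the following lines may be closer to the goal. It is only one possible approach: has_all is a hypothetical helper, the data is the toy data reduced to ID, age_group and the seven non-constant conditions, and the loop runs over combination numbers with seq_len(ncol(cases)) so that each call sees exactly one column of cases.

```r
dat <- data.frame(
  ID = 1:10,
  age_group = c("18 - 24", "65 - 74", "25 - 34", "25 - 34", "55 - 64",
                "75 - 84", "65 - 74", "55 - 64", "55 - 64", "55 - 64"),
  Diabetes     = c(1, 0, 0, 0, 0, 1, 1, 0, 1, 1),
  Asthma       = c(1, 1, 0, 0, 0, 1, 1, 0, 1, 0),
  Stroke       = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
  Heart.attack = c(1, 1, 0, 0, 0, 1, 1, 0, 1, 1),
  Hypertension = c(0, 0, 1, 0, 1, 0, 1, 0, 0, 0),
  Eczema       = c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0),
  Depression   = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
)

cond_cols <- 3:9               # positions of the condition columns
cases <- combn(cond_cols, 3)   # one column per combination of three
groups <- sort(unique(dat$age_group))

# hypothetical helper: TRUE where a person has all three conditions in cols
has_all <- function(cols) rowSums(dat[, cols]) == 3

# rows = age groups, columns = combinations, cells = prevalence
prev <- vapply(
  seq_len(ncol(cases)),
  function(j) as.vector(tapply(has_all(cases[, j]),
                               factor(dat$age_group, levels = groups), mean)),
  numeric(length(groups))
)
rownames(prev) <- groups
colnames(prev) <- apply(cases, 2,
                        function(cols) paste(names(dat)[cols], collapse = " + "))

# mean and SD of prevalence across combinations, per age group
summary_by_age <- data.frame(
  age_group = groups,
  mean_prev = rowMeans(prev),
  sd_prev   = apply(prev, 1, sd)
)
summary_by_age
```

The summary_by_age data frame has one row per age_group and could be passed to a plotting function for the mean and SD plot.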

Hi, thank you so much for this. This is exactly what I am looking for. Sorry for the late reply, I have spent the last few days working on this (very new to R!) with the real csv file. When I run this on the reprex it works, but when I run it on the real csv, which has the same structure but is longer, I get the following error messages:

Error: Must subset columns with a valid subscript vector.
x Subscript cases[, x] must be a simple vector, not a matrix.
Run rlang::last_error() to see where the error occurred.

# do all at one pass
apply(cases, 2, show_result)
Error: Must subset columns with a valid subscript vector.
x Subscript cases[, x] must be a simple vector, not a matrix.
Run rlang::last_error() to see where the error occurred.

The errors occur when it gets to these commands. Do you know why this might be happening with my csv file and not the reprex?

show_result(1)
show_result(2)
apply(cases, 2, show_result)

Sorry, I meant to say that it runs as in the reprex until the following commands:

show_result(1)
show_result(2)
apply(cases, 2, show_result)

when I get the following error messages:

Error: Must subset columns with a valid subscript vector.
x Subscript cases[, x] must be a simple vector, not a matrix.
Run rlang::last_error() to see where the error occurred.

Error: Must subset columns with a valid subscript vector.
x Subscript cases[, x] must be a simple vector, not a matrix.
Run rlang::last_error() to see where the error occurred.

Things to debug:

  1. class(cases)
  2. cases[,3] # or some other column
  3. rlang::last_error() immediately after running apply(cases, 2, show_result)
  4. Make sure that the show_result function is using cases to subset the data, like

Thanks very much. Not sure I understand the first three changes, but I made the final one at the show_result function. Code and errors below. Thank you again.

dat <- dat[,-c(3,45)]

dat$comorbid <- FALSE

comorbids <- dat[which(rowSums(dat[,7:20]) > 2),1]
dat[comorbids,"comorbid"] <- TRUE

cases <- combn(7:20,3)

dat[,cases[,1]]

make_comb <- function(x) dat[which(rowSums(dat[,cases[,x]]) > 2),1]

show_result <- function(x) dat[dat[make_comb(x)][which(rowSums(dat[,cases[,1]]) > 2),1],]

show_result(1)

show_result(2)

apply(cases, 2, show_result)

Shows this in console:

dat <- dat[,-c(3,45)]

dat$comorbid <- FALSE

comorbids <- dat[which(rowSums(dat[,7:20]) > 2),1]
dat[comorbids,"comorbid"] <- TRUE
Error: Must assign to rows with a valid subscript vector.
x Subscript comorbids has the wrong type tbl_df<UniquePatientID:double>.
ℹ It must be logical, numeric, or character.
Run rlang::last_error() to see where the error occurred.

cases <- combn(7:20,3)

dat[,cases[,1]]

# A tibble: 65,534 x 3
   Depression PainfulCondition ActiveAsthma
 1          0                0            0
 2          0                0            1
 3          0                0            0
 4          0                0            0
 5          0                0            0
 6          0                0            0
 7          0                0            0
 8          0                0            0
 9          0                1            0
10          0                0            0
# … with 65,524 more rows

make_comb <- function(x) dat[which(rowSums(dat[,cases[,x]]) > 2),1]

show_result <- function(x) dat[dat[make_comb(x)][which(rowSums(dat[,cases[,1]]) > 2),1],]

show_result(1)
Error: Must subset columns with a valid subscript vector.
x Subscript make_comb(x) has the wrong type tbl_df<UniquePatientID:double>.
ℹ It must be logical, numeric, or character.
Run rlang::last_error() to see where the error occurred.
show_result(2)
Error: Must subset columns with a valid subscript vector.
x Subscript make_comb(x) has the wrong type tbl_df<UniquePatientID:double>.
ℹ It must be logical, numeric, or character.
Run rlang::last_error() to see where the error occurred.
apply(cases, 2, show_result)
Error: Must subset columns with a valid subscript vector.
x Subscript cases[, x] must be a simple vector, not a matrix.
Run rlang::last_error() to see where the error occurred.
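
The tbl_df in those messages suggests that dat is a tibble here rather than a plain data frame. A tibble keeps dat[rows, 1] as a one-column tibble instead of dropping it to a vector, so the result cannot be reused as a subscript, and it rejects a matrix as a column subscript. Below is a sketch of one way to sidestep both, under stated assumptions: the four-row data is made up, Diabetes is added only to have four condition columns, and only the names UniquePatientID, Depression, PainfulCondition and ActiveAsthma come from the output above.

```r
# Hypothetical stand-in for the real file; all values are made up
dat <- data.frame(
  UniquePatientID  = c(5001, 5002, 5003, 5004),
  Depression       = c(1, 0, 1, 0),
  PainfulCondition = c(1, 0, 1, 1),
  ActiveAsthma     = c(1, 1, 0, 1),
  Diabetes         = c(0, 1, 1, 1)
)

# If dat was read with readr::read_csv() it is a tibble; as.data.frame()
# restores base subsetting rules (it changes nothing for a plain data frame)
dat <- as.data.frame(dat)

cond_cols <- 2:5  # positions of the condition columns (adjust for the real file)

# A logical vector per row avoids using patient IDs as row numbers
dat$comorbid <- rowSums(dat[, cond_cols]) > 2

cases <- combn(cond_cols, 3)

# j is a combination number, so cases[, j] is always a plain vector of three
# column positions; the rows kept are those with all three conditions
show_result <- function(j) dat[rowSums(dat[, cases[, j]]) == 3, ]

# iterate over combination numbers rather than over the columns of cases
results <- lapply(seq_len(ncol(cases)), show_result)
results[[1]]
```

On the real file cond_cols would be 7:20, taken from the code posted above; whether those positions still point at the condition columns after the dat[,-c(3,45)] step is worth checking with names(dat)[7:20].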