What does the error message mean?

Hi, I'm looking to find the total number of unique combinations of 3 diseases within a group of 20 conditions (factorial). I have code from a reprex that works, and I've made my csv the same shape (diseases begin from column 6 onwards), but it throws an error message when using the real file. I want to find all possible combinations and calculate the prevalence of each combination, to then plot as mean and sd. What is the difference between the csv and the reprex? (Reprex right at the bottom.)

Thanks

Code:

library(tidyverse)
library(utils)

dat <- read_csv("005_trimmed_spice.csv")

dat <- dat[,-c(3,20)]

dat$comorbid <- FALSE

comorbids <- dat[which(rowSums(dat[,7:20]) > 2),1]
dat[comorbids,"comorbid"] <- TRUE

cases <- combn(7:20,3)

dat[,cases[,1]]

make_comb <- function(x) dat[which(rowSums(dat[,cases[,x]]) > 2),1]

show_result <- function(x) dat[dat[make_comb(x)][which(rowSums(dat[,cases[,1]]) > 2),1],]

show_result(1)

show_result(2)

apply(cases, 2, show_result)

Console:

dat <- read_csv("005_trimmed_spice.csv")
New names: 0s

  • `` -> ...47
  • `` -> ...48
  • `` -> ...49
  • `` -> ...50
  • `` -> ...51
  • ...
    Rows: 65534 Columns: 86
    ── Column specification ─────────────────────────────────────────────
    Delimiter: ","
    chr (1): age_group
    dbl (45): UniquePatientID, Age, Sex, CarstairsQuintile, Carstairs...
    lgl (40): ...47, ...48, ...49, ...50, ...51, ...52, ...53, ...54,...

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

dat <- dat[,-c(3,20)]

dat$comorbid <- FALSE

comorbids <- dat[which(rowSums(dat[,7:20]) > 2),1]
dat[comorbids,"comorbid"] <- TRUE
Error: Must assign to rows with a valid subscript vector.
x Subscript comorbids has the wrong type tbl_df<UniquePatientID:double>.
ℹ It must be logical, numeric, or character.
Run rlang::last_error() to see where the error occurred.

cases <- combn(7:20,3)

dat[,cases[,1]]

A tibble: 65,534 x 3

Depression PainfulCondition ActiveAsthma

1 0 0 0
2 0 0 1
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 1 0
10 0 0 0

… with 65,524 more rows

make_comb <- function(x) dat[which(rowSums(dat[,cases[,x]]) > 2),1]

show_result <- function(x) dat[dat[make_comb(x)][which(rowSums(dat[,cases[,1]]) > 2),1],]

show_result(1)
Error: Must subset columns with a valid subscript vector.
x Subscript make_comb(x) has the wrong type tbl_df<UniquePatientID:double>.
ℹ It must be logical, numeric, or character.
Run rlang::last_error() to see where the error occurred.
show_result(2)
Error: Must subset columns with a valid subscript vector.
x Subscript make_comb(x) has the wrong type tbl_df<UniquePatientID:double>.
ℹ It must be logical, numeric, or character.
Run rlang::last_error() to see where the error occurred.
apply(cases, 2, show_result)
Error: Must subset columns with a valid subscript vector.
x Subscript cases[, x] must be a simple vector, not a matrix.
Run rlang::last_error() to see where the error occurred.

Practice reprex where code above worked:

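# opening call was missing from the post; data.frame() and the name dat are assumed
dat <- data.frame(
ID =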
c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
Age =
c(18, 77, 25, 30, 54, 78, 69, 62, 68, 63),
Sex =
c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
CarsQuintie =
c(2, 1, 3, 1, 1, 5, 1, 1, 5, 1),
age_group =
c("18 - 24", "65 - 74", "25 - 34", "25 - 34", "55 - 64", "75 - 84", "65 - 74", "55 - 64", "55 - 64", "55 - 64"),
CarsQuintie_group =
c(3, 1, 4, 3, 1, 5, 1, 2, 1, 3),
Diabetes =
c(1, 0, 0, 0, 0, 1, 1, 0, 1, 1),
Asthma =
c(1, 1, 0, 0, 0, 1, 1, 0, 1, 0),
Stroke =
c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
Heart.attack =
c(1, 1, 0, 0, 0, 1, 1, 0, 1, 1),
COPD =
c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
Hypertension =
c(0, 0, 1, 0, 1, 0, 1, 0, 0, 0),
Eczema =
c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0),
Depression =
c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0))


Subset functions take index addresses in the form x,y = row,column. The index has to be a plain vector (the error message lists logical, numeric, or character); a one-column tibble such as comorbids cannot be used as an index directly. What's needed is to use the data you have to perform some logical operation that identifies the rows where the condition is satisfied.
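The error text in the console output names the subscript's type as tbl_df, which points at a difference between tibbles (what read_csv returns) and base data frames. The sketch below uses a made-up toy table, and assumes the reprex was built with data.frame(); the names df, tbl, rows and the ids_* variables are illustrative only.

```r
library(tibble)

# Hypothetical toy data: an ID column plus three disease flags
df  <- data.frame(ID = c(101, 102, 103),
                  A = c(1, 0, 1), B = c(1, 1, 1), C = c(1, 0, 0))
tbl <- as_tibble(df)

# Row positions where all three flags are set
rows <- which(rowSums(df[, 2:4]) > 2)

# A base data frame drops a single selected column to a plain vector ...
ids_df <- df[rows, 1]
is.data.frame(ids_df)    # FALSE

# ... but a tibble keeps it as a one-column tibble,
# which `[` then refuses to accept as an index
ids_tbl <- tbl[rows, 1]
is.data.frame(ids_tbl)   # TRUE

# `[[` pulls the column out as a plain vector from either kind of table
ids_vec <- tbl[[1]][rows]
```

If that is what is happening here, extracting the column with `[[` (or keeping the which() result itself as the row index) avoids passing a tibble where a vector is expected.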

Thank you very much. Why does the csv need a subset function when the reprex didn't? Do I need to change the data type to integer for the current code to work? I'm very early in learning R, so I'm not sure how to perform that logical operation with which: where would it go and what would it look like?

Many thanks again

We all start out in R feeling like an ant attempting to eat its way through the Amazon rain forest. Hadley has a helpful insight into subsets

This may help

# find the row numbers of cars in the mtcars dataset with 4 cylinders
which(mtcars$cyl == 4)
#>  [1]  3  8  9 18 19 20 21 26 27 28 32
# use that directly to subset
mtcars[which(mtcars$cyl == 4),] # equal to only these rows and ',' all columns
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# make it easier to follow by creating a variable with the indices
dex <- which(mtcars$cyl == 4)
# use it to subset
mtcars[dex,] # again, note the trailing comma
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

In teasing out these sorts of problems, it's very easy to get distracted by all the punctuation, which is why no one should feel embarrassed to break the problem down into baby steps.

Thank you very much, and that is definitely what it feels like at the moment! So is the problem that the csv variables are dbl but they should be numeric? Could I just change the csv file with as.numeric() then the existing code would work? The code you sent worked perfectly with the reprex.

Many thanks

No, it's not the data, it's how the data is addressed.

a_data <- mtcars[3,1] # Datsun's row, column index
a_data # Datsun mileage, not an integer
#> [1] 22.8
# can't subset with the data if it's not an integer
mtcars[a_data,] # well, you can, but it's nonsense
#>                   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> Dodge Challenger 15.5   8  318 150 2.76 3.52 16.87  0  0    3    2
# index based on whether there are only 3 gears
three_speed <- which(mtcars$gear == 3) 
# rows with three-speed gear values
three_speed
#>  [1]  4  5  6  7 12 13 14 15 16 17 21 22 23 24 25
# use those rows to subset
mtcars[three_speed,]
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# row value -- greater than number of rows
mtcars[100,]
#>    mpg cyl disp hp drat wt qsec vs am gear carb
#> NA  NA  NA   NA NA   NA NA   NA NA NA   NA   NA
# ditto column
mtcars[,100]
#> Error in `[.data.frame`(mtcars, , 100): undefined columns selected
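On the dbl question specifically, a short check may help: dbl is how a tibble labels R's double type, and a double is already numeric, so no conversion is involved. A minimal sketch using mtcars (the names x and first_and_third are illustrative):

```r
# "dbl" in a tibble printout means double, which is a numeric type
x <- c(1, 3)
typeof(x)       # "double"
is.numeric(x)   # TRUE

# A plain double vector is accepted as a row index
first_and_third <- mtcars[x, ]
rownames(first_and_third)   # "Mazda RX4" "Datsun 710"
```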

Thanks. I'm not sure I follow. Do you mean making a subset out of the df? I need age_group and CarsQuintile to remain in the df so results can be calculated for each population. If I use the which function, wouldn't I just be selecting the columns I already have? I'm not sure how the data not being in integer form is dealt with in the code above, although I can see how the subset is made, which is helpful. I think the difference between your example code and what I need must be small. How would I change the variables to integer or numeric in this code?

Many thanks again

To subset a data frame means selecting rows and columns that meet conditions. which() takes a logical test and returns the index locations where the test is true, not the values of the data frame at those locations. With the return value of which(), it is possible to use the [ operator to return only that portion of the data frame contents that meets the conditions.
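Applied to 0/1 disease flags, the same pattern might look like the sketch below. The table toy, its column names, and the variables disease_cols and multi are all made up for illustration; the real file's column positions would differ.

```r
# Hypothetical toy table: one row per patient, 0/1 flags for four conditions
toy <- data.frame(
  ID       = c(11, 12, 13, 14),
  Diabetes = c(1, 0, 1, 1),
  Asthma   = c(1, 0, 1, 0),
  Stroke   = c(1, 1, 0, 0),
  COPD     = c(0, 0, 1, 0)
)

disease_cols <- 2:5

# Row positions of patients with three or more conditions
multi <- which(rowSums(toy[, disease_cols]) > 2)

# Use the positions directly as the row index
toy$comorbid <- FALSE
toy[multi, "comorbid"] <- TRUE
```

Here multi holds row positions 1 and 3, so those two rows are flagged and every other column stays in the table.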

The latest part of this thread has been on the mechanics of subsets and the difference between using data values, which is seldom what is needed, and index locations, which is almost always what is needed. It does not matter to the index what type of value it points to: it can be an integer, a double, a logical or a character. What matters is that the index represents a particular row,column address.
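A small sketch of the value-versus-location distinction, using a made-up table whose IDs do not match its row numbers (toy, pos, ids, by_pos and by_val are illustrative names):

```r
# Hypothetical table where the ID values differ from the row positions
toy <- data.frame(ID = c(501, 502, 503), Diabetes = c(1, 0, 1))

pos <- which(toy$Diabetes == 1)   # index locations: 1 and 3
ids <- toy$ID[pos]                # data values: 501 and 503

by_pos <- toy[pos, ]   # rows 1 and 3, as intended
by_val <- toy[ids, ]   # rows 501 and 503 do not exist, so the result is all NA
```

In the reprex the ID column runs from 1 to 10, so ID values happen to equal row positions there. That coincidence may be part of why indexing by ID appeared to work on the reprex; it would not carry over to a file whose patient IDs are arbitrary numbers.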

Many thanks. Just wondering where that fits into this code - why does calling the variable location work for the reprex and not for my csv file? Do you mean put in a where() at this row: dat <- dat[,-c(3,20)]?