Turning all table elements into column IDs then new table 0 or 1 if the patient (row ID) had that element

HariSeldon · July 5, 2021, 10:06pm

I need to create a table that has patient names in the first column then every column after that contains either a zero or one depending on whether the patient had that disease. Also I need to know how to keep the tibble TSV small because when I fill it with NA instead of 0 the table is ~130 GB. Is integer (0 or 1) or boolean/logical (TRUE/FALSE) the smallest?

CURRENT DATA anonymized stub example (there are blank spaces that need to be ignored: id1 has a blank space in f.41270.3 ID2 has a blank space in f.41270.2 because 184.11 is actually in the f.41270.3 column the tabs just didn't work in this post and id3 is all blanks)
ID f.41270.1 f.41270.2 f.41270.3
id1 184.11 151.11
id2 987 184.11
id3

CONVERTED DATA TO CREATE
Either add X_ or force columns to be col_character() because ICD9 has no char it is double but ICD10 is char
ID X_184.11 X_151.11 X_987
id1 1 1 0
id2 1 0 1
id3 0 0 0

EDIT:
library(data.table)

patients = data.table(
ID = c("id1", "id2", "id3"),
f.41270.1 = c("184.11", "987", ""),
f.41270.2 = c("151.11", "", ""),
f.41270.3 = c("", "184.11", "")
)

ICDl = data.table(c("184.11", "151.11", "987", "184.11"))
END EDIT

library(tidyverse)
library(data.table)

ICDl <- read_tsv("ICD_long.txt") # row of all ICD disease codes NOT the general codes f.41270.1 f.41270.2 f.41270.3 but rather the specific codes 184.11 151.11 987 184.11
patients <- read_tsv("patients.txt") # column of patient IDs

patientsRows = count(patients, vars = "ID")
patientsRowsToAdd = as.integer(patientsRows[2])-1

# https://www.rdocumentation.org/packages/berryFunctions/versions/1.20.1/topics/addRows
ICDl = addRows(ICDl, patientsRowsToAdd)
# This fills the blank spaces in the table with NA and that makes the full table too big (~130 GB)

# https://www.rdocumentation.org/packages/berryFunctions/versions/1.20.1/topics/addRows
ICDl = addRows(ICDl, patientsRowsToAdd, values = "0")
# "Killed" I guess I cannot do values = "0" to try to make the table smaller

# https://readr.tidyverse.org/reference/cols.html
# https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/type.convert
ICDl = type.convert(ICDl)
# This could make the table smaller by changing the col_character() columns to col_integer() or col_logical() but I cannot test it yet because values = "0" doesn't work

# https://stackoverflow.com/questions/19508256/how-to-add-new-column-to-an-dataframe-to-the-front-not-end
ICDlPD = cbind(patients, ICDl)
# addRows was required to get cbind to work because cbind only adds a column if the number of rows is equal
# this should add the patient IDs to the first column but leaves all of the ICD disease codes in the first row

# Then I can figure out how to add 1s everywhere the patient has the ICD disease code.  Or maybe I could skip the above if you tell me how to add the patients one row at a time.

williaml · July 5, 2021, 10:19pm

Can you provide a reproducible example of your dataset?

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

HariSeldon · July 5, 2021, 10:31pm

library(data.table)

patients = data.table(
ID = c("id1", "id2", "id3"),
f.41270.1 = c("184.11", "987", ""),
f.41270.2 = c("151.11", "", ""),
f.41270.3 = c("", "184.11", "")
)

ICDl = data.table(c("184.11", "151.11", "987", "184.11"))

CONVERTED DATA TO CREATE
Either add X_ or force columns to be col_character() because ICD9 has no char it is double but ICD10 is char
ID X_184.11 X_151.11 X_987
id1 1 1 0
id2 1 0 1
id3 0 0 0

williaml · July 6, 2021, 7:54am

For this question, it looks like there is no difference.

x <- c(TRUE, TRUE, FALSE)
y <- as.integer(c(1, 2, 3))

> object.size(x)
64 bytes
> object.size(y)
64 bytes

You might need to provide a little bit more data in your example. What happens to the ID column in patients?

nirgrahamuk · July 6, 2021, 10:09am

I like to use dtplyr to leverage convenience of tidyverse verbiage with performance of DT backend.
I tried Matrix library as it supports a sparse representation that would seem suitable to this task, though you may be limited in terms of ease of operations upon it...

library(dtplyr) #renv::install("tidyverse/dtplyr")
library(dplyr)
library(tidyr)
library(forcats)
library(Matrix)
library(fst)
library(pryr)
patients = data.frame(
  ID = c("id1", "id2", "id3"),
  f.41270.1 = c("184.11", "987", ""),
  f.41270.2 = c("151.11", "", ""),
  f.41270.3 = c("", "184.11", "")
) 
object_size(patients)
#2.64Kb 

plong_0 <- patients %>% lazy_dt %>% pivot_longer(-ID) %>%
  filter(value!="") %>% 
  mutate(row=as.numeric(ID),
         colid=value %>% 
           forcats::as_factor() %>% 
           as.numeric) 

plong_1 <- plong_0 %>% select(ID,row,colid) %>% collect()
object_size(plong_1)
#1.73Kb
# get column names will be using
diseases  <- distinct(plong_0,value) %>%
    mutate(colid=value %>% 
             forcats::as_factor() %>% 
             as.numeric ) %>%
    collect()


result <- sparseMatrix(i=plong_1$row,
                       j=plong_1$colid,
                       x = rep(1,nrow(plong_1)))
colnames(result) <- diseases$value
object_size(result)
#1.81Kb

result
saveRDS(result,file="testres.RDS",compress = TRUE)
#200B

HariSeldon · July 6, 2021, 2:46pm

The ID column in patients becomes the ID column in CONVERTED DATA TO CREATE, the first column of both is:
ID
id1
id2
id3

HariSeldon · July 8, 2021, 5:34pm

I got an error despite doing library(tidyverse):

Error in UseMethod("pivot_longer") :
  no applicable method for 'pivot_longer' applied to an object of class "c('dtplyr_step_first', 'dtplyr_step')"
Calls: %>% -> mutate -> filter -> pivot_longer
Execution halted

nirgrahamuk · July 8, 2021, 6:05pm

And the other libraries?
I.e. dtplyr

HariSeldon · July 8, 2021, 6:14pm

library(data.table)
library(parallel)
library(tidyverse)
library(dtplyr)
library(dplyr)
library(tidyr)
library(forcats)
library(Matrix)
library(fst)
library(pryr)

nirgrahamuk · July 8, 2021, 6:15pm

Is the error when you run my code or your code or both ?

HariSeldon · July 8, 2021, 6:16pm

Your code, I assume it is having problems with:

plong_0 <- patients %>% lazy_dt %>% pivot_longer(-"ID") %>%
  filter(value!="") %>% 
  mutate(row=as.numeric("ID"),
         colid=value %>% 
           forcats::as_factor() %>% 
           as.numeric)

nirgrahamuk · July 11, 2021, 9:12am

What process did you use to install dtplyr?. The CRAN version doesn't yet support pivot , but the dev version that I gave the code for installing through renv does.

HariSeldon · July 14, 2021, 2:57pm

Oh I used CRAN I didn't know that was the wrong thing to do.

HariSeldon · July 20, 2021, 8:00pm

That is not working it turns into all NA. plog_0's "row" column is all NA. I need a way to get the row number of the patient ID, but the patient ID itself does not contain the row number so as.numeric() would not give the row number.

> result <- sparseMatrix(i=plong_1$row,
+                        j=plong_1$colid,
+                        x = rep(1,nrow(plong_1)))
> Error in sparseMatrix(i = plong_1$row, j = plong_1$colid, x = rep(1, nrow(plong_1))) :
>   NA's in (i,j) are not allowed

NAs came from:

> plong_0 <- patients %>% lazy_dt %>% pivot_longer(-ID) %>%
+   filter(value!="") %>%
+   mutate(row=as.numeric(ID),
+          colid=value %>%
+            forcats::as_factor() %>%
+            as.numeric)
Error in UseMethod("pivot_longer") :
  no applicable method for 'pivot_longer' applied to an object of class "c('dtplyr_step_first', 'dtplyr_step')"
> plong_0 <- patients %>% pivot_longer(-ID) %>%
+   filter(value!="") %>%
 +   mutate(row=as.numeric(ID),
+          colid=value %>%
+            forcats::as_factor() %>%
+            as.numeric)
Warning message:
Problem with `mutate()` column `row`.
ℹ `row = as.numeric(ID)`.
ℹ NAs introduced by coercion

nirgrahamuk · July 21, 2021, 10:47am

I understand.
You are likely using R.4, at the time I answered my session was running R3.6 so the default for data.frame stringAsFactors was TRUE for me, and FALSE for you.

We can align by adding

  %>% mutate_all(as.factor) %>%

to the chain

plong_0 <- patients %>% lazy_dt %>%
  mutate_all(as.factor) %>%
  pivot_longer(-ID) %>%
  filter(value!="") %>% 
  mutate(row=as.numeric(ID),
         colid=value %>% 
           forcats::as_factor() %>% 
           as.numeric)

gitdemont · July 21, 2021, 3:36pm

Hi,

Not sure I fully understood your data structure and what you wnat to keep or not
But maybe this could help:

library(data.table)
patients <- data.table(
ID = c("id1", "id2", "id3"),
f.41270.1 = c("184.11", "987", ""),
f.41270.2 = c("151.11", "", ""),
f.41270.3 = c("", "184.11", "")
)
object.size(patients)

2440 bytes

patients2 <- apply(patients[, -1], MARGIN = 2, FUN = function(x) as.raw(x != ""))
rownames(patients2) <- patients$ID
object.size(patients2)

928 bytes

HariSeldon · July 21, 2021, 4:48pm

gitdemont, Let me explain how that is not quite right. You have the output column names be the disease group codes f.41270.1 f.41270.2 and f.41270.3 but I want the output column names to be the NOT group disease codes rather be the specific diseases 184.11 151.11 and 987

The point is to make a table to answer the question whether a certain patient had a specific disease, which is why I want the specific diseases to be the column names not the disease group codes.

Here is the output of your code:

head(patients2)
f.41270.1 f.41270.2 f.41270.3
id1 01 01 00
id2 01 00 01
id3 00 00 00

What I actually want is:

ID 184.11 151.11 987
id1 01 01 00
id2 01 00 01
id3 00 00 00

Maybe my example table isn't the best because your disease group code table above (patients2) and the specific disease code table right after that in this post look the same.

HariSeldon · July 21, 2021, 5:24pm

nirgrahamuk, I am using R402 so thanks for mutate_all but that doesn't fix the error that lazy_dt is not working for me and row is all NA.

plong_0 <- patients %>% lazy_dt %>%
mutate_all(as.factor) %>%
pivot_longer(-"ID") %>%
filter(value!="") %>%
mutate(row=as.numeric("ID"),
colid=value %>%
forcats::as_factor() %>%
as.numeric)

Error in UseMethod("pivot_longer") :
no applicable method for 'pivot_longer' applied to an object of class "c('dtplyr_step_mutate', 'dtplyr_step')"

So I continue without it but get an error that row is all NA. I think row is NOT supposed to be all NA what is going on?

plong_0 <- patients %>%
mutate_all(as.factor) %>%
pivot_longer(-"ID") %>%
filter(value!="") %>%
mutate(row=as.numeric("ID"),
colid=value %>%
forcats::as_factor() %>%
as.numeric)

Warning message:
Problem with mutate() column row.
row = as.numeric("ID").
NAs introduced by coercion

plong_1 <- plong_0 %>% dplyr::select("ID",row,colid) %>% collect()

diseases <- distinct(plong_0,value) %>%
mutate(colid=value %>%
forcats::as_factor() %>%
as.numeric ) %>%
collect()

result <- sparseMatrix(i=plong_1$row,
j=plong_1$colid,
x = rep(1,nrow(plong_1)))

Error in sparseMatrix(i = plong_1$row, j = plong_1$colid, x = rep(1, nrow(plong_1))) :
NA's in (i,j) are not allowed

nirgrahamuk · July 21, 2021, 5:53pm

your mutate quotes id , you have "ID", i dont have quotation around id, when i mutate it.
I.e. by quoting it you are litteraly asking for the numeric value of the string 'ID' rather than the numeric values of the factor integer values from the id column

HariSeldon · July 21, 2021, 7:47pm

I'm pretty sure mutate(row=as.numeric(ID) is not working it seems to not change id1 id2 and id3 to 1 2 and 3 it seems to turn them all to NA which is NOT what we want. Warning, I cannot expect my real patient data to have patient IDs that contain the row numbers like that.

Warning message:
Problem with mutate() column row.
row = as.numeric(ID).
NAs introduced by coercion

Not even your original code works I am suspicious about all the row being NA

plong_0 <- patients %>% pivot_longer(-ID) %>%
filter(value!="") %>%
mutate(row=as.numeric(ID),
colid=value %>%
forcats::as_factor() %>%
as.numeric)

plong_1 <- plong_0 %>% select(ID,row,colid) %>% collect()

diseases <- distinct(plong_0,value) %>%
mutate(colid=value %>%
forcats::as_factor() %>%
as.numeric ) %>%
collect()

result <- sparseMatrix(i=plong_1$row,
j=plong_1$colid,
x = rep(1,nrow(plong_1)))

Error in sparseMatrix(i = plong_1$row, j = plong_1$colid, x = rep(1, nrow(plong_1))) :
NA's in (i,j) are not allowed

Your new code still doesn't fix lazy_dt I a, using R402 and I still get errors about mutate do we really have to mutate what if we don't?:

librarian::shelf(tidyverse / dtplyr)
library(dplyr)
library(tidyr)
library(forcats)
library(Matrix)
library(fst)
library(pryr)

library(data.table)

patients = data.table(
ID = c("id1", "id2", "id3"),
f.41270.1 = c("184.11", "987", ""),
f.41270.2 = c("151.11", "", ""),
f.41270.3 = c("", "184.11", "")
)

plong_0 <- patients %>% lazy_dt %>%
mutate_all(as.factor) %>%
pivot_longer(-ID) %>%
filter(value!="") %>%
mutate(row=as.numeric(ID),
colid=value %>%
forcats::as_factor() %>%
as.numeric)

Error in UseMethod("pivot_longer") :
no applicable method for 'pivot_longer' applied to an object of class "c('dtplyr_step_mutate', 'dtplyr_step')"

plong_0 <- patients %>%
mutate_all(as.factor) %>%
pivot_longer(-ID) %>%
filter(value!="") %>%
mutate(row=as.numeric(ID),
colid=value %>%
forcats::as_factor() %>%
as.numeric)

Error in UseMethod("pivot_longer") :
no applicable method for 'pivot_longer' applied to an object of class "c('dtplyr_step_mutate', 'dtplyr_step')"

My current issue is

Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102
Calls: write_tsv ... as.data.frame -> as.matrix -> as.matrix -> as -> asMethod

Line 102 is:

99 plong_0 <- patients %>% pivot_longer(-ID) %>%
100 filter(value!="") %>%
101 mutate(row=as.numeric(ID),
102 colid=value %>%
103 forcats::as_factor() %>%
104 as.numeric)