Replace most non-header fields in a TSV file based on a TSV conversion table

HariSeldon · June 11, 2021, 11:47am

I am trying to convert all ICD codes in a tab separated file 002.txt to Phecodes (based on ICD-Phecode conversion tables tab separated files phecode_icd9_rolled.csv and Phecode_map_v1_2_icd10cm_beta.csv) for a biology bioinformatics project.

002.txt named patientdata anonymized table I need to replace all ICD codes with Phecodes but skip column names and first column row names

1   2   3   4   5   6   7   8   9
patient1   K40.9   K43.9   N20.0   N20.1   N23 N39.0   R69 Z88.1
patient2   B96.8   D12.6   E11.6   E87.6   I44.7   K40.9   K43.9   K52.9
patient3   NOT

Phecode_map_v1_2_icd10cm_beta.csv named ICCD10csv conversion table ICD10 to phecode

icd10cm phecode
K40.9   550.1
K43.9   550.5
N20.0   594.1
N20.1   594.3
N23 594.8
N39.0   591
R69 1019
Z88.1   960.1
B96.8   041
D12.6   208
E11.6   250.2
E87.6   276.14
I44.7   426.32
K40.9   550.1
K43.9   550.5
K52.9   558
XNO    17

library(tidyverse)

patientdata <- read_tsv("002.txt")
ICD9csv <- read_csv("phecode_icd9_rolled.csv")
ICCD10csv <- read_csv("Phecode_map_v1_2_icd10cm_beta.csv")

This link explains the various joins that tidyverse can do, and the closest to what I want is I think semi_join()
https://r4ds.had.co.nz/relational-data.html
Not sure if this is right

patientdata %>% 
  semi_join(ICCD10csv, by = "icd10cm")

I am also trying out merge() but am getting error messages
https://www.rdocumentation.org/packages/data.table/versions/1.14.0/topics/merge

patientdata10 <- merge(patientdata9, ICCD10csv, by.x = all, by.y = "icd10cm")
Error in as.vector(x, mode) :
cannot coerce type 'builtin' to vector of type 'any'

# This is trying to specify by.x as all elements in the first row of patientdata, meaning by.x will merge on all columns of patientdata
patientdata10 <- merge(patientdata9, ICCD10csv, by.x = patientdata9[,1], by.y = "icd10cm")
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical

patientdata10 <- merge(patientdata9, ICCD10csv, by.x = patientdata9[1,], by.y = "icd10cm")
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical

REPREX:

library(tidyverse)
library(data.table)

patientdata <- data.table(
                 1 = c(patient1, patient2, patient3),
                 2 = c(K40.9, B96.8, NOT),
                 3 = c(K43.9, D12.6),
                 4 = c(N20.0, E11.6),
                 5 = c(N20.1, E87.6),
                 6 = c(N23, I44.7),
                 7 = c(N39.0, K40.9),
                 8 = c(R69, K43.9),
                 9 = c(Z88.1, K52.9)
)

ICCD10csv <- data.table(
  icd10cm = c("K40.9", "K43.9", "N20.0", "N20.1", "N23", "N39.0", "R69", "Z88.1", "B96.8", "D12.6", "E11.6", "E87.6", "I44.7", "K40.9", "K43.9", "K52.9", "XNO"),
  phecode = c("550.1", "550.5", "594.1", "594.3", "594.8", "591", "1019", "960.1", "041", "208", "250.2", "276.14", "426.32", "550.1", "550.5", "558", "17")
)

# END OF SUCCESSFUL CODE START OF GUESSES

patientdata %>% 
  semi_join(ICCD10csv, by = "icd10cm")

patientdata10 <- merge(patientdata, ICCD10csv, by.x = all, by.y = "icd10cm")

# This is trying to specify by.x as all elements in the first row of patientdata, meaning by.x will merge on all columns of patientdata
patientdata10 <- merge(patientdata, ICCD10csv, by.x = patientdata[,1], by.y = "icd10cm")

patientdata10 <- merge(patientdata, ICCD10csv, by.x = patientdata[1,], by.y = "icd10cm")

nirgrahamuk · June 11, 2021, 12:12pm

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

HariSeldon · June 11, 2021, 12:40pm

OK I added a REPREX at the bottom I loaded tidyverse, patientdata, and ICD10csv and the rest of the code after that is failed attempts to replace the patientdata ICD10 codes with phecodes from the ICD10csv

nirgrahamuk · June 11, 2021, 1:00pm

HariSeldon:

library(tidyverse)

patientdata <- data.table(
                 1 = c(patient1, patient2, patient3),
                 2 = c(K40.9, B96.8, NOT),
                 3 = c(K43.9, D12.6),
                 4 = c(N20.0, E11.6),
                 5 = c(N20.1, E87.6),
                 6 = c(N23, I44.7),
                 7 = c(N39.0, K40.9),
                 8 = c(R69, K43.9),
                 9 = c(Z88.1, K52.9)
)

I don't find that your patientdata or ICCD10csv examples are reproducible. I.e. in a fresh R session, I only get errors when I try to have them defined.
I think you perhaps wrote them by hand , rather than using one of the automated methods suggested ? and perhaps didnt test the result ?

Can you please try again ?

HariSeldon · June 11, 2021, 1:18pm

Sorry I forgot the quotes around the column elements in data.table and forgot to import data.table.

I got the code working on my R Studio now see the output below

> library(tidyverse)
> library(data.table)
> patientdata = data.table(
+   a = c("patient1", "patient2", "patient3"),
+   b = c("K40.9", "B96.8", "NOT"),
+   c = c("K43.9", "D12.6", ""),
+   d = c("N20.0", "E11.6", ""),
+   e = c("N20.1", "E87.6", ""),
+   f = c("N23", "I44.7", ""),
+   g = c("N39.0", "K40.9", ""),
+   h = c("R69", "K43.9", ""),
+   i = c("Z88.1", "K52.9", "")
+ )
> ICCD10csv <- data.table(
+   icd10cm = c("K40.9", "K43.9", "N20.0", "N20.1", "N23", "N39.0", "R69", "Z88.1", "B96.8", "D12.6", "E11.6", "E87.6", "I44.7", "K40.9", "K43.9", "K52.9", "XNO"),
+   phecode = c("550.1", "550.5", "594.1", "594.3", "594.8", "591", "1019", "960.1", "041", "208", "250.2", "276.14", "426.32", "550.1", "550.5", "558", "17")
+ )
>

nirgrahamuk · June 11, 2021, 1:27pm


 (piv_pat <- pivot_longer(
   patientdata,
   cols=-a
 ))
 
 (piv_pat_jn <- left_join(piv_pat,
                         distinct(ICCD10csv),
                         by=c("value"="icd10cm")))
 
 (piv_pat_rewide <- pivot_wider(piv_pat_jn,
                                id_cols = "a",
                                names_from = "name",
                                values_from = "phecode"
                                ))

HariSeldon · June 11, 2021, 1:46pm

HariSeldon:

library(tidyverse)
library(data.table)

patientdata = data.table(
  a = c("patient1", "patient2", "patient3"),
  b = c("K40.9", "B96.8", "NOT"),
  c = c("K43.9", "D12.6", ""),
  d = c("N20.0", "E11.6", ""),
  e = c("N20.1", "E87.6", ""),
  f = c("N23", "I44.7", ""),
  g = c("N39.0", "K40.9", ""),
  h = c("R69", "K43.9", ""),
  i = c("Z88.1", "K52.9", "")
)

ICCD10csv <- data.table(
  icd10cm = c("K40.9", "K43.9", "N20.0", "N20.1", "N23", "N39.0", "R69", "Z88.1", "B96.8", "D12.6", "E11.6", "E87.6", "I44.7", "K40.9", "K43.9", "K52.9", "XNO"),
  phecode = c("550.1", "550.5", "594.1", "594.3", "594.8", "591", "1019", "960.1", "041", "208", "250.2", "276.14", "426.32", "550.1", "550.5", "558", "17")
)

Sorry I chose not to show you the real column names for privacy reasons, despite them being public information (e.g. f.eid (the only one I show here) showing up in others' public GitHub code createUKBphenome/scripts/function.createUKBphenome.r at master · umich-cphds/createUKBphenome · GitHub).

nirgrahamuk · June 11, 2021, 1:57pm

Hello,
I appreciate you may be frustrated but you just posted a very large post, with very little info of value contained.
So your example data doesnt match your real data by column name 'a' != 'f.eid'
Thats trivial to change, and I assume you have done so ,since you share an error related to a subsequent part of the code . But you havent made it easy to identify what line of code triggered the error you shared.
the pivot_long ? the join ? the pivot wide?

put simply if there are errors caused by running your code against your true data, but not your example data, then this demonstrates that your example_data doesnt capture sufficiently the finer details of your task.

I expect there are type differences in certain columns.
I would use glimpse() or skimr::skim() to understand that.

HariSeldon · June 11, 2021, 2:05pm

changed cols=-a to cols=-f.eid

> (piv_pat <- pivot_longer(
+   patientdata,
+   cols="f.eid"
+   # cols = starts_with("f.4")
+ ))

OK I got cols="f.eid" working but there are type differences in the columns that I need to fix as you say

I tried this:

patientdata  = mutate_all(patientdata, as.character)
ICD9csv = mutate_all(ICD9csv, as.character)
ICCD10csv  = mutate_all(ICCD10csv, as.character)

To try to fix this:

Error: Can't join on `x$value` x `y$value` because of incompatible types.
ℹ `x$value` is of type <double>>.
ℹ `y$value` is of type <character>>.
Run `rlang::last_error()` to see where the error occurred.
> (piv_pat_rewide <- pivot_wider(piv_pat_jn,
+                                id_cols = "f.eid",
+                                names_from = "name",
+                                values_from = "phecode"
+ ))
Error in pivot_wider(piv_pat_jn, id_cols = "f.eid", names_from = "name",  : 
  object 'piv_pat_jn' not found
>

And it is currently running this might work

nirgrahamuk · June 11, 2021, 2:30pm

If the columns to pivot are all the columns except f.eid so f.eid should be preceeded by a minus symbol, as in my example where I wanted to pivot all columns except a so I used cols = -a
otherwise replace cols -whatever with a positive choice of wide columns to make longer, but I'd prefer to preceed by selecting them so extraneously wide columns are not retained.

HariSeldon · June 11, 2021, 2:44pm

Thanks! Is there a way to avoid having to do mutate_all to avoid the error below?

patientdata = mutate_all(patientdata, as.character)
ICD9csv = mutate_all(ICD9csv, as.character)
ICCD10csv = mutate_all(ICCD10csv, as.character)

Without doing the above and turning everything to char this is the error I get:

Error: Can't combine column <character> and column <double>.

nirgrahamuk · June 11, 2021, 2:53pm

If mutate_all solves it then I wouldn't look for alternative solutions.

HariSeldon · June 11, 2021, 2:58pm

I run out of memory on my laptop before finishing (no errors but does not finish) and the larger server is down so I cannot further test I will do another reply here if I have another issue, thanks so much!

HariSeldon · June 11, 2021, 5:03pm

I got the server working and got this error:

Error: Internal error in `compact_rep()`: Negative `n` in `compact_rep()`.
Backtrace:
    █
 1. ├─tidyr::pivot_wider(piv_pat_jn, names_from = "name", values_from = "phecode")
 2. └─tidyr:::pivot_wider.data.frame(...)
 3.   └─tidyr::pivot_wider_spec(...)
 4.     └─vctrs::vec_init(val, nrow * ncol)
 5.       └─(function () ...
Warning message:
In nrow * ncol : NAs produced by integer overflow
Execution halted

There are a lot of NA in patientdata because some patients don't even have any ICD codes so they are all NA and some patients have some ICD codes but still have leftover NAs in that row.

HariSeldon · June 11, 2021, 9:59pm

Ok I got the code to run without errors here is the code it appears that the integer overflow is because I need more than the standard 1GB of RAM.

system · June 21, 2021, 3:09pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.