Replace most non-header fields in a TSV file based on a TSV conversion table

I am trying to convert all ICD codes in a tab separated file 002.txt to Phecodes (based on ICD-Phecode conversion tables tab separated files phecode_icd9_rolled.csv and Phecode_map_v1_2_icd10cm_beta.csv) for a biology bioinformatics project.

002.txt named patientdata anonymized table I need to replace all ICD codes with Phecodes but skip column names and first column row names

1   2   3   4   5   6   7   8   9
patient1   K40.9   K43.9   N20.0   N20.1   N23 N39.0   R69 Z88.1
patient2   B96.8   D12.6   E11.6   E87.6   I44.7   K40.9   K43.9   K52.9
patient3   NOT

Phecode_map_v1_2_icd10cm_beta.csv named ICCD10csv conversion table ICD10 to phecode

icd10cm phecode
K40.9   550.1
K43.9   550.5
N20.0   594.1
N20.1   594.3
N23 594.8
N39.0   591
R69 1019
Z88.1   960.1
B96.8   041
D12.6   208
E11.6   250.2
E87.6   276.14
I44.7   426.32
K40.9   550.1
K43.9   550.5
K52.9   558
XNO    17

patientdata <- read_tsv("002.txt")
ICD9csv <- read_csv("phecode_icd9_rolled.csv")
ICCD10csv <- read_csv("Phecode_map_v1_2_icd10cm_beta.csv")

This link explains the various joins that tidyverse can do, and the closest to what I want is I think semi_join()
Not sure if this is right

patientdata %>% 
  semi_join(ICCD10csv, by = "icd10cm")

I am also trying out merge() but am getting error messages

patientdata10 <- merge(patientdata9, ICCD10csv, by.x = all, by.y = "icd10cm")
Error in as.vector(x, mode) :
cannot coerce type 'builtin' to vector of type 'any'

# This is trying to specify by.x as all elements in the first row of patientdata, meaning by.x will merge on all columns of patientdata
patientdata10 <- merge(patientdata9, ICCD10csv, by.x = patientdata9[,1], by.y = "icd10cm")
Error in, x) :
'by' must specify one or more columns as numbers, names or logical

patientdata10 <- merge(patientdata9, ICCD10csv, by.x = patientdata9[1,], by.y = "icd10cm")
Error in, x) :
'by' must specify one or more columns as numbers, names or logical



patientdata <- data.table(
                 1 = c(patient1, patient2, patient3),
                 2 = c(K40.9, B96.8, NOT),
                 3 = c(K43.9, D12.6),
                 4 = c(N20.0, E11.6),
                 5 = c(N20.1, E87.6),
                 6 = c(N23, I44.7),
                 7 = c(N39.0, K40.9),
                 8 = c(R69, K43.9),
                 9 = c(Z88.1, K52.9)

ICCD10csv <- data.table(
  icd10cm = c("K40.9", "K43.9", "N20.0", "N20.1", "N23", "N39.0", "R69", "Z88.1", "B96.8", "D12.6", "E11.6", "E87.6", "I44.7", "K40.9", "K43.9", "K52.9", "XNO"),
  phecode = c("550.1", "550.5", "594.1", "594.3", "594.8", "591", "1019", "960.1", "041", "208", "250.2", "276.14", "426.32", "550.1", "550.5", "558", "17")


patientdata %>% 
  semi_join(ICCD10csv, by = "icd10cm")

patientdata10 <- merge(patientdata, ICCD10csv, by.x = all, by.y = "icd10cm")

# This is trying to specify by.x as all elements in the first row of patientdata, meaning by.x will merge on all columns of patientdata
patientdata10 <- merge(patientdata, ICCD10csv, by.x = patientdata[,1], by.y = "icd10cm")

patientdata10 <- merge(patientdata, ICCD10csv, by.x = patientdata[1,], by.y = "icd10cm")


To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

OK I added a REPREX at the bottom I loaded tidyverse, patientdata, and ICD10csv and the rest of the code after that is failed attempts to replace the patientdata ICD10 codes with phecodes from the ICD10csv

I don't find that your patientdata or ICCD10csv examples are reproducible. I.e. in a fresh R session, I only get errors when I try to have them defined.
I think you perhaps wrote them by hand , rather than using one of the automated methods suggested ? and perhaps didnt test the result ?

Can you please try again ?

Sorry I forgot the quotes around the column elements in data.table and forgot to import data.table.

I got the code working on my R Studio now see the output below

> library(tidyverse)
> library(data.table)
> patientdata = data.table(
+   a = c("patient1", "patient2", "patient3"),
+   b = c("K40.9", "B96.8", "NOT"),
+   c = c("K43.9", "D12.6", ""),
+   d = c("N20.0", "E11.6", ""),
+   e = c("N20.1", "E87.6", ""),
+   f = c("N23", "I44.7", ""),
+   g = c("N39.0", "K40.9", ""),
+   h = c("R69", "K43.9", ""),
+   i = c("Z88.1", "K52.9", "")
+ )
> ICCD10csv <- data.table(
+   icd10cm = c("K40.9", "K43.9", "N20.0", "N20.1", "N23", "N39.0", "R69", "Z88.1", "B96.8", "D12.6", "E11.6", "E87.6", "I44.7", "K40.9", "K43.9", "K52.9", "XNO"),
+   phecode = c("550.1", "550.5", "594.1", "594.3", "594.8", "591", "1019", "960.1", "041", "208", "250.2", "276.14", "426.32", "550.1", "550.5", "558", "17")
+ )

 (piv_pat <- pivot_longer(
 (piv_pat_jn <- left_join(piv_pat,
 (piv_pat_rewide <- pivot_wider(piv_pat_jn,
                                id_cols = "a",
                                names_from = "name",
                                values_from = "phecode"

Sorry I chose not to show you the real column names for privacy reasons, despite them being public information (e.g. f.eid (the only one I show here) showing up in others' public GitHub code createUKBphenome/function.createUKBphenome.r at master · umich-cphds/createUKBphenome · GitHub).

I appreciate you may be frustrated but you just posted a very large post, with very little info of value contained.
So your example data doesnt match your real data by column name 'a' != 'f.eid'
Thats trivial to change, and I assume you have done so ,since you share an error related to a subsequent part of the code . But you havent made it easy to identify what line of code triggered the error you shared.
the pivot_long ? the join ? the pivot wide?

put simply if there are errors caused by running your code against your true data, but not your example data, then this demonstrates that your example_data doesnt capture sufficiently the finer details of your task.

I expect there are type differences in certain columns.
I would use glimpse() or skimr::skim() to understand that.

changed cols=-a to cols=-f.eid

> (piv_pat <- pivot_longer(
+   patientdata,
+   cols="f.eid"
+   # cols = starts_with("f.4")
+ ))

OK I got cols="f.eid" working but there are type differences in the columns that I need to fix as you say

I tried this:

patientdata  = mutate_all(patientdata, as.character)
ICD9csv = mutate_all(ICD9csv, as.character)
ICCD10csv  = mutate_all(ICCD10csv, as.character)

To try to fix this:

Error: Can't join on `x$value` x `y$value` because of incompatible types.
ℹ `x$value` is of type <double>>.
ℹ `y$value` is of type <character>>.
Run `rlang::last_error()` to see where the error occurred.
> (piv_pat_rewide <- pivot_wider(piv_pat_jn,
+                                id_cols = "f.eid",
+                                names_from = "name",
+                                values_from = "phecode"
+ ))
Error in pivot_wider(piv_pat_jn, id_cols = "f.eid", names_from = "name",  : 
  object 'piv_pat_jn' not found

And it is currently running this might work

If the columns to pivot are all the columns except f.eid so f.eid should be preceeded by a minus symbol, as in my example where I wanted to pivot all columns except a so I used cols = -a
otherwise replace cols -whatever with a positive choice of wide columns to make longer, but I'd prefer to preceed by selecting them so extraneously wide columns are not retained.

Thanks! Is there a way to avoid having to do mutate_all to avoid the error below?

patientdata = mutate_all(patientdata, as.character)
ICD9csv = mutate_all(ICD9csv, as.character)
ICCD10csv = mutate_all(ICCD10csv, as.character)

Without doing the above and turning everything to char this is the error I get:

Error: Can't combine column <character> and column <double>.

If mutate_all solves it then I wouldn't look for alternative solutions.

I run out of memory on my laptop before finishing (no errors but does not finish) and the larger server is down so I cannot further test I will do another reply here if I have another issue, thanks so much!

I got the server working and got this error:

Error: Internal error in `compact_rep()`: Negative `n` in `compact_rep()`.
 1. ├─tidyr::pivot_wider(piv_pat_jn, names_from = "name", values_from = "phecode")
 2. └─
 3.   └─tidyr::pivot_wider_spec(...)
 4.     └─vctrs::vec_init(val, nrow * ncol)
 5.       └─(function () ...
Warning message:
In nrow * ncol : NAs produced by integer overflow
Execution halted

There are a lot of NA in patientdata because some patients don't even have any ICD codes so they are all NA and some patients have some ICD codes but still have leftover NAs in that row.

Ok I got the code to run without errors here is the code it appears that the integer overflow is because I need more than the standard 1GB of RAM.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.