concatenating multiple lines based on regex group - help please

hello, I have a unique issue.
What i am having issues with is concatenating different lines based on a regex group pattern. Medical text is extremely messy compared to other text analyses I have done.

For context: I have created a fake data set shown below that mimics the formatting of the text analysis for medical records that i am conducting, i cannot share the actual data as per privacy reasons. I have identified the regex for the note id, patient id, note title, date. I have also managed to code a grouping solution that identifies which text belongs to what group.

At this point ive run out of ideas to be able to concatenate different lines of code based on an identified regex group, so if anyone that can help i would sincerely appreciate it, I'm up to my limits in R coding knowledge.

text <- c("1\t8\t2\t5012312343\tphysician note\tdude is not doing so well, having issues smelling. will recommmend ct scan.",
          "plan will be to call back patient 1 week after scan.",
          "\t2019-09-25",
          "2\t8\t2\t5053245235\temergency visit\tkid in his 20s came in having a high fever.  prescribed antibiotics.",
          "recommended to come see after 3 days.",
          "\t2019-04-10")

enc_id <- "([0-9]{1}\\t)"
wholetext_id <- cumsum(str_detect(text, enc_id))

I appreciate any help in advance. Take care and stay safe.

  • Corgi coder

This is more of a vector problem than a repex one.

Every R problem can be thought of with advantage as the interaction of three objects: an existing object, x , a desired object, y, and a function, f, that will return a value of y given x as an argument. Any or all of these may be composed from other objects.

What is at hand, text is x. It is of class character arranged in a vector of

> length(text)
[1] 6

elements.

What is desired, y, is a tibble or data frame that contains one row for each record embedded in text and one column for each variable embedded in each row.

By inspection of text and relying on the assumption that the actual data follows the same pattern, the first three elements

text[1:3,]

comprise a record, as do

text[4:6]

So, the first task is to isolate each group of three for further processing.

For each group (called a chunk in the sample below), we need to detangle the variables. Again, relying on inspection, each chunk is comprised of nine, separated by tab characters. Splitting the chunk on \t (discarding \t in the process) provides a single row.

It only remains to get that row into a receiver data frame, which is done here with a for loop.

# LIBRARIES
		suppressPackageStartupMessages({
		  library(dplyr)
		  library(stringr)
		})

		# CONSTANTS

		header <- c("row_id","patient_id","unknown1","note_id",
		            "note_type","note","plan","unknown2","dated")

		record_lines <- 3

		# FUNCTIONS

		# https://tomaztsql.wordpress.com/2020/10/20/little-useless-useful-r-function-dataframe-maker/
		DataFrameMaker  <- function(row,col){
		  dd <- matrix(nrow = row, ncol = col)
		  for (i in 1:row) {
		    for (j in 1:col) {
		      dd[i, j] = (j*i)
		    }
		  }
		  return(as.data.frame(dd))
		}

		dfm <- DataFrameMaker
		dfm(1,9) -> stub
		colnames(stub) <- header
		stub[0,] -> stub

		chunk <- function(x, y) split(seq_along(x), ceiling(seq_along(x)/y))

		detangle <- function(x) {
		  str_split(x,"\\t") %>% unlist()
		}

		mk_record <- function(x) {
		  c(row_id = x[1], patient_id = x[2], unknown1 = x[3],
		    note_id = x[4], note_type = x[5], note = x[6], 
		    plan = x[7], unknown2 = x[8], date = x[9])
		}

		# DATA

		text <- c(
		  "1\t8\t2\t5012312343\tphysician note\tdude is not doing so well, having issues smelling. will recommmend ct scan.",
		  "plan will be to call back patient 1 week after scan.",
		  "\t2019-09-25",
		  "2\t8\t2\t5053245235\temergency visit\tkid in his 20s came in having a high fever.  prescribed antibiotics.",
		  "recommended to come see after 3 days.",
		  "\t2019-04-10"
		)

		# MAIN

		# split the character vector into chunks of three
		# (because there are six elements in text, representing
		# two records)

		chunk(text,record_lines) -> chunks

		# show one
		(chunk1 <- text[chunks[[1]]])

		# split on \t and reassemble

		pander::pander(detangle(chunk1 %>% mk_record())

		# create receiver object

		dfm <- DataFrameMaker
		dfm(1,9) -> stub
		stub[0,] -> stub

		# loop through the chunks, process and
		# row bind to reeiver object

		for (i in chunk(text,record_lines)){
		  text[i] %>% 
		  detangle() %>% 
		  mk_record() %>%
		  rbind(stub) -> stub
		  } 
		 
		colnames(stub) <- header

		pander::pander(stub)
1 Like

Hi @technocrat, thanks a lot for replying, your methodology helped. Some progress has been made but some new challenges came up.

Speaking with some colleagues before I sent this post, the original goal was to create a tibble/dataframe as you mentioned, with the idea of having 1 note id per row, and the other variables as columns. The methodology you presented is similar to what we originally conceptualized. The absolute end goal will be to utilize this dataframe/tibble as a means to conduct text analysis along side structured data variables.

The several issues (and absolutely my mistake) was that - in my example text data that i tried to mimic - was the assumption of equal lines per note. When in reality doctors notes have a large variation in the amount they type. Furthermore, the issue with splitting the string with a tab delimiter ("\t") could possibly include doctors notes when they use tab a means for spacing instead of space, or carriage returns.

As a result this is what we have come up with, although my colleagues and I lack knowledge in for loops and the creation of functions, this is hopefully where I can get some help and further feedback from my code. I created some new fake data:

library(tidyverse)
library(here)

# version 2, mimics actual data as much as possible.
faketext <- c("note_id\tnetwork_id\tsite_id\tpatient_id\tnotetitle\tnote\tdate",
          "1\t8\t2\t5012312343\tphysician note\tdude is not doing so well, having issues smelling. will recommmend ct scan.",
          "plan will be to call back patient 1 week after scan.",
          "\t2019-09-25",
          "2\t8\t2\t5053245235\temergency visit\tkid in his 20s came in having a high fever.  prescribed antibiotics.",
          "recommended to come see after 3 days.",
          "\t2019-04-10",
          "\t2019-04-10",                                                                                                                 
          "3\t8\t2\t5053345677\tclinical note\tadult in 40s, bmi over 30, multiple co-morbidities.  recommended 30 minutes of walking daily,",
          "reduction of calorie intake. along with advise to ween off the number of cigarettes daily. ",                                 
          "will be tracking progress once a month.",                                                                                     
          "recommended to come see after 30 days.",                                                                                      
          "-md",                                                                                                                         
          "\t2019-07-20",                                                                                                                 
          "4\t8\t2\t5053784567\tclinical note\tvariation in text.",                                                                           
          "fakeone",                                                                                                                     
          "faketwo",                                                                                                                     
          "fakethree",                                                                                                                   
          "fakefour",                                                                                                                    
          "fakefive",                                                                                                                    
          "fake6",                                                                                                                       
          "\t2019-07-20"           
          )

# Regex for identifying the Note ID
note_id <- "([0-9]{1}\\t)"

# Cumulative Sum of all unique note Ids, this also indirectly counts the number of lines per note id
wholetext_id <- cumsum(str_detect(faketext, note_id))
wholetext_id

# storing cumsum in a tibble to count the ids
counting_ids <- wholetext_id %>%
tibble(
lines_per_id = wholetext_id)

# grouped ids to see the amount of lines per id (chunk)
grouped_ids <- group_by(counting_ids, lines_per_id) %>%
summarize(
count_lines = n())

# storing the record lines as a vector
record_lines <- grouped_ids$count_lines

# list for record ids and all text corresponding to record id
list_id_text <- bind_cols(record_id = wholetext_id, textdata = faketext) %>%
group_split(record_id)

# # this might be a for loop or function
# splitting by pattern of regex from all lists in list_id_text
storing_as_object <- str_split(list_id_text[[n]], note_id) %>%
unlist()

# storing results in a tibble
# this might be a for loop or function
df_final <- tibble(
text = storing_as_object[n]

The only issues left over is when i tried this method for one of the note fields my regex patterns completely broke (except for patient id) on my real dataset. I followed the spirit of your methodology, so i want to extend a great thank you. Again, my weakness is still applying for loops and functions, I just do not have enough practice with these to utilize them correctly and when.

Looking forward to hearing back.

Take care and stay safe,

  • corgi coder
1 Like

Hi,

I was pretty sure that the 3/line would be problematic for the record problem unless notes were captured in an app that enforced it through the way in which it captured and exported. That problem is solved, although I suggest simplifying it in the reprex below.

For the field problem, I've assumed the first three are numeric.

suppressPackageStartupMessages({
  library(purrr)
  library(stringr)
})

# Regex for identifying the Note ID
note_id <- "([0-9]{1}\\t)"

####################################
# took out first element of faketext
####################################

faketext <- c(
  "1\t8\t2\t5012312343\tphysician note\tdude is not doing so well, having issues smelling. will recommmend ct scan.",
  "plan will be to call back patient 1 week after scan.",
  "\t2019-09-25",
  "2\t8\t2\t5053245235\temergency visit\tkid in his 20s came in having a high fever.  prescribed antibiotics.",
  "recommended to come see after 3 days.",
  "\t2019-04-10",
  "\t2019-04-10",
  "3\t8\t2\t5053345677\tclinical note\tadult in 40s, bmi over 30, multiple co-morbidities.  recommended 30 minutes of walking daily,",
  "reduction of calorie intake. along with advise to ween off the number of cigarettes daily. ",
  "will be tracking progress once a month.",
  "recommended to come see after 30 days.",
  "-md",
  "\t2019-07-20",
  "4\t8\t2\t5053784567\tclinical note\tvariation in text.",
  "fakeone",
  "faketwo",
  "fakethree",
  "fakefour",
  "fakefive",
  "fake6",
  "\t2019-07-20"
)

count_records <- function(x) {
  a <- cumsum(str_detect(x, note_id))
  b <- rle(a)$lengths
  d <- cumsum(b)
  e <- d-b+1
  return(seq(1:length(e)))
}

prep_records <- function(i) {
    a <- cumsum(str_detect(faketext, note_id))
    b <- rle(a)$lengths
    d <- cumsum(b)
    e <- d-b+1
    r <- seq(1:length(e))
    faketext[e[i]:d[i]]
}

mk_records <- function(x) {
  count_records(x) %>% map(prep_records) 
}

mk_records(faketext)
#> [[1]]
#> [1] "1\t8\t2\t5012312343\tphysician note\tdude is not doing so well, having issues smelling. will recommmend ct scan."
#> [2] "plan will be to call back patient 1 week after scan."                                                            
#> [3] "\t2019-09-25"                                                                                                    
#> 
#> [[2]]
#> [1] "2\t8\t2\t5053245235\temergency visit\tkid in his 20s came in having a high fever.  prescribed antibiotics."
#> [2] "recommended to come see after 3 days."                                                                     
#> [3] "\t2019-04-10"                                                                                              
#> [4] "\t2019-04-10"                                                                                              
#> 
#> [[3]]
#> [1] "3\t8\t2\t5053345677\tclinical note\tadult in 40s, bmi over 30, multiple co-morbidities.  recommended 30 minutes of walking daily,"
#> [2] "reduction of calorie intake. along with advise to ween off the number of cigarettes daily. "                                      
#> [3] "will be tracking progress once a month."                                                                                          
#> [4] "recommended to come see after 30 days."                                                                                           
#> [5] "-md"                                                                                                                              
#> [6] "\t2019-07-20"                                                                                                                     
#> 
#> [[4]]
#> [1] "4\t8\t2\t5053784567\tclinical note\tvariation in text."
#> [2] "fakeone"                                               
#> [3] "faketwo"                                               
#> [4] "fakethree"                                             
#> [5] "fakefour"                                              
#> [6] "fakefive"                                              
#> [7] "fake6"                                                 
#> [8] "\t2019-07-20"

Created on 2020-12-12 by the reprex package (v0.3.0.9001)

Part two parses the records created, ending up with results containing the numeric fields and date character string (use lubridate::ymd("2020-12-12") to convert to date object) with a combined text character string in the middle.

suppressPackageStartupMessages({
  library(purrr)
  library(stringr)
})

# Regex for identifying the Note ID
note_id <- "([0-9]{1}\\t)"

####################################
# took out first element of faketext
####################################

faketext <- c(
  "1\t8\t2\t5012312343\tphysician note\tdude is not doing so well, having issues smelling. will recommmend ct scan.",
  "plan will be to call back patient 1 week after scan.",
  "\t2019-09-25",
  "2\t8\t2\t5053245235\temergency visit\tkid in his 20s came in having a high fever.  prescribed antibiotics.",
  "recommended to come see after 3 days.",
  "\t2019-04-10",
  "\t2019-04-10",
  "3\t8\t2\t5053345677\tclinical note\tadult in 40s, bmi over 30, multiple co-morbidities.  recommended 30 minutes of walking daily,",
  "reduction of calorie intake. along with advise to ween off the number of cigarettes daily. ",
  "will be tracking progress once a month.",
  "recommended to come see after 30 days.",
  "-md",
  "\t2019-07-20",
  "4\t8\t2\t5053784567\tclinical note\tvariation in text.",
  "fakeone",
  "faketwo",
  "fakethree",
  "fakefour",
  "fakefive",
  "fake6",
  "\t2019-07-20"
)

count_records <- function(x) {
  a <- cumsum(str_detect(x, note_id))
  b <- rle(a)$lengths
  d <- cumsum(b)
  e <- d-b+1
  return(seq(1:length(e)))
}


mk_entry <- function(x) {
  c(row_id = x[1], 
    patient_id = x[2], 
    unknown = x[3],
    note_id = x[4],
    note = x[5], 
    date = x[6])
}

prep_records <- function(i) {
    a <- cumsum(str_detect(faketext, note_id))
    b <- rle(a)$lengths
    d <- cumsum(b)
    e <- d-b+1
    r <- seq(1:length(e))
    faketext[e[i]:d[i]]
}

mk_records <- function(x) {
  count_records(x) %>% map(prep_records) 
}


mk_record <- function(x) {
  c(row_id = x[1], patient_id = x[2], note_id = x[3],
    note = x[4], date = x[5])
}
mk_records(faketext) -> records

records
#> [[1]]
#> [1] "1\t8\t2\t5012312343\tphysician note\tdude is not doing so well, having issues smelling. will recommmend ct scan."
#> [2] "plan will be to call back patient 1 week after scan."                                                            
#> [3] "\t2019-09-25"                                                                                                    
#> 
#> [[2]]
#> [1] "2\t8\t2\t5053245235\temergency visit\tkid in his 20s came in having a high fever.  prescribed antibiotics."
#> [2] "recommended to come see after 3 days."                                                                     
#> [3] "\t2019-04-10"                                                                                              
#> [4] "\t2019-04-10"                                                                                              
#> 
#> [[3]]
#> [1] "3\t8\t2\t5053345677\tclinical note\tadult in 40s, bmi over 30, multiple co-morbidities.  recommended 30 minutes of walking daily,"
#> [2] "reduction of calorie intake. along with advise to ween off the number of cigarettes daily. "                                      
#> [3] "will be tracking progress once a month."                                                                                          
#> [4] "recommended to come see after 30 days."                                                                                           
#> [5] "-md"                                                                                                                              
#> [6] "\t2019-07-20"                                                                                                                     
#> 
#> [[4]]
#> [1] "4\t8\t2\t5053784567\tclinical note\tvariation in text."
#> [2] "fakeone"                                               
#> [3] "faketwo"                                               
#> [4] "fakethree"                                             
#> [5] "fakefour"                                              
#> [6] "fakefive"                                              
#> [7] "fake6"                                                 
#> [8] "\t2019-07-20"

# example with the longest record
last_record <- records[[length(records)]]
tab <- "\\t"
the_date <- str_remove(last_record[length(last_record)],tab)
first_fields <- last_record[1]
numerics <- str_split(first_fields,tab)[[1]][1:4]
uppertext <- str_split(first_fields,tab)[[1]][5:length(str_split(first_fields,tab)[[1]])]
lowertext <- last_record[2:(length(last_record)-1)]
texts <- paste(c(uppertext,lowertext), collapse = " ")

result <- c(numerics,texts,the_date)

pander::pander(mk_entry(result))
Table continues below
row_id patient_id unknown note_id note
4 8 2 5053784567 clinical note variation in text. fakeone faketwo fakethree fakefour fakefive fake6

Table continues below

date
2019-07-20

Created on 2020-12-12 by the reprex package (v0.3.0.9001)

Something like that?

library(data.table)
library(stringr)

# version 2, mimics actual data as much as possible.
l <- c("note_id\tnetwork_id\tsite_id\tpatient_id\tnotetitle\tnote\tdate",
			 "1\t8\t2\t5012312343\tphysician note\tdude is not doing so well, having issues smelling. will recommmend ct scan.",
			 "plan will be to call back patient 1 week after scan.",
			 "\t2019-09-25",
			 "2\t8\t2\t5053245235\temergency visit\tkid in his 20s came in having a high fever.  prescribed antibiotics.",
			 "recommended to come see after 3 days.",
			 "\t2019-04-10",
			 "\t2019-04-10",                                                                                                                 
			 "3\t8\t2\t5053345677\tclinical note\tadult in 40s, bmi over 30, multiple co-morbidities.  recommended 30 minutes of walking daily,",
			 "reduction of calorie intake. along with advise to ween off the number of cigarettes daily. ",                                 
			 "will be tracking progress once a month.",                                                                                     
			 "recommended to come see after 30 days.",                                                                                      
			 "-md",                                                                                                                         
			 "\t2019-07-20",                                                                                                                 
			 "4\t8\t2\t5053784567\tclinical note\tvariation in text.",                                                                           
			 "fakeone",                                                                                                                     
			 "faketwo",                                                                                                                     
			 "fakethree 2019-07-20",                                                                                                                   
			 "fakefour",                                                                                                                    
			 "fakefive",                                                                                                                    
			 "fake6",                                                                                                                       
			 "\t2019-07-20"           
)

# Regex for identifying the Note ID
# use ^ to specify line has to start with a number
# + for 1 or more, {#} for a specific number
# Use \\s instead of \\t to specify any space character
n_id <- "^([0-9]{1})\\s"
n_id <- "^([0-9]+)\\s"

# here we get the line number where the note starts
n_str <- str_which(l, n_id)
# and ends
n_end <- c(tail(n_str, -1) - 1, length(l)) 
# here we find where the note are, assuming everyline belongs to a note
n_idx <- mapply(`:`, n_str, n_end)

# nows get those notes sorted out
n <- lapply(n_idx, function(x) paste(l[x], collapse = " "))

# We can start building a tibble from there
t <- data.table(id = as.numeric(str_match(n, n_id)[,2]))

# I lied, it's a data table, it prints in list vector and supports assign in place

# template function to add a column to our tibble using a custom regex
# set(t, colname, lapply(str_match_all(n, pattern), function(x) x[,group])

date_pattern <- "([12]\\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01]))"
set(t,
		j = "dates",
		value = lapply(str_match_all(n, date_pattern), function(x) x[,1])
)


t

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.