Fixing Broken Record in Text File - Embedded Carriage Return

andrewjmdata · September 14, 2020, 10:23am

Hi, I have a csv file which contains survey data where some fields have free form text. The survey is online and the data is kept in our CRM system. I have no control over the CRM or database in it.

The issue is where people enter multiple lines and press enter/return between each line. The csv we extract from the CRM when viewed in a text editor or excel looks like it has a new record at these points.

Is there an R technique that could identify that the record is missing the fields at the end of the record, go to the line below, get rid of the CR/LF and join the spurious record back onto the end of it?

Many thanks

Andrew

pieterjanvc · September 14, 2020, 3:43pm

Hi,

Could you show us an extract of the raw csv file where this issue happens? I think it should be an easy fix once I have an idea on how it's formatted now.

PJ

andrewjmdata · September 15, 2020, 9:22pm

Hi, Sorry for the delay in replying. Here is the data in csv format...

var1,var2,var3,var4,var5,var6
01/02/2020,var2_1,var3_1,var4_1,var5_1,var6_1
02/02/2020,var2_2,var3_2,var4_2,var5_2,var6_2
03/02/2020,var2_3,var3_3,var4_3,var5_3,var6_3
04/02/2020,var2_4,var3_4,,,
var5_1,var5_2,,,,
var6_4,,,,,
05/02/2020,var2_5,var3_5,var4_5,var5_5,var6_5
06/02/2020,var2_6,var3_6,var4_6,var5_6,var6_6
07/02/2020,var2_7,var3_7,var4_7,,var6_7
var8_1,,,,,
08/02/2020,var2_8,var3_8,var4_8,var5_8,var6_8
09/02/2020,var2_9,var3_9,var4_9,var5_9,var6_9

nirgrahamuk · September 16, 2020, 7:55am

This looks extremely odd, because each row has 5 commas, ...
normally csv exports would wrap strings with linebreaks and quotes of a one type with a higher order quote mark. but I suppose thats why you emphasised your lack of control of the CRM etc. ...
Also 3rd line from the bottom has var8_1 which implies there are 8 vars rather than maximum 6... ?

Do you have any guarantees about the data, i.e. that all 6 variables must be populated ? (non blank?)

andrewjmdata · September 16, 2020, 9:26am

Apologies, This was just test data that I created to try and solve problem in VBA (kind of worked). It should look like this. I have also accounted for the 3rd line from bottom.

var1,var2,var3,var4,var5,var6
01/02/2020,var2_1,var3_1,var4_1,var5_1,var6_1
02/02/2020,var2_2,var3_2,var4_2,var5_2,var6_2
03/02/2020,var2_3,var3_3,var4_3,var5_3,var6_3
04/02/2020,var2_4,var3_4
var5_1,var5_2
var6_4
05/02/2020,var2_5,var3_5,var4_5,var5_5,var6_5
06/02/2020,var2_6,var3_6,var4_6,var5_6,var6_6
07/02/2020,var2_7,var3_7,var4_7,var6_7
var8_1
08/02/2020,var2_8,var3_8,var4_8,var5_8,var6_8
09/02/2020,var2_9,var3_9,var4_9,var5_9,var6_9

nirgrahamuk · September 16, 2020, 9:45am

my attempt

library(tidyverse)
broke_csv <- "var1,var2,var3,var4,var5,var6
01/02/2020,var2_1,var3_1,var4_1,var5_1,var6_1
02/02/2020,var2_2,var3_2,var4_2,var5_2,var6_2
03/02/2020,var2_3,var3_3,var4_3,var5_3,var6_3
04/02/2020,var2_4,var3_4
var5_1,var5_2
var6_4
05/02/2020,var2_5,var3_5,var4_5,var5_5,var6_5
06/02/2020,var2_6,var3_6,var4_6,var5_6,var6_6
07/02/2020,var2_7,var3_7,var4_7,var6_7
var8_1
08/02/2020,var2_8,var3_8,var4_8,var5_8,var6_8
09/02/2020,var2_9,var3_9,var4_9,var5_9,var6_9"

(broke_frame <- str_split_fixed(broke_csv, "\n", n = Inf) %>% as.character() %>% enframe())

(bf_2 <- mutate(broke_frame,
  count = str_count(value, pattern = ","),
  complete = count == 5
))
(completes <- bf_2 %>% filter(complete))
(partials <- bf_2 %>% filter(!complete) %>% 
    mutate(lname = name - ifelse(is.na(lag(name)),
                                 name-1,
                                 lag(name)) != 1,
           groupcounter = cumsum(lname)))

(partials_combine <- partials %>% 
    group_by(groupcounter) %>% 
    summarise(value=paste0(value,collapse=",")))

(final_as_text <- c(pull(completes,value),
                    pull(partials_combine,value)))

# virtual read as csv

example_df <- function(intext) {
tf <- tempfile()
writeLines(intext, con = tf)
require(tidyverse)
as_tibble(read.csv(tf))
}

(df <- example_df(final_as_text) %>% arrange(var1))

andrewjmdata · September 16, 2020, 10:28am

Looks like it works! Many thanks. I was only expected someone to say you can do that using such and such a library/method.

Regards

Andrew

pieterjanvc · September 16, 2020, 11:12am

Hi,

I can see @nirgrahamuk beat me to it, but since I had a solution as well I shall share it anyway

I took a regex approach so it's a bit different, but the results should be the same.

library(stringr)
#Get the raw data by line
oldData = readLines("data.csv")
#Extract first row was column names (and count)
cols = str_split(oldData[1], ",")[[1]]
#Collapse the rest into one string
oldData = paste(oldData[-1], collapse = ",")

#Extract all fields 
oldData = str_match_all(
  oldData, 
  "(,|\\r?\\n|^)([^\",\\r\\n]+|\"(?:[^\"]|\"\")*\")?"
  )
#Remove quotes at start or end of string
oldData = str_remove_all(oldData[[1]][,3], "^\"|\"$")
#Replace "" by \"
oldData = str_replace_all(oldData, "\"\"", "\"")

#Build new dat frame
newData = as.data.frame(matrix(oldData, ncol = length(cols), byrow = T))
colnames(newData) = cols

> newData
        var1   var2   var3   var4   var5   var6
1 01/02/2020 var2_1 var3_1 var4_1 var5_1 var6_1
2 02/02/2020 var2_2 var3_2 var4_2 var5_2 var6_2
3 03/02/2020 var2_3 var3_3 var4_3 var5_3 var6_3
4 04/02/2020 var2_4 var3_4 var5_1 var5_2 var6_4
5 05/02/2020 var2_5 var3_5 var4_5 var5_5 var6_5
6 06/02/2020 var2_6 var3_6 var4_6 var5_6 var6_6
7 07/02/2020 var2_7 var3_7 var4_7 var6_7 var8_1
8 08/02/2020 var2_8 var3_8 var4_8 var5_8 var6_8
9 09/02/2020 var2_9 var3_9 var4_9 var5_9 var6_9

PJ

nirgrahamuk · September 16, 2020, 11:13am

The regex is strong with you

system · October 7, 2020, 11:13am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.