CSVs have un separated junk of varying length above what should be the header. How import just headers and data.

Absolute R beginner here. Low level Python experience. Please let me know if I should post this elsewhere.

The software that came with a device I am using to collect light absorbance data exports CSVs that have a bunch of non-comma separated text above the row that should be the header. This text occupies varying numbers of rows depending on the settings of the instrument when I use it.

I want to be able to graph dozens of these CSVs with the same r code by simply changing the filepath to the CSV I am graphing. Currently, I have to adjust the "skip" argument in read.csv and the number of lines I have to skip seems arbitrary, ie it does not correlate with the number of rows I would skip if looking at the data in excel. The header rows are quite aways down and look like this:
"SampleA, SourceA, SampleB, SourceB, Position, AbsA, AbsB, ratioA/B, Fraction Number, Fraction Volume (ml)"

Any suggestions would be greatly appreciated.

My code so far an

r
csvdata <- read.csv("filepath.csv", skip =152)
print(csvdata)
par(mar = c(5,4,4,4) + 0.1)

plot(csvdata$Position, csvdata$AbsA, type = "l", ylim=c(0,0.6), ylab = "UV OD", xlab = "Position (mm)")

par(new = TRUE)

plot(csvdata$Position, csvdata$SampleFluor, type ="l", col="red",lty=1, xaxt = "na", yaxt = "na",
     ylab = "", xlab = "", )
axis(side=4, col="red", col.axis="red")
mtext("Fluor", side=4, col="red", line = 3)

I could not attach a CSV so I pasted in two sets of data below. Links to the files are here:
Data 1
Data 2

Data 1:

FlowCell Text Data File: 200203_UV-UV_759_180min.csv

Software Version: 1.56A

Date: 2020-02-03

Time: 18:07:12

User: ANTJE

Experiment name:

*** CENTRIFUGE SETTINGS ***

Rotor Manufacturer: Beckman

Rotor Model: SW41Ti

Speed: 39 rpm

Gradient: Sucr 10.0% to 40.0% v/v

Time: 2.5 hrs

w2t: x 10^0

Number of Tubes: 2

Sample Volume: 0.1 ml

Temperature: 4.0 deg C

*** FRACTIONATION SETTINGS ***

Volume displaced/mm: 0.143 ml/mm

Tubing length: 290 mm

Total Dead Volume (= last fraction): 0.299 ml

Scan Speed: 0.20 mm/sec

Total Distance: 78.3 mm

Upper Limit to Slow-down Distance: 43.90 mm

Start mode: Slow-down distance

Fail-safe Distance (M): 0 mm

Number of fractions: 15

Distance/fraction: 5.22 mm

Start Distance: 0.00 mm

Volume/fraction: 0.744 ml

Data Samples/mm: 10.00

*** SCANNING SETTINGS ***

Scanning mode: DUAL UV OD

Channel A (LED1) Wavelength: 260 nm

Channel B (LED2) Wavelength: 280 nm

Sensitivity: 0

Integration time: 500 ms

Averaging: 1 samples

LED1 (260 nm) On Time: 32 ms

PD2 (280 nm) On Time: 18 ms

Sample A Zero: 854362

Source A Zero: 355460.9

Sample B Zero: 887276.7

Source B Zero: 303166.5

Data Columns:

SampleA, SourceA, SampleB, SourceB, Position, AbsA, AbsB, ratioA/B, Fraction Number, Fraction Volume (ml)

681878, 355923, 706993, 303465, 0.00, 0.510406, 0.396272, 1.288020, ,

682769, 355789, 709765, 303439, 0.14, 0.506620, 0.389326, 1.301276, ,

682893, 355889, 720956, 303398, 0.24, 0.506844, 0.361915, 1.400450, ,

688963, 355972, 731055, 303379, 0.33, 0.487454, 0.337642, 1.443700, ,

689440, 355957, 731629, 303436, 0.43, 0.485801, 0.336605, 1.443239, ,

Data 2:

FlowCell Text Data File: 200203_UV-VIS_MRE_180min.csv

Software Version: 1.56A

Date: 2020-02-03

Time: 17:16:44

User: ANTJE

Experiment name: 200203_UV-VIS_UV-UV_test

*** CENTRIFUGE SETTINGS ***

Rotor Manufacturer: Beckman

Rotor Model: SW41Ti

Speed: 39 rpm

Gradient: Sucr 10.0% to 40.0% v/v

Time: 2.5 hr

w2t: x 10^0

Number of Tubes: 2

Sample Volume: 0.1 ml

Temperature: 4.0 deg C

*** FRACTIONATION SETTINGS ***

Volume displaced/mm: 0.143 ml/mm

Tubing length: 290 mm

Total Dead Volume (= last fraction): 0.299 ml

Scan Speed: 0.20 mm/sec

Total Distance: 78.3 mm

Upper Limit to Slow-down Distance: 43.90 mm

Start mode: Slow-down distance

Fail-safe Distance (M): 0 mm

Number of fractions: 15

Distance/fraction: 5.22 mm

Start Distance: 0.00 mm

Volume/fraction: 0.744 ml

Data Samples/mm: 10.00

*** SCANNING SETTINGS ***

Scanning mode: UV OD WITH SINGLE FLUORESCENCE SCAN

Channel A (LED1) Wavelength: 260 nm

Channel B (LED2) Fluor: EGFP

Excitation (LED2) Wavelength: 488 nm

Excitation filter A: 470/40 nm

Emission filter A: 525/50 nm

Sensitivity: 0

Integration time: 500 ms

Averaging: 1 samples

LED1 (260 nm) On Time: 16 ms

PD2 (488 nm) On Time: 500 ms

Sample A Zero: 839211

Source A Zero: 216083.2

Fluorescence Zero: 3460

Data Columns:

SampleA, SourceA, SampleFluor, Position, AbsA, Fraction Number, Fraction Volume (ml)

699792, 216280, -2644, 0.00, 0.410904, ,

657176, 216202, -2190, 0.12, 0.551490, ,

649647, 216211, -1364, 0.22, 0.577515, ,

633737, 216198, -1419, 0.32, 0.633180, ,

621648, 216260, -1763, 0.42, 0.677168, ,

621802, 216258, -1696, 0.52, 0.676590, ,

621645, 216203, -1664, 0.62, 0.676586, ,

622337, 216267, -1632, 0.72, 0.674748, ,

622471, 216248, -1625, 0.82, 0.674066, ,

622674, 216363, -1599, 0.92, 0.674529, ,

621880, 216268, 485, 1.02, 0.676412, ,

199233, 216276, 2905, 1.12, 3.238103, ,

610639, 216212, -2590, 1.22, 0.716880, ,

664004, 216213, -2504, 1.32, 0.528344, ,

664473, 216158, -2532, 1.42, 0.526182, ,

668813, 216170, -2543, 1.52, 0.511656, ,

There seems to be a pattern that a row preceding the header has the content "Data Columns:"
So you would benefit by splitting your files on this string to get a pure CSV

You can do something like this

library(tidyverse)

raw_data <- read_lines("sample.csv", skip_empty_rows = TRUE)
skip_n = str_which(raw_data, "Data Columns:")
clean_data <- read_csv(raw_data, skip = skip_n)
2 Likes

That makes sense. This would happen outside of R? Can you suggest a way to automate this?

Thanks for your answer. This is what I was hoping for. For some reason, the str_which() is returning the wrong number so nothing is working yet. Any suggestions? I show what I have done below.

Thanks again.

I installed the tidyverse package and then ran:

library(tidyverse)

csvname <- "200131_759_180min.csv"
raw_data <- read_lines(csvname, skip_empty_rows = TRUE)
skip_n = str_which(raw_data, "Data Columns:")
csvdata <- read_csv(raw_data, skip = skip_n)
print(csvdata)
par(mar = c(5,4,4,4) + 0.1)

plot(csvdata$Position, csvdata$AbsA, type = "l", ylim=c(0,0.6), ylab = "UV OD", xlab = "Position (mm)")

par(new = TRUE)

plot(csvdata$Position, csvdata$SampleFluor, type ="l", col="red",lty=1, xaxt = "na", yaxt = "na",
     ylab = "", xlab = "", )
axis(side=4, col="red", col.axis="red")
mtext("Fluor", side=4, col="red", line = 3)

The result I got was:

library(tidyverse)

csvname <- "200131_759_180min.csv"
raw_data <- read_lines(csvname, skip_empty_rows = TRUE)
skip_n = str_which(raw_data, "Data Columns:")
csvdata <- read_csv(raw_data, skip = skip_n)
Warning: 776 parsing failures.
row col expected actual file
23 -- 1 columns 7 columns literal data
24 -- 1 columns 7 columns literal data
25 -- 1 columns 7 columns literal data
26 -- 1 columns 7 columns literal data
27 -- 1 columns 7 columns literal data
... ... ......... ......... ............
See problems(...) for more details.

print(csvdata)

A tibble: 798 x 1

Start mode: Slow-down distance

1 Fail-safe Distance (M): 0 mm
2 Number of fractions: 15
3 Distance/fraction: 5.22 mm
4 Start Distance: 0.00 mm
5 Volume/fraction: 0.744 ml
6 Data Samples/mm: 10.00
7 *** SCANNING SETTINGS ***
8 Scanning mode: UV OD WITH SINGLE FLUORESCENCE SCAN
9 Channel A (LED1) Wavelength: 260 nm
10 Channel B (LED2) Fluor: EGFP

… with 788 more rows

par(mar = c(5,4,4,4) + 0.1)

plot(csvdata$Position, csvdata$AbsA, type = "l", ylim=c(0,0.6), ylab = "UV OD", xlab = "Position (mm)")
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: Unknown or uninitialised column: 'Position'.
2: Unknown or uninitialised column: 'AbsA'.
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf

par(new = TRUE)

plot(csvdata$Position, csvdata$SampleFluor, type ="l", col="red",lty=1, xaxt = "na", yaxt = "na",

  •  ylab = "", xlab = "", )
    

Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: Unknown or uninitialised column: 'Position'.
2: Unknown or uninitialised column: 'SampleFluor'.
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf
6: In max(x) : no non-missing arguments to max; returning -Inf

axis(side=4, col="red", col.axis="red")
mtext("Fluor", side=4, col="red", line = 3)

After your read_csv, If there are parsing problems, a warning tells you how many, and you can retrieve the details with problems() .

I guess when there are empty rows, that are skipped (removed) it doesn't fit to the the dataset imported later?
So maybe it works when you either do not have the " skip_empty_rows = TRUE" in the read_lines or have it in both.

That was it! Thanks everybody.

Here is my working code for future reference

library(tidyverse)

csvname <- "200131_759_180min.csv"
raw_data <- read_lines(csvname)
skip_n = str_which(raw_data, "Data Columns:")
print(skip_n)

#skip_n = str_which(raw_data, "Data Columns")


csvdata <- read_csv(raw_data, skip = skip_n)
print(csvdata)
par(mar = c(5,4,4,4) + 0.1)

plot(csvdata$Position, csvdata$AbsA, type = "l", ylim=c(0,0.6), ylab = "UV OD", xlab = "Position (mm)")

par(new = TRUE)

plot(csvdata$Position, csvdata$SampleFluor, type ="l", col="red",lty=1, xaxt = "na", yaxt = "na",
     ylab = "", xlab = "", )
axis(side=4, col="red", col.axis="red")
mtext("Fluor", side=4, col="red", line = 3)
1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.