Separate csv with commas but not names with comma

Lekmonm · July 23, 2022, 3:55am

Hi, I have a CSV file that I need to import into RStudio. The columns are comma separated, but there are names that include the comma: Benzene, 1,3-dimethyl- and I don't need to separate them, however the function separates them:

separate(df, sep =",")

How can I separate the columns according to the comma, but keep the name that contains the comma, without separating it?

This is an extract from de csv file:

Component RT,Compound Name,Component Area,Match Factor,CAS#,Formula,Estimated Conc.
18.6510,"Benzene, 1,3-dimethyl-",1370644.1,98.5,108-38-3,C8H10,

FJCC · July 23, 2022, 4:30am

I made a file with this content

Component RT,Compound Name,Component Area,Match Factor,CAS#,Formula,Estimated Conc.
18.6510,"Benzene, 1,3-dimethyl-",1370644.1,98.5,108-38-3,C8H10,45

and used read.csv to import it successfully.

DF <- read.csv("~/R/Play/Dummy.csv")
DF
#>   Component.RT          Compound.Name Component.Area Match.Factor     CAS.
#> 1       18.651 Benzene, 1,3-dimethyl-        1370644         98.5 108-38-3
#>   Formula Estimated.Conc.
#> 1   C8H10              45

^{Created on 2022-07-22 by the reprex package (v2.0.1)}
What code have you used to import the data?

Lekmonm · July 23, 2022, 4:51am

The same code. Maybe I forgot to mention that my original file, when imported with read_csv, generates 2 columns, so after importing it, I separated it into the columns I needed and it happened that it separated all the text:

library(readr)
library(tidyr)
prueba2 <- read_csv("C:/Users/camil/Desktop/prueba2.csv",
col_names = FALSE, trim_ws = FALSE)
#> Warning: One or more parsing issues, see `problems()` for details
#> Rows: 4 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): X1, X2
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

prueba2
#> # A tibble: 4 × 2
#>   X1                                                                     X2     
#>   <chr>                                                                  <chr>  
#> 1 "Batch Path"                                                           "D:\\M…
#> 2 "Analysis File Name"                                                   "Unk_2…
#> 3 "Component RT"                                                         "Compo…
#> 4 "22.0522,\"Benzene, 1,2,4-trimethyl-\",105223.0,93.7,95-63-6,C9H12,\r"  <NA>

a <- prueba2 %>% 
  separate(X1,
           sep =",",
           c("a1","a2","a3","a4","a5","a6","a7"))
#> Warning: Expected 7 pieces. Additional pieces discarded in 1 rows [4].
#> Warning: Expected 7 pieces. Missing pieces filled with `NA` in 3 rows [1, 2, 3].

a
#> # A tibble: 4 × 8
#>   a1                 a2          a3    a4    a5               a6     a7    X2   
#>   <chr>              <chr>       <chr> <chr> <chr>            <chr>  <chr> <chr>
#> 1 Batch Path          <NA>        <NA> <NA>   <NA>            <NA>   <NA>  "D:\…
#> 2 Analysis File Name  <NA>        <NA> <NA>   <NA>            <NA>   <NA>  "Unk…
#> 3 Component RT        <NA>        <NA> <NA>   <NA>            <NA>   <NA>  "Com…
#> 4 22.0522            "\"Benzene" " 1"  2     "4-trimethyl-\"" 10522… 93.7   <NA>

This is part of the csv file that I used:

Batch Path,D:\MassHunter\GCMS\1\data\2022\julio\08072022
Analysis File Name,Unk_2.uaf

Component RT,Compound Name,Component Area,Match Factor,CAS#,Formula,Estimated Conc.
22.0522,"Benzene, 1,2,4-trimethyl-",105223.0,93.7,95-63-6,C9H12,

FJCC · July 23, 2022, 5:29am

The file includes column names so you should not have col_names set to FALSE. Try

prueba2 <- read_csv("C:/Users/camil/Desktop/prueba2.csv",
col_names = TRUE, trim_ws = FALSE)

Lekmonm · July 24, 2022, 12:37am

I used your code, but is the same.

library(readr)
library(tidyr)
prueba2 <- read_csv("C:/Users/camil/Desktop/prueba2.csv",
                    col_names = TRUE, 
                    trim_ws = FALSE)
#> Warning: One or more parsing issues, see `problems()` for details
#> Rows: 3 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): Batch Path, D:\MassHunter\GCMS\1\data\2022\julio\08072022
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

prueba2
#> # A tibble: 3 × 2
#>   `Batch Path`                                                  `D:\\MassHunte…`
#>   <chr>                                                         <chr>           
#> 1 "Analysis File Name"                                          Unk_2.uaf       
#> 2 "Component RT"                                                Compound Name,C…
#> 3 "22.0522,\"Benzene, 1,2,4-trimethyl-\",105223.0,93.7,95-63-6… <NA>

I found an expression that allows me to identify only the commas to separate my text in column, without considering the commas inside the quotes, but I am not clear how to use it to separate my columns (it seems to me that it is for Java): RegExr: Learn, Build, & Test RegEx

FJCC · July 24, 2022, 6:01am

In the example you provided earlier, the first two column names were Component RT and Compound Name. Looking at your print out of prueba, those appear on the third line

#> # A tibble: 3 × 2
#>   `Batch Path`                                                  `D:\\MassHunte…`
#>   <chr>                                                         <chr>           
#> 1 "Analysis File Name"                                          Unk_2.uaf       
#> 2 "Component RT"                                                Compound Name,C…
#> 3 "22.0522,\"Benzene, 1,2,4-trimethyl-\",105223.0,93.7,95-63-6… <NA>

The column names are Batch Path and something that starts with D:\MassHunte. It looks like you have two lines of meta data before the actual data in the file. If that is true, you should set the skip argument to 2.

prueba2 <- read_csv("C:/Users/camil/Desktop/prueba2.csv",
col_names = TRUE, trim_ws = FALSE, skip = 2)

Have you looked at the csv file with a plain text editor to see what its structure is?

Lekmonm · July 24, 2022, 2:27pm

Thanks, but I need to use all the information in the csv file:

Batch Path,D:\MassHunter\GCMS\1\data\2022\julio\08072022
Analysis File Name,Unk_2.uaf
Analyst Name,LAC01,Analysis Time,7/11/2022 4:25:24 PM
File Name,080722 12.D,Path Name,D:\MassHunter\GCMS\1\data\2022\julio\08072022
Sample Name,LS-137-22,Type,Sample
Acq. Method File,VOC_PAMS_Scott_SCAN,Acq. Method Path,D:\MassHunter\GCMS\1\methods\
Acq. Date-Time,7/8/2022 10:54:14 PM,Acq. Operator,MOC
Instrument Name,LAC01,Dil.,1
Component RT,Compound Name,CAS#,Formula,Component Area,Match Factor,Estimated Conc.
7.3114,"Propanal, 2-methyl-",78-84-2,C4H8O,70086.5,92.8,
,,,,,,
8.1251,Butanal,123-72-8,C4H8O,192912.5,97.5,
,,,,,,
10.1817,"Silanediol, dimethyl-",1066-42-8,C2H8O2Si,601151.0,98.4,
,,,,,,
10.6182,Benzene,71-43-2,C6H6,302109.7,96.1,
,,,,,,
Sample Chromatogram

I'm writing a code who read the dataframe and create a new clean df, like this image:

Thank you for your time.

MonkmanMH · July 28, 2022, 1:54pm

Hi @Lukmonm!
This is quite the challenge that you're confronting...the more questions that are asked, the more complex the problem seems to be. As you have seen, the {readr} function read_csv() will correctly parse a text value if that value is enclosed within quotation marks, which was your original question.
As @FJCC asked in their most recent reply, it may be an issue with the structure of the original CSV file.
In your last post you posted an image that hints at the fact that perhaps you have the file read correctly into Excel...is that true?
In that same post, you've pasted some of the data from the CSV file, but it almost seems as if the components of what should be a single record (row) are on multiple rows, separated by other records.
It would be very helpful if you were able to make available a file with a few records (similar to the image you posted), as the original CSV file (or, as hinted at by the image, an Excel format file).

Lekmonm · August 17, 2022, 7:34pm

Thank you very much. I managed to solve the problem thanks to a colleague who helped me with this code:

mutate("Compound Name" = str_remove_all(str_match(orig,'\"\\s*(.*?)\\s*\"')[,1],'"'),orig = str_remove_all(orig,',\"\\s*(.*?)\\s*\"'))

it was that structure ('\"\\s*(.*?)\\s*\"')[,1],'"') that allowed me to identify the instances of comma-separated text that should be in a column.

system · September 7, 2022, 7:35pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.