CSV file has multiple rows that should be one row

Hello,

I'm currently trying to read a CSV file of model output data into R. However, the file has rows that should all be one row (picture included to clarify what I mean).


I know that there aren't more column variables than Excel can handle. How would I convert each multi-row section into a single row so that I can create a neat data frame?

This is what comes from using Excel as a relational database, I fear.

The screenshot isn't quite legible enough for me to make out the layout. Assuming it's something like

x1, x2, x3 ... x16284
1,2,3...
786,787,788, ...
14282,14383,14384
...

There's not much that can be done in R without turning somersaults with one-and-a-half gainers. If the number of extra row breaks were consistent, what I could do is write a script in sed, Ruby, Python, etc., to join the lines.

If that's not on the table, try dividing the Excel table into bite-size chunks that can be glued back together with cbind.
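For example (the file names are hypothetical), if the wide table were exported as a handful of narrower CSVs with the same number of rows, they could be stitched back together with something like:

# column-bind a set of narrower exports back into one wide data frame
wide <- do.call(cbind, lapply(c("chunk1.csv", "chunk2.csv"), read.csv))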

If, however, the longish rows are just wrapping and divided by newlines (\n), what is the problem that you are seeing on import?

Try read.csv with the sep parameter set to " ".
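Something like this, if the fields really are space-separated (the file name is a placeholder):

df <- read.csv("model_output.csv", sep = " ", header = TRUE)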


Hi @Larebear08, the column headers look like they might be separated by tabs -- do you know if that's true?


Looks messy, and the screenshot makes it difficult to help. If you know the number of columns that should be there, and you're sure there isn't missing data, fixing this is doable, if a bit ugly.

R only

If you're stuck using R only, I would go about it in the following way using the readr library (a rough sketch follows the list):

  1. Read the entire file into a character vector (split into single characters).
  2. Use gsub to replace tabs with commas.
  3. Use gsub to replace newlines and carriage returns with the empty string or a comma. I can't tell which is appropriate from the screenshot.
  4. Use which(input %in% ",") to find the indices of all the commas.
  5. Create a list of start and stop locations from the indices and the known number of columns you should have. E.g., if you know there should be 24 columns, get the indices of every 24th comma.
  6. Use substr with mapply and the start and stop locations to make a list of rows.
  7. Use lapply and a function to remove the trailing comma from each character vector in the list.
  8. Write the list of lines to a csv file as an intermediate output so you don't have to do the above again.
  9. paste the lines together with a "\n" separator.
  10. Pass the pasted-together munged input to read_csv.
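
Here's a rough, untested sketch along those lines. The column count (24) and the file names are placeholders for whatever your data actually uses, and it regroups the fields with split() instead of doing the substr/mapply index bookkeeping by hand, which comes to the same thing:

library(readr)

n_cols <- 24                                    # known number of columns (placeholder)
raw <- read_file("model_output.csv")            # 1. whole file as one string
raw <- gsub("\t", ",", raw)                     # 2. tabs -> commas
raw <- gsub("[\r\n]+", ",", raw)                # 3. stray line breaks -> commas

fields <- strsplit(raw, ",", fixed = TRUE)[[1]] # split into individual fields
fields <- fields[fields != ""]                  # drop empties left by trailing commas

# 4-7. regroup the fields into rows of n_cols each
rows  <- split(fields, ceiling(seq_along(fields) / n_cols))
lines <- vapply(rows, paste, character(1), collapse = ",")

write_lines(lines, "rebuilt.csv")               # 8. intermediate output
df <- read_csv("rebuilt.csv")                   # 9-10. back into a data frame

Note that the first rebuilt line becomes the header, so this only works if the header row was wrapped the same way as the data rows.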

Bash

If you're willing to try to unbork this on the command line, the following assumes the header columns are tab-separated, the rows are comma-separated, and there is no trailing comma or tab at the end of the rows.

#!/usr/bin/env bash

# Replace the 20 below with your number of columns.
# This builds a delimiter list of 19 commas followed by a newline, so
# paste starts a new row after every 20th field.
DELIMITER_LIST=$(printf ',%.0s' {2..20})"\n"
# Put every field on its own line, then rejoin them 20 at a time.
cat input_file.csv | tr "\t," "\n" | paste -s -d"${DELIMITER_LIST}" > output_file.csv

I can't recall off the top of my head how to stuff a variable into that printf subshell, so just replace the 20 with the number of columns that should be in your data.

Create a reproducible example

I suggest making a smaller subset of the input data that's more tractable to experiment with. Put it up on Pastebin or as a GitHub gist or something. Check that it correctly replicates the problem or errors you're experiencing, and post a link to the subset.


If you can paste an example, just a few rows will do.
