joining multiple lines into one

So, I have a CSV that I read into a data frame, and some of its records are split across two or three lines when they should be one.
For example, lines 75, 76, and 77 should be a single line.


How can I merge them into one line, separated by ";" like all the other lines?

Screenshots are all but useless, although this one is unusually legible. The better way to illustrate the problem is with a reprex. See the FAQ.

As near as I can make out, these are the lines that do not begin with an identifier such as R17. That observation, together with rle(), can probably be adapted to the purpose.

All right, I tried pasting an example as code; the column labels didn't come along, though.

	saude[c(70:78), ]
1	430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;30 a 39;RT-PCR;27/11/2020;26/11/2020;02/12/2020;;RECUPERADO;NAO;SIM;NAO;NAO;NAO;NAO;;NAO;;10/12/2020;BRANCA;NAO ENCONTRADO;NAO;INTERIOR;NAO;E-SUS;BRASIL;NAO
2	430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;20 a 29;RT-PCR;04/12/2020;25/11/2020;08/12/2020;;RECUPERADO;NAO;NAO;NAO;SIM;SIM;NAO;;NAO;;09/12/2020;BRANCA;NAO ENCONTRADO;NAO;VILA RECH;NAO;E-SUS;BRASIL;NAO
3	430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;15 a 19;TESTE RÁPIDO;27/11/2020;20/11/2020;02/12/2020;;RECUPERADO;NAO;NAO;NAO;SIM;NAO;SIM;;NAO;;04/12/2020;BRANCA;NAO ENCONTRADO;NAO;INTERIOR;NAO;E-SUS;BRASIL;NAO
4	430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;20 a 29;TESTE RÁPIDO;07/12/2020;03/12/2020;09/12/2020;;RECUPERADO;NAO;NAO;NAO;SIM;NAO;NAO;;NAO;;17/12/2020;BRANCA;NAO ENCONTRADO;NAO;CENTRO;NAO;E-SUS;BRASIL;NAO
5	430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Masculino;50 a 59;RT-PCR;01/12/2020;26/11/2020;08/12/2020;;RECUPERADO;NAO;SIM;SIM;NAO;SIM;NAO;;NAO;;10/12/2020;BRANCA;NAO ENCONTRADO;NAO;INTERIOR;NAO;E-SUS;BRASIL;NAO
6	430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Masculino;60 a 69;TESTE RÁPIDO;07/12/2020;23/11/2020;08/12/2020;19/12/2020;RECUPERADO;SIM;NAO;SIM;NAO;SIM;SIM;Doença Cardiovascular Crônica
7	 Diabetes mellitus
8	 Obesidade;NAO;;;BRANCA;NAO ENCONTRADO;NAO INFORMADO;INTERIOR;SIM;SIVEP HOSP;BRASIL;NAO
9	430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Masculino;40 a 49;RT-PCR;04/12/2020;01/11/2020;16/12/2020;;RECUPERADO;NAO;SIM;NAO;SIM;NAO;NAO;;NAO;;15/11/2020;BRANCA;NAO ENCONTRADO;NAO;CENTRO;NAO;E-SUS;BRASIL;NAO

I was thinking that every correct line starts with a number and every wrong one doesn't, so I could use that somehow to stick the wrong lines together, adding the ";" in the process.
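That idea can be sketched as a short awk program (a sketch only, assuming the raw file is named saude.txt; any line that doesn't open with a digit gets glued onto the previous one):

```shell
# Sketch: a line that does not start with a digit is treated as a
# continuation of the previous record and glued on with a ";".
# Caveat: this inflates the field count wherever the wrap split what
# was really a single field.
awk 'NR == 1  { buf = $0; next }
     /^[0-9]/ { print buf; buf = $0; next }
              { buf = buf ";" $0 }
     END      { print buf }' saude.txt
```

Whether ";" is the right glue is exactly the open question, as the continuation lines may be pieces of one field rather than separate fields.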

Thanks. As often happens, there is more than one problem. Here, it is that lines 6-8, the ones we want to fix, would end up with more semicolon-delimited variables than we'd like once joined.

I'll use saude.txt as the name of the file from which the data come. From a terminal session (not the console)

awk -v FS=";" '{print NR,NF}' saude.txt
1 30
2 30
3 30
4 30
5 30
6 19
7 1
8 12
9 30

shows this, and it looks as though the correct number of fields is 30. As a first step, we want to isolate those lines from the incorrect ones

awk -v FS=";" '{if (NF == 30) print}' saude.txt
430005;AGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;30 a 39;RT-PCR;27/11/2020;26/11/2020;02/12/2020;;RECUPERADO;NAO;SIM;NAO;NAO;NAO;NAO;;NAO;;10/12/2020;BRANCA;NAO ENCONTRADO;NAO;INTERIOR;NAO;E-SUS;BRASIL;NAO
430005;AGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;20 a 29;RT-PCR;04/12/2020;25/11/2020;08/12/2020;;RECUPERADO;NAO;NAO;NAO;SIM;SIM;NAO;;NAO;;09/12/2020;BRANCA;NAO ENCONTRADO;NAO;VILA RECH;NAO;E-SUS;BRASIL;NAO
430005;AGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;15 a 19;TESTE RAPIDO;27/11/2020;20/11/2020;02/12/2020;;RECUPERADO;NAO;NAO;NAO;SIM;NAO;SIM;;NAO;;04/12/2020;BRANCA;NAO ENCONTRADO;NAO;INTERIOR;NAO;E-SUS;BRASIL;NAO
430005;AGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;20 a 29;TESTE RAPIDO;07/12/2020;03/12/2020;09/12/2020;;RECUPERADO;NAO;NAO;NAO;SIM;NAO;NAO;;NAO;;17/12/2020;BRANCA;NAO ENCONTRADO;NAO;CENTRO;NAO;E-SUS;BRASIL;NAO
430005;AGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Masculino;50 a 59;RT-PCR;01/12/2020;26/11/2020;08/12/2020;;RECUPERADO;NAO;SIM;SIM;NAO;SIM;NAO;;NAO;;10/12/2020;BRANCA;NAO ENCONTRADO;NAO;INTERIOR;NAO;E-SUS;BRASIL;NAO
430005;AGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Masculino;40 a 49;RT-PCR;04/12/2020;01/11/2020;16/12/2020;;RECUPERADO;NAO;SIM;NAO;SIM;NAO;NAO;;NAO;;15/11/2020;BRANCA;NAO ENCONTRADO;NAO;CENTRO;NAO;E-SUS;BRASIL;NAO
awk -v FS=";" '{if (NF == 30) print}' saude.txt > good.txt

and put the others aside

430005;AGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Masculino;60 a 69;TESTE RAPIDO;07/12/2020;23/11/2020;08/12/2020;19/12/2020;RECUPERADO;SIM;NAO;SIM;NAO;SIM;SIM;Doena Cardiovascular Clinica
 Diabetes mellitus
 Obesidade;NAO;;;BRANCA;NAO ENCONTRADO;NAO INFORMADO;INTERIOR;SIM;SIVEP HOSP;BRASIL;NAO

awk -v FS=";" '{if (NF != 30) print}' saude.txt > bad.txt

Now a decision must be made about which variable in the last line is surplus. Then we can work on fixing it.
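Under the assumption that every record should have exactly 30 fields and that the wrapped fragments all belong to one field, the repair could be sketched like this (the space-as-glue choice is a guess; a comma might be truer to the original data):

```shell
# Rebuild the wrapped records: strip each fragment's leading
# whitespace, glue fragments together with a single space (so a
# wrapped comorbidity list collapses into one field), and emit the
# record once it reaches the expected 30 fields.
awk -v want=30 '
  { sub(/^[ \t]+/, "")
    buf = (buf == "" ? $0 : buf " " $0)
    if (split(buf, parts, ";") >= want) { print buf; buf = "" } }
  END { if (buf != "") print buf }
' bad.txt
```

The fixed records can then be concatenated with good.txt to recover the full data set.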

Interesting! Thank you. I didn't know the awk command, and I don't know if I can use it in my RStudio on Windows, but you've shown me the problem.
It's not ideal, but I guess I could just discard any lines that don't have 30 fields. I'm trying to figure out how to do this.
It seems like the error in the data varies; some other lines have different data missing, but they always seem to start with something other than a number, so I was thinking of detecting that to pick which lines to drop:

430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;30 a 39;TESTE RÁPIDO;24/08/2020;13/08/2020;25/08/2020;;RECUPERADO;NAO;NAO;NAO;NAO;NAO;SIM;;NAO;;27/08/2020;BRANCA;NAO ENCONTRADO;NAO;CENTRO;NAO;E-SUS;BRASIL;NAO
430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;50 a 59;RT-PCR;23/08/2020;12/08/2020;25/08/2020;01/09/2020;OBITO;SIM;SIM;SIM;SIM;SIM;NAO;Doença Cardiovascular Crônica
 Obesidade
 Outros;NAO;02/09/2020;;BRANCA;KAINGANG (CAINGANGUE);NAO INFORMADO;INTERIOR;SIM;SIVEP HOSP;BRASIL;NAO
430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Masculino;40 a 49;TESTE RÁPIDO;24/08/2020;13/08/2020;25/08/2020;;RECUPERADO;NAO;NAO;NAO;NAO;NAO;SIM;;NAO;;27/08/2020;BRANCA;NAO ENCONTRADO;NAO;CENTRO;NAO;E-SUS;BRASIL;NAO
430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Masculino;20 a 29;TESTE RÁPIDO;11/08/2020;31/07/2020;25/08/2020;;RECUPERADO;NAO;NAO;SIM;NAO;NAO;SIM;;NAO;;14/08/2020;INDIGENA;KAINGANG (CAINGANGUE);NAO;INTERIOR;NAO;E-SUS;BRASIL;NAO
430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Feminino;30 a 39;RT-PCR;24/08/2020;21/08/2020;28/08/2020;;RECUPERADO;NAO;NAO;NAO;NAO;NAO;SIM;;NAO;;04/09/2020;INDIGENA;KAINGANG (CAINGANGUE);SIM;INTERIOR;NAO;E-SUS;BRASIL;NAO
430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Masculino;60 a 69;TESTE RÁPIDO;27/08/2020;17/08/2020;28/08/2020;;RECUPERADO;NAO;NAO;NAO;NAO;NAO;SIM;;NAO;;31/08/2020;INDIGENA;KAINGANG (CAINGANGUE);NAO;INTERIOR;NAO;E-SUS;BRASIL;NAO
430005;ÁGUA SANTA;14;PASSO FUNDO - R17 R18 R19;Masculino;70 a 79;RT-PCR;27/08/2020;20/08/2020;01/09/2020;19/10/2020;OBITO;SIM;SIM;SIM;NAO;SIM;NAO;Doença Cardiovascular Crônica;NAO;19/10/2020;03/09/2020;BRANCA;NAO ENCONTRADO;NAO INFORMADO;INERIOR;SIM;SIVEP HOSP;BRASIL;NAO
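That detection idea can be sketched directly in awk (an illustration only, assuming the raw file is saude.txt; note that dropping just the lines that don't start with a number would leave each truncated lead line behind, so the field count is checked as well):

```shell
# Keep only complete records: lines that start with a digit AND carry
# the full 30 semicolon-separated fields. This drops the wrapped
# fragments and the truncated lead line they belong to.
awk -v FS=";" '/^[0-9]/ && NF == 30' saude.txt > good.txt
```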

This can be done within the RStudio terminal pane. You can run the awk scripts there, using the first one to discover how many records need to be discarded (if too many, you'll have to address repairing them instead) and the second to create the good.txt file to import into R in place of the original.

I normally wouldn't suggest this pre-processing, non-R approach, but it is vastly simpler than doing the same operation within R.

As for Windows, I left it long ago. I think you need PowerShell(?) installed, and possibly to add awk separately.

I'm trying to use grepl() or str_starts(); still figuring it out.

What happens if you just do

dat <- read.csv("brazil.csv", sep = ";", header = FALSE)

It looks to me as if you have a normal .csv file with some wrap-around.

The header = FALSE option may or may not be appropriate.


Wow, I think you just solved it all. I can't find a single defective line with your method.
I had used a whole other method to separate things that kept the defect, but it was as simple as that. You're also right about the FALSE: in my case it needs to be TRUE.
Thanks! Now I can fail at the actual data mining instead of the formatting.


:smile:

Best of luck.

By the way, you may want to use read.csv2() if you have decimal values in the data set. read.csv() assumes that "." is the decimal marker, so we have pi = 3.1416; read.csv2() instead assumes ";" as the separator and "," as the decimal marker. I don't know your locale, but would you write pi = 3,1416?


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
