Columns per Row

boyddt · August 19, 2018, 5:10am

Thanx to @jcblum I was able to get the headers sorted out on my csv file. The next issue that I'm trying to sort out is that there 39 variables in this file and some rows are missing commas, resulting in rows with less than 39 variables. The worst I've seen so far is 37 variable which means two commas are missing.

I've looked through the " R Programming for Data Science" book but couldn't find anything that was particularly helpful, doesn't mean it isn't there, I just don't know what I'm looking for. What I would like to do is maybe create an index of the lines that are short columns so that they can be flagged and we can track them down manually. I'm working on a small data set right now and our full data set will have some 3 million rows (in 10 different files) and hunting through that has proven to be impossible.

Thanx in advance for any help offered, not looking for a complete solution just some ideas that I can try.

David

jonspring · August 19, 2018, 5:27am

Here’s a thread from Stack Overflow that might help:

I presume the csv file is loading but there are NAs in some of the rightmost columns? If so, this should help count the NAs so you can sort or filter to see the problematic ones.

boyddt · August 19, 2018, 6:57am

Jon, I verified the files as R seems to be adding NAs randomly at the end of the row but not to fill out the row completely. I'm including two files The first one is the original file "participants_small.csv" and "fixed_participants_small.csv". You'll be able to see what is happening. On some rows R will add a single NA and other rows it'll add two or three. Either way there are still missing columns. One of the
columns "Is prior participant" has NAs as default data.

participantDF <-read.csv("/home/david/Dropbox/TUN/participants_small.csv",header = TRUE, stringsAsFactors = FALSE)

names(participantDF) <- gsub("\\.","_", tolower(names(participantDF)))

write.csv(participantDF, "/home/david/Dropbox/TUN/fixed_Participants_small.csv", row.names = FALSE)

txt <- read.csv("/home/david/Dropbox/TUN/fixed_Participants_small.csv",header = TRUE, sep = ",")

#  print(str(txt))
#  print(names(txt))
#  print(summary(txt))
  print(head(txt,10))

<a class="attachment" href="/uploads/default/original/2X/0/039a7dd9237877bbc2240d815bc76bacbca006c9.pdf">fixed_Participants_small.pdf</a> (26.6 KB)
 <a class="attachment" href="/uploads/default/original/2X/0/0295ba1187c448a2b1e0e6ecdda19f90a889c23e.pdf">participants_small.pdf</a> (23.7 KB)

jcblum · August 19, 2018, 10:28am

Unfortunately, file uploads are limited on this forum for security reasons. If you want to post sample CSVs, the best bet is something like Google Drive, Dropbox, or (my personal favorite) a Github Gist.

You might also try importing using the readr package — it automatically creates a separate dataframe of import problems that it encounters, which should let you find the rows with the missing commas (and any other unexpected data weirdness). You’ll want to read the overview before using: https://readr.tidyverse.org/articles/readr.html

If you’re going to be doing a lot of data munging (and honestly, who isn’t these days?), I also recommend this book: https://www.manning.com/books/beyond-spreadsheets-with-r

boyddt · August 31, 2018, 5:07am

@jcblum, finally have time to get back to my project. I created at Github repository and copied a sanitized version of the data file there. You'll see that lines 4, 7 & 12 have the anomaly I'm trying to fix. Two columns are basically merged and there are one fewer columns in those rows than the others.

Thanx in advance for any help you can offer.

gist.github.com

https://gist.github.com/davidthomasboyd/5906ebc2344c79387b8a367354d45e75

participants_private_small.csv

Fiscal Year,Event Date,Participation Type Name,Team Name,Team Creation Date,Team Division,Team ID,Contact ID,Member ID,Participant Accept Email,Registration Date,Registration Active Status,Is Team Captain,Is Secondary Registration,Is Prior Participant,Emails Sent,Total of All Confirmed Gifts($),Total From Participant($),Total Not From Participant($),Number From Participant,Number Not From Participant,Participant Email Status,Participant Employer,Participant Occupation,Registration Type,Event ID,Participant Gender,Participant Goal($),Suggested Participant Goal($),Source Code Type,Source Code Text,Sub Source Code Text
2016,12/31/16 10:36,Super Cyclist,,,,,12495541,84255063,TRUE,8/31/15 13:38,Active,FALSE,FALSE,N/A,0,0,0,0,0,0,Good,,,Online,26846,Male,1095,1095,,,
2016,9/1/16 13:02,Cyclist,,,,,12314585,81917766,TRUE,8/25/15 13:03,Active,FALSE,FALSE,N/A,0,0,0,0,0,0,Good,,,Online,26886,Female,1095,1095,,,
2016,9/1/16 13:02,Cyclist,'Taylor''s Team',9/17/15 10:30,Friends and Family,45106715440936,90424808,TRUE,9/17/15 10:30,Active,TRUE,FALSE,N/A,0,0,0,0,0,0,Good,,,Online,26886,Female,1095,1095,,,
2016,9/1/16 13:02,Cyclist,Team Kate4Test,8/21/15 9:24,Friends and Family,449427,5990607,57348428,TRUE,8/21/15 9:24,Active,TRUE,FALSE,N/A,0,0,0,0,0,0,Good,National MS Society,,Online,26886,Female,1095,1095,,,
2016,9/1/16 13:02,Cyclist,,,,,14568407,89361923,TRUE,8/19/15 13:21,Active,FALSE,FALSE,N/A,0,0,0,0,0,0,Bad (Hard Bounce),,,Online,26886,Male,1095,1095,,,
2016,9/1/16 13:02,Cyclist,,,,,12770939,82684399,TRUE,8/19/15 16:42,Active,FALSE,FALSE,N/A,9,0,0,0,0,0,UnknownBlackbaud,Nonprofit,Online,26886,Male,1095,1095,6,bikeaka.nationalmssociety.org,/site/TR
2016,9/1/16 13:02,Cyclist,,,,,13719781,85497143,TRUE,9/17/15 10:48,Active,FALSE,FALSE,N/A,0,0,0,0,0,0,Unknown,,,Online,26886,Male,1095,1095,,,
2016,9/1/16 13:02,Cyclist,,,,,14637177,89420037,TRUE,9/17/15 10:23,Active,FALSE,FALSE,N/A,8,0,0,0,0,0,Good,,,Online,26886,Female,250,1095,,,
2016,3/19/16 14:39,Cyclist,,,,,1472671989519866,TRUE,10/29/15 15:59,ActiveFALSE,FALSE,N/A,0,0,0,0,0,0,Bad (Hard Bounce),,,,VA,Female,1095,1095,6,s2.userzoom.com,/s/userzoom.aspx,,

This file has been truncated. show original

jcblum · September 6, 2018, 5:17am

Oof, this is a pretty aggravating type of CSV malformation. What I'm seeing here is that not only are there commas missing, different commas are missing in each problem line.

Since I mentioned using read_csv()'s tools to diagnose this sort of thing, here's an example of doing that:

library(tidyverse)

# read_csv charges ahead despite problems...
participants <- read_csv("participants_private_small.csv")

# ...but it keeps a note of all the weirdness it encounters,
# which we can extract into a parallel data frame
import_probs <- problems(participants)

# I can find out how many rows have problems
length(unique(import_probs$row))

# And interactively examine all the imported rows with problems
View(participants %>% slice(import_probs$row))

As I think you realize, for row 3 (of the data — not counting headers), Team ID and Contact ID got merged. For row 6, Participant Email Status and Participant Employer got merged. For row 11, Event ID and Participant Gender got merged. I figured this out by just reading the data — given the number of fields, varied data types, and number of "natural" NAs in the data, it would be a pretty involved project to identify the source of the problem for each row algorithmically.

So I think it all comes down to how many of these problem rows are in your actual dataset, and whether this is a one-time data-munging step or something that might crop up again and again with new datasets.

If it's a one-time thing and the number of problem rows isn't enormous, honestly I'd just use something like my above workflow to diagnose the problems and edit the CSV directly to fix them. Or better yet, figure out what's going wrong upstream in whatever is generating the CSV in the first place!

If you have lots of data and too many problem rows to fix by hand, can you just drop the problem rows from your project? If you expect that this is going to keep happening in future datasets, then I'd really want to invest my time in preventing the CSV from getting borked in the first place, if at all possible.

rensa · September 6, 2018, 6:10am

Did not know about readr::problems. Very cool!

boyddt · September 7, 2018, 4:58am

jcblum, I agree with you about the status of this data set. It is not ours and we have talked to the client about it but they do not have the resources to deal with it right now. The current data set is about 3.5 million rows an what I pulled out was just some of the rows that had issues. I think we're under 0.005% of total rows with issues. Right now I'm hoping that what you provided will get me moving forward. Being able to count the rows with issues will be great help. I had tried using the problems function(?) but the output was difficult to read. Using some of these other features should help tremendously. Hell of a project to cut my teeth on.

Thank you for your continued support,

David