Create a list of column names

Hoping I can get some help here. I'm teaching myself R with some background in vbScript & Powershell. I'm trying to read in a csv file, pull the column names, massage them so that they match pre-defined requirements and then recreate the csv file with the new column names. I'm not looking for someone to write the script but the point I'm struggling with is when I create a dataframe or even and empty variable it seems as though there are characters in it. what I'm getting currently is [1] "" and when I write the variable to the csv I get that added to the file which screws up everything.

Here is the code that I'm working with and like I said I'm not looking for someone to write the script, just an explanation as to how to create an empty list, dataframe or variable that I can use.

rm(tmpcolnames)
partcolnamesDF <- list(read.csv("/home/david/Dropbox/TUN/new_participant_colnames.csv",header=FALSE, sep = ",",stringsAsFactors = FALSE))
tmpcolnames <- ""

#as.character(tmpcolnames[0])
#nchar(as.character(tmpcolnames[0]))
#newrow <- seq(39)
#r <- 0
#print(partcolnamesDF)
#print(tmpcolnames)
for (colname in partcolnamesDF)
{
  if(r < 1){
    tmpcolnames <- paste(tmpcolnames, tolower(colname), sep="")
  }
  else {
    tmpcolnames <- paste(tmpcolnames, tolower(colname), sep=",")
  }
  r <- r +1

}

print(tmpcolnames)

#  write.table(partcolnames, file = "/home/david/Dropbox/TUN/new_participant_lcolnames.txt", row.names=FALSE, sep=",")
write.csv(tmpcolnames, file = "/home/david/Dropbox/TUN/new_participant_lcolnames.txt")
#str(partcolnames)
#print(tmpcolnames)

```R
This is the output to the csv file

"","x"
"1",",security_category_name"
"2",",fiscal_year"
"3",",internal_event_name"
"4",",event_date"
"5",",participation_type_name"
"6",",team_name"
"7",",team_creation_date"
"8",",team_division"
"9",",team_id"
"10",",contact_id"
"11",",member_id"
"12",",participant_accept_email"
"13",",registration_date"
"14",",registration_active_status"
"15",",is_team_captain"
"16",",is_secondary_registration"
"17",",is_prior_participant"
"18",",emails_sent"
"19",",total_of_all_confirmed_gifts($)"
"20",",total_from_participant($)"
"21",",total_not_from_participant($)"
"22",",number_from_participant"
"23",",number_not_from_participant"
"24",",participant_email_status"
"25",",participant_employer"
"26",",participant_occupation"
"27",",participant_connection_to_ms"
"28",",address_participant_state/province"
"29",",address_participant_county"
"30",",address_participant_city"
"31",",address_participant_zip/postal_code"
"32",",registration_type"
"33",",event_id"
"34",",participant_gender"
"35",",participant_goal($)"
"36",",suggested_participant_goal($)"
"37",",source_code_type"
"38",",source_code_text"
"39",",sub_source_code_text"

Hi,
May be you should use tmpcolnames = NULL, instead of assigning "". This will create a empty variable.
Thank you!

Hi! Welcome!

You seem to be doing a very common thing, which is to approach a problem in a new language the way you would solve it in the language you already know. This is an understandable strategy, but it can lead you to do things in an unnecessarily roundabout fashion.

Here's a typical way to approach this task in R:

# This line is only necessary to set up the example CSV!
write.csv(head(iris), "iris.csv", row.names = FALSE)

# Read in a CSV
iris_data <- read.csv("iris.csv", header = TRUE, stringsAsFactors = FALSE)

iris_data
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

# Perform manipulations on column names. Here, we'll convert 
# column names to lowercase and replace periods with underscores
names(iris_data) <- gsub("\\.", "_", tolower(names(iris_data)))

# Write a CSV with updated column names
write.csv(iris_data, "iris2.csv", row.names = FALSE)

Created on 2018-08-13 by the reprex package (v0.2.0).

Here's what iris2.csv looks like:

"sepal_length","sepal_width","petal_length","petal_width","species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"
5.4,3.9,1.7,0.4,"setosa"

I'm afraid I don't follow this part. I don't see any zero-length strings getting added to your CSV. Instead, your CSV output has row names, but that's just because write.csv() creates row names by default unless you set the row.names parameter to FALSE.

5 Likes

jcblum, thank you for this information, it worked very well. I like the fact that you can do multiple operations on the names all in the same command.

David

You’re welcome! Yes, the abundance of vectorized operations is a really nice thing about R. Getting a handle on how to take advantage of vectorization and all the stuff you get “for free” when using data frames are IMO two top things to focus on when learning how to write idiomatic R, coming from something like VBscript — there’s an awful lot of boilerplate code that you may be used to which you don’t have to write in R :relieved:.

4 Likes

What book(s) would you recommend for learning R?

I would recommend R for Data Science. This gives you a great introduction to R and the tidyverse (which is a collection of packages to make your life easier).

@jcblum, playing with the code you posted and I've run into a stumbling block. My original column names look like this

...Is Prior Participant,Emails Sent,Total of All Confirmed Gifts($),Total From Participant($),Total Not From Participant($),Number From Participant,Number Not From Participant,Participant Email Status,Participant Employer,Participant Occupation,Participant Connection to MS,Address  -  Participant State/Province,Address  -  Participant County,Address  -  Participant City,Address -  Participant ZIP/Postal Code,Registration Type,Event ID,Participant Gender,Participant Goal($),Suggested Participant Goal($),Source Code Type,Source Code Text,Sub Source Code Text

The first couple are OK but when we get to Total of All Confirmed Gifts($) the read.csv command converts the spaces to "." and also the ($) at the end, leaving me with three dots at the end of the name. The code you sent changes all dots to underscores, creating the column names listed below.

"is_prior_participant","emails_sent","total_of_all_confirmed_gifts___","total_from_participant___","total_not_from_participant___","number_from_participant","number_not_from_participant","participant_email_status","participant_employer","participant_occupation","participant_connection_to_ms","address_____participant_state_province","address_____participant_county","address_____participant_city","address____participant_zip_postal_code","registration_type","event_id","participant_gender","participant_goal___","suggested_participant_goal___","source_code_type","source_code_text","sub_source_code_text"

I got past that issue with the following code.

names(participantDF) <- gsub("\\.","_", names(participantDF))
names(participantDF) <- gsub("\\_____","_", names(participantDF))
names(participantDF) <- gsub("\\____","_", names(participantDF))
names(participantDF) <- gsub("\\___","_", names(participantDF))
names(participantDF) <- gsub("\\__","_", tolower(names(participantDF)))

The leaves me with just a few column names with a trailing underscore. I've looked through some of the docs but I can't seem to find the right combination of commands that will allow me to remove the trailing underscore. Any help would be greatly appreciated.

David

There is this useful RStudio add-in that might be helpful for you:

Specifically, your situation can be resolved with anchors. To make regex look at the end of the string you should use $ symbol. So it'll look something like this: gsub("_$", <rest_of_the_command>).

2 Likes

@mishabalyasin, thank you, I will look into it.

David