Handling Column names from DF with spaces.

luisferlante · March 25, 2020, 8:24pm

Hi there,

I giving my first project using data from work, which I would normally use Excel.
My goal was to create a vector which contained all the column names I would need, dropping necessary variables.

The issue I have encontered is the column names can contain spaces & special characters.
Tried using make.names() to remove spaces and special characters - seemed to work
Based on the new colnames after make.names(), took a glimpse() at the df and using the col names tried to have them saved in a vector, to used to select the desired columns.

df_eta <- read_csv('Shipment_Profile_Reports_ETA_01012017_02182020.csv', col_names = TRUE)
df_etd <- read_csv('Shipment_Profile_Reports_ETD_01012017_02182020.csv', col_names = TRUE)

glimpse(df_eta)
head(df_eta)
glimpse(df_etd)
head(df_etd)

# getting the distinct shipment numbers from both
snum <-  full_join(df_eta,df_etd, by = 'Shipment ID') %>% 
  select('Shipment ID') %>% 
  distinct()
glimpse(snum)

# Decided to full_join the data
df_all_og <- full_join(df_eta,df_etd)

glimpse(df_all_og)


# Function to remove spaces into "_", not using this for now, using the make.names() 

spaceless <- function(x) {colnames(x) <- gsub(" ", "_", colnames(x));x}
df_all <- spaceless(df_all_og)
glimpse(df_all)
            


# Altering the original column names to what R can use.
# Saving into new dataset
colnames(df_all) <- make.names(colnames(df_all_og))
glimpse(df_all)


# Original names of columns of interest:
# Shipment ID
# Trans
# Mode
# Origin : House Ref (columns range)
# Goods Description : Destination ETA (columns range)
# Added : Direction (column range)
# Total Accural (Recognized+Unrecognized) : Total WIP (Recognized+Unrecognized)

# From the list of columns names, trying to create a vector with only the column names I need for now
# using the already changed "make.names" format.
df_cols <- c(
  "Shipment.ID",
  "Trans",
  "Mode",
  "Origin:House_Ref",
  "Goods.Description:Destination.ETA",
  "Added:Direction",
  "Total.Accrual..Recognized.Unrecognized.:Total.WIP..Recognized.Unrecognized."
)
df_cols

df_all <-  df_all %>% 
  select(df_cols)

I ended up getting this error message:

Blockquote Error: Unknown columns Origin:House_Ref, Goods.Description:Destination.ETA, Added:Direction and Total.Accrual..Recognized.Unrecognized.:Total.WIP..Recognized.Unrecognized.
Call rlang::last_error() to see a backtrace

So I not sure what is the best way to handle these column names. The joined dataset "df_all_og" has 149 variables & 43,856 observations. I am trying to get only the observations I believe are pertinent to my analysis.

Also unsure how to proceed and store column rage names in a vector, like "Origin : House Ref " (all columns from Origin to House Ref"

The only work around I can see is to use indexes for the columns, but I've heard repeatedly it is a bad practice so I'm trying to avoid it at all costs.

Could someone please shine some light on best practices when faced with "dirty" column names? All exercises and literature (R for Data Science) have data nice and ready... so this is new for me.

Thank you for your assistance & time.
LF.

technocrat · March 25, 2020, 8:36pm

library(tibble)
## syntactic names are easier to work with, though, and you can request them 
df <- tibble(`a 1` = 1, `a 2` = 2, .name_repair = "universal")
#> New names:
#> * `a 1` -> a.1
#> * `a 2` -> a.2

^{Created on 2020-03-25 by the reprex package (v0.3.0)}

luisferlante · March 25, 2020, 8:55pm

Thank you for your reply!

Pardon my stupidity but I'm not quite sure how to use the information you provided.

Should I force my data to be a tibble and repair the names? Doesn't read_csv() make them tibbles in the first place?
i.e:

df_all <- as.tibble(df_all, . name_repair = "universal")

Thought make.names() was doing that.

I was able to get my vector with the correct character strings which name the columns

colnames(df_all) <- make.names(colnames(df_all_og))

And from that "corrected" column names, I re-wrote the ones I need into a vector:

df_cols <- c(
  "Shipment.ID",
  "Trans",
  "Mode",
  "Origin:House_Ref",
  "Goods.Description:Destination.ETA",
  "Added:Direction",
  "Total.Accrual..Recognized.Unrecognized.:Total.WIP..Recognized.Unrecognized."
)

But then I'm not able to use that vector to select the desired columns from original dataset.

df_all <-  df_all %>% 
  select(df_cols)

Which gives me the previously described error.

technocrat · March 25, 2020, 9:48pm

Sorry, misread the question.

Since df_col has syntactical names, you can just

colnames(df_all) <- df_col

Not sure what df_all_og is.

A suggestion. Save df_col and replace the very long variable names with descriptive names that are as short as possible. It will cut down on typos and you can restore the original column names the same way.

luisferlante · March 26, 2020, 2:20am

Firstly, thank you for your assistance.

Every time I read, I think "damn cool nickname!".

Anyways, I don't think I quite explained well what I was trying to do, because I tried what you suggested and I did not get the expected result. But after working with it a little longer I was able to understand it.

# Load csv:
df_eta <- read_csv('Shipment_Profile_Reports_ETA_01012017_02182020.csv', col_names = TRUE)
df_etd <- read_csv('Shipment_Profile_Reports_ETD_01012017_02182020.csv', col_names = TRUE)

# Decided to full_join the data 
df_all_og <- full_join(df_eta,df_etd)
glimpse(df_all_og)

I look through the colnames of df_all_og to determine what would be most pertinent for what I'm trying to achieve.

The actual colnames(df_all_og) is 149 observations long. Too many, lets clean the "trash".

Below the "..." represents the range of columns I want. hence, I want columns 1,2,4,5,6:13,17:19,31:101,120:127.

> colnames(df_all_og)
  [1] "Shipment ID"                                    
  [2] "Trans"                                                                        
  [4] "Mode"                                           
  [5] "Origin"                                         
  [6] "Origin Ctry"                                    
...                              
 [13] "House Ref"                                                                            
 [17] "Goods Description"                              
...                   
 [19] "Destination ETA"                                                    
 [31] "Added"                                          
...              
[101] "Direction" 
[120] "Total Accrual (Recognized+Unrecognized)" 
...                   
[127] "Total WIP (Recognized+Unrecognized)"

I could just do this:

# Create a vector with columns desired:
df_col_num <- c(1,2,4,5,6:13,17:19,31:101,120:127)
# Synthesize dataset to variables of interest to begin EDA.
df_all <-  df_all_og %>% 
  select(df_col_num)

What I was trying to do was this:

df_all <- df_all_og %>% 
  select(
    `Shipment ID`,
    Trans,
    Mode:`House Ref`,
    `Goods Description`:`Destination ETA`,
    Added:Direction,
    starts_with("Total")
  )

From here I can begin the EDA and use dplyr rename functions to change future subsets of this still "large" variable numbers.

Hope this helps any other newbies.
Cheers.

technocrat · March 26, 2020, 2:30am

Great! Thanks for marking your answer as the solution. Most peep are too shy.

(I've used this handle since 1996!)

luisferlante · March 26, 2020, 2:39am

Honestly it does feel a bit as if I just liked my own photo on Instagram

Damnnn... technocrat has been around!

Well cheers mate! You rock helping out, seriously!
Thanks for the support!

system · April 2, 2020, 2:39am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.