Hi there,
I giving my first project using data from work, which I would normally use Excel.
My goal was to create a vector which contained all the column names I would need, dropping necessary variables.
The issue I have encontered is the column names can contain spaces & special characters.
Tried using make.names() to remove spaces and special characters - seemed to work
Based on the new colnames after make.names(), took a glimpse() at the df and using the col names tried to have them saved in a vector, to used to select the desired columns.
df_eta <- read_csv('Shipment_Profile_Reports_ETA_01012017_02182020.csv', col_names = TRUE)
df_etd <- read_csv('Shipment_Profile_Reports_ETD_01012017_02182020.csv', col_names = TRUE)
glimpse(df_eta)
head(df_eta)
glimpse(df_etd)
head(df_etd)
# getting the distinct shipment numbers from both
snum <- full_join(df_eta,df_etd, by = 'Shipment ID') %>%
select('Shipment ID') %>%
distinct()
glimpse(snum)
# Decided to full_join the data
df_all_og <- full_join(df_eta,df_etd)
glimpse(df_all_og)
# Function to remove spaces into "_", not using this for now, using the make.names()
spaceless <- function(x) {colnames(x) <- gsub(" ", "_", colnames(x));x}
df_all <- spaceless(df_all_og)
glimpse(df_all)
# Altering the original column names to what R can use.
# Saving into new dataset
colnames(df_all) <- make.names(colnames(df_all_og))
glimpse(df_all)
# Original names of columns of interest:
# Shipment ID
# Trans
# Mode
# Origin : House Ref (columns range)
# Goods Description : Destination ETA (columns range)
# Added : Direction (column range)
# Total Accural (Recognized+Unrecognized) : Total WIP (Recognized+Unrecognized)
# From the list of columns names, trying to create a vector with only the column names I need for now
# using the already changed "make.names" format.
df_cols <- c(
"Shipment.ID",
"Trans",
"Mode",
"Origin:House_Ref",
"Goods.Description:Destination.ETA",
"Added:Direction",
"Total.Accrual..Recognized.Unrecognized.:Total.WIP..Recognized.Unrecognized."
)
df_cols
df_all <- df_all %>%
select(df_cols)
I ended up getting this error message:
Blockquote Error: Unknown columns
Origin:House_Ref
,Goods.Description:Destination.ETA
,Added:Direction
andTotal.Accrual..Recognized.Unrecognized.:Total.WIP..Recognized.Unrecognized.
Callrlang::last_error()
to see a backtrace
So I not sure what is the best way to handle these column names. The joined dataset "df_all_og" has 149 variables & 43,856 observations. I am trying to get only the observations I believe are pertinent to my analysis.
Also unsure how to proceed and store column rage names in a vector, like "Origin : House Ref " (all columns from Origin to House Ref"
The only work around I can see is to use indexes for the columns, but I've heard repeatedly it is a bad practice so I'm trying to avoid it at all costs.
Could someone please shine some light on best practices when faced with "dirty" column names? All exercises and literature (R for Data Science) have data nice and ready... so this is new for me.
Thank you for your assistance & time.
LF.